My Journey to a reliable and enjoyable locally hosted voice assistant

Original link: https://community.home-assistant.io/t/my-journey-to-a-reliable-and-enjoyable-locally-hosted-voice-assistant/944860

## Local Home Assistant Voice Control: A Deep Dive

This post documents one user's experience replacing Google Home with a fully local Home Assistant voice control system built on Assist, llama.cpp, and custom hardware. Motivated by Google's declining performance and by privacy concerns, the user set out to find a reliable, private, and feature-rich alternative. The setup consists of a Beelink mini PC with a USB4 eGPU (RTX 3050-3090 and RX 7900XTX/9060XT were tested; faster GPUs yield faster responses), Home Assistant running on UnRaid, and Voice Preview Edition satellite devices. A key to success was moving beyond the default Ollama models to HuggingFace GGUF models with higher quantization, which improved tool calling. Considerable effort went into building a detailed LLM prompt (shared via Gist) to get accurate weather forecasts, business information, general knowledge, and music playback through integrations such as `llm-intents`. A custom wake word ("Hey Robot") was trained using a dedicated repository. Though complex, the end result is a highly customizable, responsive local voice assistant that meets the user's needs and offers better privacy and reliability than their previous cloud solution. The user stresses that this is not a plug-and-play solution and requires patience and research, but it offers a path to a powerful, locally controlled smart-home experience.

The accompanying Hacker News discussion centers on the current state of voice assistants and user experience. The poster shares their experience building a reliable, locally hosted voice assistant with Home Assistant. Commenters express a mix of anticipation and skepticism. One user is still waiting for the advances promised in OpenAI's 2024 demo, noting a lack of visible progress. Others question the general appeal of voice assistants, arguing they are slower and clumsier than direct interaction, while acknowledging their value for accessibility. Some, however, *do* find them useful, especially for quick tasks like setting reminders and calendar events when a computer is not within reach. A key point is the potential of "context awareness": newer assistants that understand their environment (for example, knowing which light is being referred to), going beyond existing options such as Siri.

Original post

I have been watching Home Assistant’s progress with Assist for some time. We previously used Google Home via Nest Minis, and have switched to fully local Assist backed by local-first processing plus llama.cpp (previously Ollama). In this post I will share the steps I took to get to where I am today, the decisions I made, and why they were the best for my use case specifically.

Links to Additional Improvements

Here are links to additional improvements posted about in this thread.

New Features

Fixing Unwanted HA / LLM Behaviors

Optimizing Performance

Hardware Details

I have tested a wide variety of hardware, from a 3050 to a 3090. Most modern discrete GPUs can be used for local Assist effectively; what hardware is required just depends on your expectations of capability and speed.

I am running Home Assistant on my UnRaid NAS; its specs are not really important, as the NAS has nothing to do with HA Voice performance.

Voice Hardware:

  • 1 HA Voice Preview Satellite
  • 2 Satellite1 Small Squircle Enclosures
  • 1 Pixel 7a used as a satellite/ hub with View Assist

Voice Server Hardware:

GPUs

The below table shows GPUs that I have tested with this setup. Response time will vary based on the model that is used.

| GPU | Model Class | Response Time (after prompt caching) | Notes |
|---|---|---|---|
| RTX 3090 24GB | 20B-30B MoE, 9B Dense | 1 - 2 seconds | Efficiently and quickly runs models that are optimal for this setup. |
| RX 7900XTX 24GB | 20B-30B MoE, 9B Dense | 1 - 2 seconds | Efficiently and quickly runs models that are optimal for this setup. |
| RTX 5060Ti 16GB | 20B MoE, 9B Dense | 1.5 - 3 seconds | Quick enough to run models that are optimal for this setup, with responses < 3 seconds. |
| RX 9060XT 16GB | 20B MoE, 9B Dense | 1.5 - 4 seconds | Quick enough to run models that are optimal for this setup, with responses < 4 seconds. |
| RTX 3050 8GB | 4B Dense | 3 seconds | Good for running small models with basic functionality. |

Models

The below table shows the models I have tested using this setup with various features and their performance.

All models below are good for basic tool calling. Advanced features are listed with each model's reliability at reproducing the desired behavior.

(1) Handles commands like “Turn on the fan and off the lights”
(2) Understands when it is in a particular area and does not ask “which light?” when there is only one light in the area, but does correctly ask when there are multiple of the device type in the given area.
(3) Is able to parse misheard commands (ex: “turn on the pan”) and reliably execute the intended command
(4) Is able to reliably ignore unwanted input while still correctly acting on misheard text that was an intended command.

Voice Server Software:

Model Runner:

llama.cpp is recommended for optimal performance, see my reply below for details.
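For context, llama.cpp's `llama-server` exposes an OpenAI-compatible `/v1/chat/completions` endpoint, which is what lets HA-style integrations send tool definitions alongside the conversation. Below is a minimal sketch of what such a request payload looks like; the endpoint URL, `turn_on` tool name, and parameter values are illustrative assumptions, not the integration's actual wiring:

```python
import json

# Assumed default: llama-server serves an OpenAI-compatible API on port 8080.
LLAMA_SERVER_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(system_prompt, user_text, tools=None):
    """Build an OpenAI-style chat-completion payload for llama-server."""
    payload = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        "temperature": 0.2,  # low temperature keeps tool calls more deterministic
        "stream": False,
    }
    if tools:
        payload["tools"] = tools  # OpenAI function-calling schema
    return payload

# Hypothetical tool definition, loosely modeled on a turn-on intent.
turn_on_tool = {
    "type": "function",
    "function": {
        "name": "turn_on",
        "description": "Turn on a device or entity by name",
        "parameters": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
        },
    },
}

req = build_chat_request("You are a voice assistant.", "Turn on the fan",
                         tools=[turn_on_tool])
print(json.dumps(req, indent=2))
```

The model responds either with plain text or with a `tool_calls` entry naming the function to run; the caller executes it and feeds the result back as a follow-up message.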

Speech to Text (Voice In):

The following are Speech to Text options that I have tested:

| Software | Model | Notes |
|---|---|---|
| Wyoming ONNX ASR | Nvidia Parakeet V2 | Specifically running via the OpenVINO branch, which optimizes CPU inference time down to ~0.3 seconds |
| Rhasspy Faster Whisper | Nvidia Parakeet V2 | Slower, since it runs directly via ONNX on CPU, which is slower than OpenVINO |

Text to Speech (Voice Out):

| Software | Notes |
|---|---|
| Kokoro TTS | Provides the ability to mix and match multiple voices / tones to get the desired output. Handles all text well. |
| Piper running on CPU (TTS) | Has multiple voices to pick from; works for general text but struggles with currency, phone numbers, and addresses. |

Home Assistant LLM Integrations

  • LLM Conversation improves the base conversation pipeline for a better default experience talking with Assist
  • LLM Intents to provide additional tools for Assist (Web Search, Place Search, Weather Forecast)

The Journey

My point in posting this is not to suggest that what I have done is “the right way” or even something others should replicate. But I learned a lot throughout this process and I figured it would be worth sharing so others could get a better idea of what to expect, pitfalls, etc.

The Problem

Throughout the last year or two we have noticed that Google Assistant through these Nest Minis has gotten progressively dumber / worse while also not bringing any new features. This is generally fine as the WAF was still much higher than not having voice, but it became increasingly annoying as we were met with more and more “Sorry, I can’t help with that” or “I don’t know the answer to that, but according to XYZ source here is the answer”. It generally worked, but not reliably and was often a fuss to get answers to arbitrary questions.

Then there is the usual privacy concern of having online microphones throughout your home, and the annoyance that every time AWS or something else went down you couldn’t use voice to control lights in the house.

Starting Out

I started by playing with one of Ollama’s included models. Every few weeks I would connect Ollama to HA, spin up Assist, and try to use it. Every time I was disappointed and surprised by its lack of abilities; most of the time even basic tool calls would not work. I do believe HA has made things better, but I think the biggest issue was my own understanding.

The models you see listed on Ollama are not even close to exhaustive in terms of what can be run. Worse yet, the default :4b tags, for example, are often low quantization (Q4_K), which can cause a lot of problems. Once I learned I could use HuggingFace to find GGUF models at higher quantizations, Assist immediately performed much better, with no problems with tool calling.
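To get a feel for why quantization choice matters for GPU sizing, here is a rough back-of-envelope sketch. The bits-per-weight figures are approximations (Q4_K_M packs roughly 4.8 bits per weight including scales, Q8_0 roughly 8.5), and real deployments also need room for the KV cache and runtime overhead:

```python
# Approximate bits per weight for common GGUF quantizations.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0}

def approx_size_gb(params_billion, quant):
    """Rough model file size in GB: parameter count times bits per weight."""
    total_bits = BITS_PER_WEIGHT[quant] * params_billion * 1e9
    return round(total_bits / 8 / 1e9, 2)

# A 4B model at a higher quantization still fits comfortably in 8 GB of VRAM.
for quant in ("Q4_K_M", "Q6_K", "Q8_0"):
    print(f"4B model at {quant}: ~{approx_size_gb(4, quant)} GB")
```

The takeaway: stepping a 4B model up from Q4 to Q8 costs only a couple of gigabytes, which is often a worthwhile trade for more reliable tool calling.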

Testing with Voice

After getting to the point where the fundamental basics were possible, I ordered a Voice Preview Edition to use for testing so I could get a better idea of the end-to-end experience. It took me some time to get things working well: originally I had WiFi reception issues where ping was very inconsistent on the VPE (despite it being next to the router), which led to stuttery speech output with a lot of mid-word pauses. After adjusting Piper to use streaming and creating a new dedicated IoT network, performance has been much better.

Making Assist Useful

Controlling devices is great, and Ollama’s ability to adjust devices when the local processing missed a command was helpful. But to replace our speakers, Assist had to be capable of the following things:

  • Ability to give Day and Week Weather Forecasts
  • Ability to ask about a specific business to get opening / closing times
  • Ability to do general knowledge lookup to answer arbitrary questions
  • Ability to play music with search abilities entirely with voice

At first I was under the impression these would have to be built out separately, but I eventually found the brilliant llm-intents integration, which provides a number of these services to Assist (and, by extension, Ollama). Once I set these up, though, the results were mediocre.

The Importance of Your LLM Prompt

For those that want to see it, here is my prompt.

This is when I learned that the prompt will make or break your voice experience. The default HA prompt won’t get you very far, as LLMs need a lot of guidance to know what to do and when.

I generally improved my prompt by taking my current prompt and putting it into ChatGPT along with a description of the LLM's current behavior and the desired behavior, then going back and forth until I consistently got the desired result. After a few cycles of this, I started to get a feel for how to make these improvements myself.

I started by trying to get weather working; the first challenge was getting the LLM to even call the weather service. I have found that having a dedicated # section for each important service, along with a bulleted list of details / instructions, works best.

Then I needed the weather response formatted in a desirable way without extra information. At first, the response would include extra commentary such as “sounds like a nice summery day!” or other things that detracted from its conciseness. Once that was solved, providing a specific example of the output worked best to get the exact response format desired.
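The section-per-service pattern might look like the following hypothetical fragment; the wording, section names, and example response are invented for illustration (the author's actual prompt is in the linked Gist):

```python
# Hypothetical prompt fragment illustrating the pattern described above:
# one "#" section per service, a bulleted instruction list, and a literal
# example of the desired output format.
WEATHER_SECTION = """\
# Weather
- When asked about the weather, ALWAYS call the weather forecast tool first.
- Answer in one sentence, with no commentary and no emojis.
- Example response: "Today will be sunny with a high of 75 and a low of 58."
"""

PLACES_SECTION = """\
# Places
- For questions about businesses or opening hours, ALWAYS call the place
  search tool; never claim you do not know the user's location.
"""

PROMPT = "You are a helpful voice assistant.\n\n" + WEATHER_SECTION + "\n" + PLACES_SECTION
print(PROMPT)
```

Keeping each service in its own section makes it easy to iterate on one behavior at a time without disturbing the others.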

For places and search, the problem was much the same: it did not want to call the tool, insisting instead that it did not know the user’s location or the answer to specific questions. This mostly just needed specific instructions to always call the tool when certain types of questions were asked, and that has worked well.

The final problem I had to solve was emojis: most responses would end with a smiley face or similar, which is not good for TTS. This took several sections in the prompt, but has completely removed them without adverse effects.
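The author solved this purely through prompt instructions; as an alternative or backstop, emojis can also be stripped in post-processing before the text reaches TTS. A minimal sketch (the character ranges cover the common emoji blocks, not every Unicode symbol):

```python
import re

# Strip emoji and pictographs from LLM output before handing it to TTS.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended-A
    "\U00002600-\U000027BF"  # misc symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) pairs
    "]+"
)

def strip_emoji(text):
    """Remove emoji and trim any trailing whitespace left behind."""
    return EMOJI_RE.sub("", text).rstrip()

print(strip_emoji("The lights are on! 😊"))  # prints "The lights are on!"
```

A filter like this is deterministic, whereas prompt-based suppression depends on the model following instructions every time.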

Solving Some Problems Manually

NOTE: I am not sure whether a recent Home Assistant or Music Assistant update improved things, but the LLM is now able to naturally search for and play music without the automation. I am leaving this section in as an example, since I still believe automations can be a good way to solve problems when there is no easy way to give the LLM access to a certain feature.

It would certainly be the most desirable outcome for every function to be executed perfectly by the LLM without intervention, but at least in my case, with the model I am using, that is not true. There are cases, though, where that really is not a bad thing.

For me, music was one such case. I believe this is an area where improvements are currently being made, but the automatic route was not working well for me. I started by getting Music Assistant set up. I found various LLM blueprints to create a script that lets the LLM start playing music automatically, but they did not work well for me.

That is when I realized the power of the sentence automation trigger and the beauty of Music Assistant. I created an automation that triggers on “Play {music}”. The automation contains a map of assist_satellite to media_player, so it plays music on the correct media player based on which satellite makes the request. It then passes {music} (which can be a song, album, artist, whatever) to Music Assistant’s play service, which performs the search and starts playback.

Example Automation
```yaml
alias: Music Shortcut
description: ""
triggers:
  - trigger: conversation
    command:
      - Play {music}
    id: play
  - trigger: conversation
    command: Stop playing
    id: stop
conditions: []
actions:
  - choose:
      - conditions:
          - condition: trigger
            id:
              - play
        sequence:
          - action: music_assistant.play_media
            metadata: {}
            data:
              media_id: "{{ trigger.slots.music }}"
            target:
              entity_id: "{{ target_player }}"
          - set_conversation_response: Playing {{ trigger.slots.music }}
      - conditions:
          - condition: trigger
            id:
              - stop
        sequence:
          - action: media_player.media_stop
            metadata: {}
            data: {}
            target:
              entity_id: "{{ target_player }}"
          - set_conversation_response: Stopped playing music.
variables:
  satellite_player_map: |
    {{
      {
        "assist_satellite.home_assistant_voice_xyz123": "media_player.my_desired_speaker",
      }
    }}
  target_player: |
    {{
      satellite_player_map.get(trigger.satellite_id, "media_player.default_speaker")
    }}
mode: single
```
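The routing logic in the `variables` block is just a dictionary lookup with a fallback. In Python terms (entity IDs are the same placeholders used in the example above):

```python
# Mirrors the satellite-to-speaker routing in the automation's variables
# block: the satellite that heard the command selects the speaker, with a
# default for any unmapped satellite. Entity IDs are placeholders.
SATELLITE_PLAYER_MAP = {
    "assist_satellite.home_assistant_voice_xyz123": "media_player.my_desired_speaker",
}

def target_player(satellite_id):
    return SATELLITE_PLAYER_MAP.get(satellite_id, "media_player.default_speaker")

print(target_player("assist_satellite.home_assistant_voice_xyz123"))
print(target_player("assist_satellite.some_other_satellite"))
```

Adding another satellite is a one-line change to the map, which is what makes this pattern easy to maintain as the number of rooms grows.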

Training a Custom Wakeword

The next problem to solve was the wake word. For WAF, the default included options weren’t going to work. After some back and forth we decided on “Hey Robot”. I used this repo to train a custom microWakeWord, which is usable on the VPE and Satellite1. Training only took ~30 minutes on my GPU and the results have been quite good. There are some false positives, but overall the rate is similar to the Google Homes that were replaced, and with the ability to automate muting we can likely work around that problem until the training / options become better.

The End Result

I definitely would not recommend this for the average Home Assistant user; IMO a lot of patience and research is needed to understand particular problems and work toward solutions, and I imagine we will run into more problems as we continue using this. I am certainly not done, but that is the beauty of this solution - most aspects of it can be tuned.

The goal has been met though, overall we have a more enjoyable voice assistant that runs locally without privacy concerns, and our core tasks are handled reliably.

Let me know what you think! I am happy to answer any questions.
