Show HN: Real-time AI Voice Chat at ~500ms Latency

Original link: https://github.com/KoljaB/RealtimeVoiceChat

This project lets you hold a real-time spoken conversation with an AI. It captures your voice and streams it to a Python backend for transcription (RealtimeSTT) and large language model processing (Ollama or OpenAI). The AI's reply is synthesized back into speech (RealtimeTTS, with a choice of Kokoro, Coqui, or Orpheus engines) and returned to you. It features interruption handling, low latency, dynamic silence detection, and a simple web interface. Installation is either via Docker (recommended, especially for Linux/GPU) or a manual Python setup. Docker simplifies dependency management and bundles Ollama. Manual installation means managing your own Python environment and installing PyTorch; a CUDA-capable NVIDIA GPU is recommended for best performance. Key configuration options such as the TTS engine, LLM backend, and STT settings can be customized in the code files.

Koljab has released RealtimeVoiceChat, an open-source system for real-time, local voice conversations with large language models (LLMs), aiming for natural conversational pace and tackling the frustration caused by latency. It streams audio chunks over WebSockets and achieves roughly 500 ms response latency using RealtimeSTT (based on Whisper) and RealtimeTTS (supporting Coqui XTTSv2/Kokoro). The system is designed for local LLMs such as Ollama and also includes an OpenAI connector. Key features include interruptible conversations, smart turn detection that avoids cutting the user off, and a Dockerized setup. For best performance it requires a reasonably powerful CUDA-capable GPU. The discussion centers on the challenges of natural conversational flow, particularly around pauses and interruptions. Some users suggest the AI should analyze the input context to decide whether a pause is intentional; others discuss speech-to-text and text-to-speech models such as Sesame and Dia, as well as custom turn-detection strategies.

Original text

Have a natural, spoken conversation with an AI!

This project lets you chat with a Large Language Model (LLM) using just your voice, receiving spoken responses in near real-time. Think of it as your own digital conversation partner.

(Demo video: FastVoiceTalk_compressed_step3_h264.mp4)

(early preview - first reasonably stable version)

A sophisticated client-server system built for low-latency interaction:

  1. 🎙️ Capture: Your voice is captured by your browser.
  2. ➡️ Stream: Audio chunks are whisked away via WebSockets to a Python backend.
  3. ✍️ Transcribe: RealtimeSTT rapidly converts your speech to text.
  4. 🤔 Think: The text is sent to an LLM (like Ollama or OpenAI) for processing.
  5. 🗣️ Synthesize: The AI's text response is turned back into speech using RealtimeTTS.
  6. ⬅️ Return: The generated audio is streamed back to your browser for playback.
  7. 🔄 Interrupt: Jump in anytime! The system handles interruptions gracefully.
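
To make the pipeline concrete, here is a minimal, hypothetical server-side sketch of steps 2-6. It is not the project's actual server.py: the /ws endpoint path and the transcribe/ask_llm/synthesize stubs merely stand in for RealtimeSTT, the Ollama/OpenAI backend, and RealtimeTTS.

    from typing import Iterator, Optional

    from fastapi import FastAPI, WebSocket

    app = FastAPI()

    def transcribe(chunk: bytes) -> Optional[str]:
        """Stub for RealtimeSTT: return text once a full utterance has been heard."""
        return None  # placeholder

    def ask_llm(text: str) -> str:
        """Stub for the Ollama/OpenAI call."""
        return "placeholder reply"

    def synthesize(reply: str) -> Iterator[bytes]:
        """Stub for RealtimeTTS: yield synthesized audio chunks for the reply."""
        yield b""  # placeholder

    @app.websocket("/ws")
    async def voice_chat(ws: WebSocket) -> None:
        await ws.accept()
        while True:
            chunk = await ws.receive_bytes()    # 2. Stream: audio chunk from the browser
            text = transcribe(chunk)            # 3. Transcribe
            if text is None:                    # utterance not finished yet, keep listening
                continue
            reply = ask_llm(text)               # 4. Think
            for audio in synthesize(reply):     # 5. Synthesize
                await ws.send_bytes(audio)      # 6. Return: stream audio back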

Key Features ✨

  • Fluid Conversation: Speak and listen, just like a real chat.
  • Real-Time Feedback: See partial transcriptions and AI responses as they happen.
  • Low Latency Focus: Optimized architecture using audio chunk streaming.
  • Smart Turn-Taking: Dynamic silence detection (turndetect.py) adapts to the conversation pace.
  • Flexible AI Brains: Pluggable LLM backends (Ollama default, OpenAI support via llm_module.py).
  • Customizable Voices: Choose from different Text-to-Speech engines (Kokoro, Coqui, Orpheus via audio_module.py).
  • Web Interface: Clean and simple UI using Vanilla JS and the Web Audio API.
  • Dockerized Deployment: Recommended setup using Docker Compose for easier dependency management.

Technology Stack 🛠️

  • Backend: Python 3.x, FastAPI
  • Frontend: HTML, CSS, JavaScript (Vanilla JS, Web Audio API, AudioWorklets)
  • Communication: WebSockets
  • Containerization: Docker, Docker Compose
  • Core AI/ML Libraries:
    • RealtimeSTT (Speech-to-Text)
    • RealtimeTTS (Text-to-Speech)
    • transformers (Turn detection, Tokenization)
    • torch / torchaudio (ML Framework)
    • ollama / openai (LLM Clients)
  • Audio Processing: numpy, scipy

Before You Dive In: Prerequisites 🏊‍♀️

This project leverages powerful AI models, which have some requirements:

  • Operating System:
    • Docker: Linux is recommended for the best GPU integration with Docker.
    • Manual: The provided script (install.bat) is for Windows. Manual steps are possible on Linux/macOS but may require more troubleshooting (especially for DeepSpeed).
  • 🐍 Python: 3.9 or higher (if setting up manually).
  • 🚀 GPU: A powerful CUDA-enabled NVIDIA GPU is highly recommended, especially for faster STT (Whisper) and TTS (Coqui). Performance on CPU-only or weaker GPUs will be significantly slower.
    • The setup assumes CUDA 12.1. Adjust PyTorch installation if you have a different CUDA version.
    • Docker (Linux): Requires NVIDIA Container Toolkit.
  • 🐳 Docker (Optional but Recommended): Docker Engine and Docker Compose v2+ for the containerized setup.
  • 🧠 Ollama (Optional): If using the Ollama backend without Docker, install it separately and pull your desired models. The Docker setup includes an Ollama service.
  • 🔑 OpenAI API Key (Optional): If using the OpenAI backend, set the OPENAI_API_KEY environment variable (e.g., in a .env file or passed to Docker).
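
    For example, a minimal .env file in the project root (placeholder value shown):

    OPENAI_API_KEY=sk-your-key-here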

Getting Started: Installation & Setup ⚙️

Clone the repository first:

git clone https://github.com/KoljaB/RealtimeVoiceChat.git
cd RealtimeVoiceChat

Now, choose your adventure:

🚀 Option A: Docker Installation (Recommended for Linux/GPU)

This is the most straightforward method, bundling the application, dependencies, and even Ollama into manageable containers.

  1. Build the Docker images: (This takes time! It downloads base images, installs Python/ML dependencies, and pre-downloads the default STT model.)

    (If you want to customize models/settings in code/*.py, do it before this step!)
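
    (A typical command for this step, assuming the repository's provided docker-compose.yml, is:)

    docker compose build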

  2. Start the services (App & Ollama): (Runs containers in the background. GPU access is configured in docker-compose.yml.)

    Give them a minute to initialize.
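
    (As noted under "Running the Application" below, this is typically:)

    docker compose up -d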

  3. (Crucial!) Pull your desired Ollama Model: (This is done after startup to keep the main app image smaller and allow model changes without rebuilding. Execute this command to pull the default model into the running Ollama container.)

    # Pull the default model (adjust if you configured a different one in server.py)
    docker compose exec ollama ollama pull hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M
    
    # (Optional) Verify the model is available
    docker compose exec ollama ollama list
  4. Stopping the Services:
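
    (With Docker Compose v2 this is usually:)

    docker compose down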

  5. Restarting:
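
    (Typically the same command used to start the services:)

    docker compose up -d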

  6. Viewing Logs / Debugging:

    • Follow app logs: docker compose logs -f app
    • Follow Ollama logs: docker compose logs -f ollama
    • Save logs to file: docker compose logs app > app_logs.txt

🛠️ Option B: Manual Installation (Windows Script / venv)

This method requires managing the Python environment yourself. It offers more direct control but can be trickier, especially regarding ML dependencies.

B1) Using the Windows Install Script:

  1. Ensure you meet the prerequisites (Python, potentially CUDA drivers).
  2. Run install.bat from the repository root. The script attempts to create a venv, install PyTorch for CUDA 12.1, a compatible DeepSpeed wheel, and the other requirements. (It opens a new command prompt within the activated virtual environment.) Then proceed to the "Running the Application" section.

B2) Manual Steps (Linux/macOS/Windows):

  1. Create & Activate Virtual Environment:

    python -m venv venv
    # Linux/macOS:
    source venv/bin/activate
    # Windows:
    .\venv\Scripts\activate
  2. Upgrade Pip:

    python -m pip install --upgrade pip
  3. Navigate to Code Directory:
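
    (Assuming the Python sources live in the repo's code/ directory, as referenced in "Configuration Deep Dive":)

    cd code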

  4. Install PyTorch (Crucial Step - Match Your Hardware!):

    • With NVIDIA GPU (CUDA 12.1 Example):
      # Verify your CUDA version! Adjust 'cu121' and the URL if needed.
      pip install torch==2.5.1+cu121 torchaudio==2.5.1+cu121 torchvision --index-url https://download.pytorch.org/whl/cu121
    • CPU Only (Expect Slow Performance):
      # pip install torch torchaudio torchvision
    • Find other PyTorch versions: https://pytorch.org/get-started/previous-versions/
  5. Install Other Requirements:

    pip install -r requirements.txt
    • Note on DeepSpeed: The requirements.txt may include DeepSpeed. Installation can be complex, especially on Windows. The install.bat tries a precompiled wheel. If manual installation fails, you might need to build it from source or consult resources like deepspeedpatcher (use at your own risk). Coqui TTS performance benefits most from DeepSpeed.

Running the Application ▶️

If using Docker: Your application is already running via docker compose up -d! Check logs using docker compose logs -f app.

If using Manual/Script Installation:

  1. Activate your virtual environment (if not already active):
    # Linux/macOS: source ../venv/bin/activate
    # Windows: ..\venv\Scripts\activate
  2. Navigate to the code directory (if not already there):
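    # Assuming the sources live in the repo's code/ directory: cd code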
  3. Start the FastAPI server:
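    # e.g., assuming server.py is the entry point: python server.py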

Accessing the Client (Both Methods):

  1. Open your web browser to http://localhost:8000 (or your server's IP if running remotely/in Docker on another machine).
  2. Grant microphone permissions when prompted.
  3. Click "Start" to begin chatting! Use "Stop" to end and "Reset" to clear the conversation.

Configuration Deep Dive 🔧

Want to tweak the AI's voice, brain, or how it listens? Modify the Python files in the code/ directory.

⚠️ Important Docker Note: If using Docker, make any configuration changes before running docker compose build to ensure they are included in the image.

  • TTS Engine & Voice (server.py, audio_module.py):

    • Change START_ENGINE in server.py to "coqui", "kokoro", or "orpheus".
    • Adjust engine-specific settings (e.g., voice model path for Coqui, speaker ID for Orpheus, speed) within AudioProcessor.__init__ in audio_module.py.
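
    For example, a sketch of switching engines (illustrative value only):

      # In server.py
      START_ENGINE = "kokoro"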
  • LLM Backend & Model (server.py, llm_module.py):

    • Set LLM_START_PROVIDER ("ollama" or "openai") and LLM_START_MODEL (e.g., "hf.co/..." for Ollama, model name for OpenAI) in server.py. Remember to pull the Ollama model if using Docker (see Installation Step A3).
    • Customize the AI's personality by editing system_prompt.txt.
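
    For example (the model string below is the default pulled in Installation Step A3; adjust to whatever you use):

      # In server.py
      LLM_START_PROVIDER = "ollama"
      LLM_START_MODEL = "hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M"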
  • STT Settings (transcribe.py):

    • Modify DEFAULT_RECORDER_CONFIG to change the Whisper model (model), language (language), silence thresholds (silence_limit_seconds), etc. The default base.en model is pre-downloaded during the Docker build.
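
    A sketch of such an override (key names from the bullet above; base.en is the documented default model, the silence value is illustrative):

      # In transcribe.py
      DEFAULT_RECORDER_CONFIG = {
          "model": "base.en",               # Whisper model
          "language": "en",
          "silence_limit_seconds": 0.5,     # illustrative value
      }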
  • Turn Detection Sensitivity (turndetect.py):

    • Adjust pause duration constants within the TurnDetector.update_settings method.
  • SSL/HTTPS (server.py):

    • Set USE_SSL = True and provide paths to your certificate (SSL_CERT_PATH) and key (SSL_KEY_PATH) files.
    • Docker Users: You'll need to adjust docker-compose.yml to map the SSL port (e.g., 443) and potentially mount your certificate files as volumes.
    Generating Local SSL Certificates (Windows Example w/ mkcert)
    1. Install Chocolatey package manager if you haven't already.
    2. Install mkcert: choco install mkcert
    3. Run Command Prompt as Administrator.
    4. Install a local Certificate Authority: mkcert -install
    5. Generate certs (replace your.local.ip): mkcert localhost 127.0.0.1 ::1 your.local.ip
      • This creates .pem files (e.g., localhost+3.pem and localhost+3-key.pem) in the current directory. Update SSL_CERT_PATH and SSL_KEY_PATH in server.py accordingly. Remember to potentially mount these into your Docker container.

Got ideas or found a bug? Contributions are welcome! Feel free to open issues or submit pull requests.

The core codebase of this project is released under the MIT License (see the LICENSE file for details).

This project relies on specific external TTS engines (like Coqui XTTSv2) and LLM providers, which have their own licensing terms. Please ensure you comply with the licenses of all components you use.
