Building voice agents with Nvidia open models

Original link: https://www.daily.co/blog/building-voice-agents-with-nvidia-open-models/

## Building an ultra-low-latency voice agent with NVIDIA open models

This post details how to build a fast voice agent from NVIDIA's open models: Nemotron Speech ASR for speech-to-text, Nemotron 3 Nano as the LLM, and Magpie for text-to-speech. The goal is extremely low latency for responsive voice interaction. The agent uses a pipeline architecture, currently the best approach for complex, enterprise-grade use cases, and is optimized for speed with techniques such as streaming transcription and interleaved inference. Nemotron Speech ASR produces final transcripts within 24 ms, on par with commercial models; Nemotron 3 Nano performs strongly in multi-turn conversation; and a custom Magpie streaming server lowers latency further. The code is available on GitHub and can run on Modal for scalable deployment or locally on an NVIDIA DGX Spark or RTX 5090 for development. Key optimizations include running turn detection in parallel and, for local setups, carefully scheduling LLM and TTS inference on a single GPU. The work demonstrates the growing potential of open models in voice AI, offering customization, control, and the ability to optimize for specific needs. NVIDIA's permissive licensing encourages commercial use and further innovation in this fast-moving field.

A Hacker News discussion, prompted by this daily.co link, centers on building voice agents with NVIDIA's open models. Commenters are looking for modern replacements for the older Festival speech synthesis system that are easy to install (via `apt`). Several options come up: **Piper**, which many consider far better than Festival and *can* be installed via `apt`, despite some initial confusion with a different application of the same name; and **Unmute.sh**, highlighted as an excellent open-source alternative that currently only runs on NVIDIA hardware. The conversation reflects growing satisfaction with the quality of modern speech synthesis: one user is happy to integrate it into their agent framework, and another wants to use it with the Cursor editor for voice-controlled interaction. Overall, the thread shows interest in accessible, open-source tools for building voice-driven applications.

Original article

How to Build Ultra-low-latency Voice Agents With NVIDIA Cache-aware Streaming ASR

This post accompanies the launch of NVIDIA Nemotron Speech ASR on Hugging Face. Read the full model announcement here.

In this post, we’ll build a voice agent using three NVIDIA open models:

  • Nemotron Speech ASR for speech-to-text
  • Nemotron 3 Nano as the text-mode LLM
  • Magpie for text-to-speech

This voice agent leverages the new streaming ASR model, Pipecat’s low-latency voice agent building blocks, and some fun code experiments to optimize all three models for very fast response times.

All the code for the post is here in this GitHub repository.

You can clone the repo and run this voice agent:

  • Scalably for multi-user workloads on the Modal cloud platform.
  • On an NVIDIA DGX Spark or RTX 5090 for single-user, local development and experimentation.

Feel free to just jump over to the code. Or read on for technical notes about building fast voice agents and the NVIDIA open models.

Voice agent deployments are growing by leaps and bounds across a wide range of use cases. For example, we’re seeing voice agents used at scale today in:

  • Customer support
  • Answering the phone for small businesses (for example, restaurants)
  • User research
  • Outbound phone calls to prepare patients for healthcare appointments
  • Validation workflows for loan applications
  • And many, many other scenarios

Both startups and large, established companies are building voice agents that are successful in real-world deployments. The best voice agents today achieve very high “task completed” success metrics and customer satisfaction scores.

Voice AI architecture

As is the case with everything in AI, voice agent technology is evolving rapidly. Today, there are two ways to build voice agents.

  1. Most production voice agents use specialized models together in a pipeline – a speech-to-text model, a text-mode LLM, and a text-to-speech model.
  2. Voice agent developers are beginning to experiment with new speech-to-speech models that take voice input directly and output audio instead of text.

On the left, a block diagram of a voice agent that uses a “pipeline” of specialized AI models. On the right, a voice agent built with a speech-to-speech LLM.

Using three specialized models is currently the best approach for enterprise use cases that require the highest degree of model intelligence and flexibility. But speech-to-speech models are an exciting development and will be a big part of the future of voice AI.
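
To make the pipeline approach concrete, here is a minimal sketch of how such a pipeline is composed with Pipecat. The concrete service classes for Nemotron Speech ASR, Nemotron 3 Nano, and Magpie live in the accompanying GitHub repository; this sketch simply takes them as parameters, and the import paths shown are the standard Pipecat ones.

```python
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask


def build_voice_pipeline(transport, stt, llm, tts, context_aggregator) -> Pipeline:
    """Compose the classic STT -> LLM -> TTS voice agent pipeline.

    The concrete services (Nemotron Speech ASR, Nemotron 3 Nano, Magpie)
    are passed in; see the GitHub repository for the real classes.
    """
    return Pipeline([
        transport.input(),               # audio in from WebRTC or telephony
        stt,                             # speech-to-text (Nemotron Speech ASR)
        context_aggregator.user(),       # append user transcripts to the context
        llm,                             # text-mode LLM (Nemotron 3 Nano)
        tts,                             # text-to-speech (Magpie)
        transport.output(),              # audio out to the user
        context_aggregator.assistant(),  # append bot responses to the context
    ])


async def run_agent(pipeline: Pipeline) -> None:
    task = PipelineTask(pipeline)
    await PipelineRunner().run(task)
```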

Whether we use a pipeline or a unified speech-to-speech model, voice agents are doing more and more sophisticated tasks. This means that, increasingly, production voice agents are actually multi-agent systems. Inside an agent, sub-agents handle asynchronous tasks, manage the conversation context, and allow code re-use between text and voice agents.

A voice agent that is a multi-agent system under the covers. This agent uses tool calls to start long-running tasks that stream structured data into the context of the voice conversation.
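
As a rough illustration of that pattern (not the code from this project), here is a hedged sketch in which a tool call returns immediately to keep the conversation flowing, while a background sub-agent task streams its result into the shared conversation context when it finishes. All names here (`lookup_order_status`, the context list) are hypothetical.

```python
import asyncio

# Hypothetical sketch of a sub-agent pattern: a tool call starts a
# long-running task, and the result lands in the conversation context later.
conversation_context: list[dict] = []


async def lookup_order_status(order_id: str) -> None:
    """Long-running sub-agent task; pushes structured data into the context."""
    await asyncio.sleep(2.0)  # stand-in for a slow backend or sub-agent call
    conversation_context.append({
        "role": "system",
        "content": f"Order {order_id} status: shipped, arriving Tuesday.",
    })


async def handle_tool_call(name: str, args: dict) -> str:
    """Called when the LLM emits a tool call. Returns immediately so the
    voice conversation keeps flowing while the task runs in the background."""
    if name == "lookup_order_status":
        asyncio.create_task(lookup_order_status(args["order_id"]))
        return "Looking that up now."
    return f"Unknown tool: {name}"
```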

For a deep dive into voice agent architectures, models, and infrastructure, see the Voice AI & Voice Agents Illustrated Primer.

Open source models

Open models have not been widely used for production voice agents.

Voice agents are among the most demanding AI use cases. They carry on long conversations, must operate on noisy input audio, and must respond very quickly. Enterprise voice agent use cases require highly accurate instruction following and function calling. People interacting with voice agents have very high expectations for the naturalness and “human-like” quality of the voice audio. In all of these areas, proprietary AI models have performed better than open models.

However, this is changing. Nemotron Speech ASR is both fast and accurate. On our benchmarks it performs comparably with or better than commercial speech-to-text models used today in production voice agents. Nemotron 3 Nano is the best-performing LLM in its class on our long-context, multi-turn conversation benchmarks.

Using open models allows us to configure and customize our models and inference stacks for the specific needs of our voice agents in ways that we can’t do with proprietary models. We can optimize for latency, fine-tune on our own data, host inference within our VPCs to satisfy data privacy and regulatory requirements, and implement observability that allows us to deliver the highest levels of reliability, scalability, and consistency.

We expect open models to be used in a larger and larger proportion of voice agent deployments over time. There are various flavors of “open” model licenses. NVIDIA has made the Nemotron Speech ASR and Nemotron 3 Nano available under the NVIDIA Permissive Open-Model License, which allows for unrestricted commercial use and the creation of derivative works.

Fast, streaming transcription

The Nemotron Speech ASR model is designed specifically for use cases that demand very low latency transcription, such as voice agents.

The headline number here is that Nemotron Speech ASR consistently delivers final transcripts in under 24ms!

ASR (Automatic Speech Recognition) is the general term for machine learning models that process speech input, then output text and other information about that speech. Previous generations of ASR models were generally designed for batch processing rather than realtime transcription. For example, the latency of the Whisper model is 600-800ms, and most commercial speech-to-text models today have latencies in the 200-400ms range.

| Model | Openness | Deployment |
| --- | --- | --- |
| Parakeet | open weights, open training data, open source inference | local in-cluster |
| Widely used commercial ASR | proprietary | cloud |
| Whisper Large V3 | open weights, open source inference | local in-cluster |

For more about the cache-aware architecture that enables this impressively low latency, see the NVIDIA post announcing the new model.

The model is also very accurate. The industry standard for measuring ASR model accuracy is word error rate. On all of our benchmarks, Nemotron Speech ASR’s word error rate is roughly equivalent to the best commercial ASR models and substantially better than previous-generation open models like Whisper.

To integrate Nemotron Speech ASR into Pipecat, we created a WebSocket server that performs the transcription inference and a client-side Pipecat service that can be used in any Pipecat agent.

ASR server architecture showing a streaming transcription pipeline. Audio enters through a WebSocket handler, flows to an audio accumulator, then to a mel-spectrogram preprocessor, followed by a streaming encoder. The encoded output is decoded using a greedy decoder to produce transcript output. A reset signal can be sent from the WebSocket handler directly to the decoder.
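
The shape of that server, reduced to its essentials, looks something like the sketch below. The message format, port, chunk size, and stub functions are assumptions for illustration; the real preprocessing, cache-aware encoder, and greedy decoder code is in the GitHub repository.

```python
import asyncio
import json

import websockets  # pip install websockets

CHUNK_BYTES = 2 * 16000 * 160 // 1000  # 160ms of 16 kHz, 16-bit mono PCM


def reset_decoder_state() -> None:
    """Placeholder: clear the cache-aware encoder/decoder caches between turns."""


def transcribe_chunk(pcm: bytes) -> str:
    """Placeholder for the real path: mel-spectrogram -> streaming encoder -> greedy decoder."""
    return ""


async def handle_client(ws):
    buffer = bytearray()
    async for message in ws:
        if isinstance(message, str):
            if json.loads(message).get("type") == "reset":
                buffer.clear()
                reset_decoder_state()
            continue
        buffer.extend(message)  # binary frames carry raw PCM audio
        while len(buffer) >= CHUNK_BYTES:
            text = transcribe_chunk(bytes(buffer[:CHUNK_BYTES]))
            del buffer[:CHUNK_BYTES]
            if text:
                await ws.send(json.dumps({"type": "transcript", "text": text}))


async def main():
    async with websockets.serve(handle_client, "0.0.0.0", 8765):
        await asyncio.Future()  # run until cancelled


if __name__ == "__main__":
    asyncio.run(main())
```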

Running turn detection in parallel with transcription

The Nemotron Speech ASR model can be configured with four different context sizes, each of which has a different latency/accuracy trade-off. The context sizes are 80ms, 160ms, 560ms, and 1.2s. We use the 160ms context size because it aligns with how we perform turn detection.

Turn detection means determining when the user has stopped speaking and the voice agent should respond. Accurate turn detection is critical to natural conversation. We’re using the open source Pipecat Smart Turn model in this voice agent. The Smart Turn model operates on input audio and runs in parallel with the Nemotron Speech ASR transcription.
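
Because the two models don’t depend on each other’s output, the audio fan-out can be as simple as the sketch below. The inference calls are stubbed placeholders; in the real agent they run over WebSockets and inside Pipecat services.

```python
import asyncio


async def transcribe(chunk: bytes) -> str:
    """Placeholder: stream this audio to the Nemotron Speech ASR server."""
    return ""


async def detect_turn_end(chunk: bytes) -> bool:
    """Placeholder: run the Smart Turn model; True means the user has finished."""
    return False


async def on_audio_chunk(chunk: bytes) -> None:
    # Transcription and turn detection read the same audio and don't depend on
    # each other's output, so they can run concurrently.
    transcript, turn_complete = await asyncio.gather(
        transcribe(chunk),
        detect_turn_end(chunk),
    )
    if turn_complete:
        print("End of turn:", transcript)
```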

We trigger both turn detection and transcript finalization any time we see a 200ms pause in the user’s speech. This gives us 200ms of “non-speech” trailing context after the user’s speech has finished. The Nemotron Speech ASR model actually needs a bit more trailing silence than this to properly finalize the last words of the user’s speech. The padding calculation is:

nemotron_final_padding = (right_context + 1) * shift_frames * hop_samples
    = (1 + 1) * 16 * 160
    = 5120 samples = 320ms

Our WebSocket transcription server receives 200ms of “non-speech” trailing audio data from the Pipecat service, and adds 120ms of synthetic silence to enable immediate finalization of the transcript. This works nicely.
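
In code, the arithmetic and the synthetic-silence top-up look roughly like this (16 kHz, 16-bit mono PCM assumed; the constant names are illustrative):

```python
SAMPLE_RATE = 16000            # Nemotron Speech ASR operates on 16 kHz audio
RIGHT_CONTEXT = 1              # right-context frames at the 160ms setting
SHIFT_FRAMES = 16
HOP_SAMPLES = 160

# Trailing audio the model needs before it will finalize the last words of a turn.
padding_samples = (RIGHT_CONTEXT + 1) * SHIFT_FRAMES * HOP_SAMPLES  # 5120 samples
padding_ms = 1000 * padding_samples // SAMPLE_RATE                  # 320ms

# The Pipecat service already sends 200ms of trailing non-speech audio, so the
# server tops it up with synthetic silence (16-bit PCM zeros) to reach 320ms.
trailing_ms = 200
silence_ms = padding_ms - trailing_ms                               # 120ms
silence = b"\x00\x00" * (SAMPLE_RATE * silence_ms // 1000)          # 3840 bytes

print(padding_ms, silence_ms, len(silence))  # -> 320 120 3840
```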

Nemotron 3 Nano

Nemotron 3 Nano is a new 30 billion parameter open source LLM from NVIDIA. Nemotron 3 Nano is the best performing model in its size class on our multi-turn conversation benchmarks.

| Model | Tool Use | Instruction | KB Ground | Pass Rate | Median Rate | TTFB Med | TTFB P95 | TTFB Max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-5.1 | 300/300 | 300/300 | 300/300 | 100.0% | 100.0% | 916ms | 2011ms | 5216ms |
| gemini-3-flash-preview | 300/300 | 300/300 | 300/300 | 100.0% | 100.0% | 1193ms | 1635ms | 6653ms |
| claude-sonnet-4-5 | 300/300 | 300/300 | 300/300 | 100.0% | 100.0% | 2234ms | 3062ms | 5438ms |
| gpt-4.1 | 283/300 | 273/300 | 298/300 | 94.9% | 97.8% | 683ms | 1052ms | 3860ms |
| gemini-2.5-flash | 275/300 | 268/300 | 300/300 | 93.7% | 94.4% | 594ms | 1349ms | 2104ms |
| gpt-5-mini | 271/300 | 272/300 | 289/300 | 92.4% | 95.6% | 6339ms | 17845ms | 27028ms |
| gpt-4o-mini | 271/300 | 262/300 | 293/300 | 91.8% | 92.2% | 760ms | 1322ms | 3256ms |
| nemotron-3-nano-30b-a3b* | 287/304 | 286/304 | 298/304 | 91.4% | 93.3% | 171ms | 199ms | 255ms |
| gpt-4o | 278/300 | 249/300 | 294/300 | 91.2% | 95.6% | 625ms | 1222ms | 13378ms |
| gpt-oss-120b (groq) | 272/300 | 270/300 | 298/300 | 89.3% | 90.0% | 98ms | 226ms | 2117ms |
| gpt-5.2 | 224/300 | 228/300 | 250/300 | 78.0% | 92.2% | 819ms | 1483ms | 1825ms |
| claude-haiku-4-5 | 221/300 | 172/300 | 299/300 | 76.9% | 75.6% | 732ms | 1334ms | 4654ms |

[*] Nemotron 3 Nano hosted locally in-cluster on Blackwell GPUs

Like Nemotron Speech ASR, Nemotron 3 Nano is part of a new generation of open models that are designed specifically for speed and inference efficiency. See this resource from NVIDIA research for an overview of the Nemotron 3 hybrid Mamba-Transformer MoE architecture and links to technical papers.

A 30B parameter model is small enough to run very fast on high-end hardware, and can be quantized to run well on GPUs that many developers have at home!

| Model variant | Deployment | Resident memory |
| --- | --- | --- |
| Nemotron-3-Nano BF16 | full weights, Modal Cloud or DGX Spark | 72GB |
| Nemotron-3-Nano Q8 | 8-bit quantization, faster operation on DGX Spark | 32GB |
| Nemotron-3-Nano Q4 | 4-bit quantization, RTX 5090 | 24GB |

One note on which LLMs are generally used today for production voice agents: in general, voice agents for applications like customer support need the most “intelligent” models we have available. Voice agent use cases are demanding. A customer support AI agent must do highly accurate instruction following and function calling tasks throughout a long, open-ended, unpredictable human conversation. A 30B parameter model – even one as good as Nemotron 3 Nano – is generally best suited for specialized voice tasks like a home assistant or software voice UI interface.

NVIDIA has announced that two larger Nemotron 3 models are coming soon. If the performance of these larger models relative to their size is similar to Nemotron 3 Nano’s performance, we expect these models to be terrific intelligence engines for voice agents.

In the meantime, Nemotron 3 Nano is the best-performing LLM that I can run on hardware I have at home. I’ve been using this model for a wide variety of “local” voice agent tasks and development experiments on both an NVIDIA DGX Spark and on my desktop computer with an RTX 5090.

You can use Nemotron 3 in reasoning or non-reasoning mode. We usually turn off reasoning for the fast-response core voice agent loop. 

For details on using Nemotron 3 Nano in the cloud and building local containers with the latest CUDA, vLLM and llama.cpp support for this new model, see the GitHub repository accompanying this post. There are a couple of inference tooling patches (relating to the reasoning output format in vLLM and to llama.cpp KV caching) that you might find useful if you’re experimenting with this model.

Magpie streaming server

Magpie is a family of text-to-speech models from NVIDIA. In our voice agent project, we’re using an experimental preview checkpoint of an upcoming open source version of Magpie.

Kudos to NVIDIA for releasing this early look at a Magpie model designed, like Nemotron Speech ASR, for streaming, low-latency use cases! We’ve been having a lot of fun experimenting with this preview, doing things that are only possible with open source weights and inference code.

You can use this Magpie model in batch mode by sending an HTTP request with a chunk of text. This batch mode inference delivers audio for a single sentence in about 600ms on the DGX Spark and 300ms on the RTX 5090. But for voice agents, we like to stream all tokens as much as we can, and because Magpie is open source, we can hack together a hybrid streaming mode that optimizes for initial audio chunk latency! This hybrid streaming approach improves average initial response latency 3x.

TTS TTFB Comparison: Batch → Streaming

| Hardware | P50 Improvement | Mean Improvement | P90 Improvement |
| --- | --- | --- | --- |
| RTX 5090 | 90 ms (1.9x) | 204 ms (3.0x) | 430 ms (5.2x) |
| DGX Spark | 236 ms (2.3x) | 415 ms (3.3x) | 836 ms (4.6x) |

Details

RTX 5090

| Mode | Min | Max | P50 | P90 | Mean |
| --- | --- | --- | --- | --- | --- |
| Batch | 106 ms | 630 ms | 191 ms | 533 ms | 305 ms |
| Pipeline | 99 ms | 103 ms | 101 ms | 103 ms | 101 ms |

DGX Spark

| Mode | Min | Max | P50 | P90 | Mean |
| --- | --- | --- | --- | --- | --- |
| Batch | 193 ms | 1440 ms | 422 ms | 1067 ms | 595 ms |
| Pipeline | 15 ms | 276 ms | 186 ms | 231 ms | 180 ms |

There’s definitely a quality trade-off with our simple streaming implementation. Try the agent yourself, or listen carefully to the conversation in the video at the beginning of this blog post. You can usually hear a slight disfluency where we “stitch” together the streaming chunks at the beginning of the model response.

To do better, we’d need to retrain part of the model and use a slightly more sophisticated inference approach. Fortunately, this is on the NVIDIA road map.

We integrated this model into Pipecat by creating a WebSocket server for streaming inference, and a client-side Pipecat service. (This is the same approach we used with Nemotron Speech ASR).
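
A stripped-down client for a streaming TTS WebSocket server like this one might look like the following. The URL and message format here are assumptions for illustration; the actual Pipecat-to-Magpie protocol is defined by the server and service in the GitHub repository.

```python
import asyncio
import json

import websockets  # pip install websockets

MAGPIE_WS_URL = "ws://localhost:8766/tts"  # assumption: local Docker container


async def synthesize(text: str) -> bytes:
    """Send one piece of text and collect the streamed audio chunks."""
    audio = bytearray()
    async with websockets.connect(MAGPIE_WS_URL) as ws:
        await ws.send(json.dumps({"type": "synthesize", "text": text}))
        async for message in ws:
            if isinstance(message, bytes):
                audio.extend(message)  # streamed PCM audio chunk
            elif json.loads(message).get("type") == "done":
                break
    return bytes(audio)


if __name__ == "__main__":
    pcm = asyncio.run(synthesize("Hello! How can I help you today?"))
    print(f"received {len(pcm)} bytes of audio")
```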

Putting the models together and measuring latency

These Nemotron and upcoming Magpie models are completely open: open weights, open source training data sets, and open source inference tooling. Working with open models in production feels like a super-power. We can do things like:

  • Read the inference code to understand the context requirements of the ASR model, so that we can optimize the interactions between our Pipecat pipeline components and the speech-to-text audio buffer handling. (See our description of this above, in the section Fast, streaming transcription.)
  • Fix issues with inference tooling support in new models and on whatever platforms we’re running on. See the code and README.md in the GitHub repo for the small patches we made for vLLM and llama.cpp, and the Docker container build with full MXFP4 support for both of those inference servers on DGX Spark and RTX 5090.
  • Build a semi-streaming inference server for a preview model checkpoint.

Often when we’re building voice agents, our primary concern is to engineer the agent to respond quickly in a real-world conversation. The difference between good latency and an agent too slow to use in production is often a combination of several optimizations, each one cutting peak latencies by 100 or 200ms. Working with open models gives us control over how we prioritize for latency compared to throughput, how we design streaming and chunking of inference results, how to use models together optimally, and many other small things that add up (or subtract down) to fast response times.

It’s useful to measure voice-to-voice latency – the time between the user’s voice stopping and the bot’s voice response starting – in two places: on the server-side and at the client.

We can easily automate the server-side latency measurement. Our bot outputs a log line with a voice-to-voice latency metric for each turn.

2026-01-01 22:43:26.208 | INFO     | v2v_metrics:process_frame:54 - V2VMetrics: ServerVoiceToVoice TTFB: 465ms
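
A log line like the one above can come from a small Pipecat frame processor along these lines. This is a sketch of the idea, not the exact processor in the repo: it timestamps the end of user speech and logs the elapsed time when the bot starts speaking.

```python
import time

from loguru import logger
from pipecat.frames.frames import BotStartedSpeakingFrame, UserStoppedSpeakingFrame
from pipecat.processors.frame_processor import FrameProcessor


class V2VMetrics(FrameProcessor):
    """Logs server-side voice-to-voice latency: time from the end of user
    speech to the start of bot speech."""

    def __init__(self):
        super().__init__()
        self._user_stopped_at = None

    async def process_frame(self, frame, direction):
        await super().process_frame(frame, direction)
        if isinstance(frame, UserStoppedSpeakingFrame):
            self._user_stopped_at = time.monotonic()
        elif isinstance(frame, BotStartedSpeakingFrame) and self._user_stopped_at:
            elapsed_ms = (time.monotonic() - self._user_stopped_at) * 1000
            logger.info(f"V2VMetrics: ServerVoiceToVoice TTFB: {elapsed_ms:.0f}ms")
            self._user_stopped_at = None
        await self.push_frame(frame, direction)
```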

We also output log lines with time-to-first-byte for each of our models, and several other log lines that are useful for understanding exactly where we’re “spending our latency budget.” The Pipecat Playground shows graphs of these metrics, which is useful during development and testing. Here’s a test session with our bot running on an RTX 5090.

RTX 5090

| Metric | Min | P50 | P90 | Max |
| --- | --- | --- | --- | --- |
| ASR | 13ms | 19ms | 23ms | 70ms |
| LLM | 71ms | 171ms | 199ms | 255ms |
| TTS | 99ms | 108ms | 113ms | 146ms |
| V2V | 415ms | 508ms | 544ms | 639ms |

DGX Spark

| Metric | Min | P50 | P90 | Max |
| --- | --- | --- | --- | --- |
| ASR | 24ms | 27ms | 69ms | 122ms |
| LLM | 343ms | 750ms | 915ms | 1669ms |
| TTS | 158ms | 185ms | 204ms | 1171ms |
| V2V | 759ms | 1180ms | 1359ms | 2981ms |

It’s also critical to measure the voice-to-voice latency as actually perceived by the user. This is harder to do automatically, especially for telephone call voice agents. The best approach to measuring client-side voice-to-voice latency is to record a call, load the audio file into an audio editor, and measure the gap between the end of the user’s speech waveform and the start of the bot’s speech waveform. You can’t cheat this measurement, or forget to include an important processing component! We do this periodically in both development and testing, as a sanity check. Here, I’m measuring the latency of one turn from the conversation we recorded for the video at the top of this post, using the Descript editor.

You will typically see client-side voice-to-voice latency numbers about 250ms higher than server-side numbers for a WebRTC voice agent. This is time spent in audio processing at the operating system level, encoding and decoding, and network transport. This delta is usually a bit worse for telephone call agents: 300-600ms of extra latency in the telephony path that you can’t do much to optimize. (Though there are some basic things you should do, such as making sure your voice agent is hosted in the same region as your telephony provider’s servers.) For more on latency, see the Voice AI and Voice Agents Illustrated Guide.

An inference optimization for local voice agents

We have one more trick up our sleeve when we’re running voice agents locally on a single GPU.

When we run voice agents in production in the cloud, we run each AI model on a dedicated GPU. We stream tokens from each model as fast as we can, and send them down the Pipecat pipeline as they arrive. 

But when we’re running locally, all the models are sharing one GPU. In this context, we can engineer much faster voice-to-voice responses if we carefully schedule inference. In our voice agent for this project, we’re doing two things:

  1. We run the Smart Turn model on the CPU so that we can dedicate the GPU to transcription when user speech is arriving. The Smart Turn model runs faster on GPU, but it runs fast enough on CPU, and dividing up the workload this way gives us the best possible performance between the two models.
  2. We interleave small segments of LLM and TTS inference so that GPU resources are dedicated to one model at a time. This significantly reduces time-to-first-token for each model. First we generate a few small chunks of LLM tokens, then TTS audio, then LLM again, then TTS, etc. We generate a smaller segment for the very first response, so we can start audio playout as quickly as possible. We designed this interleaved chunking approach to work in concert with the hybrid Magpie streaming hack described above.

Here’s a sequence diagram showing the interleaved LLM and TTS inference. The three vertical lines in the diagram represent, from left to right:

  1. Tokens arriving in small batches to the Pipecat LLM service in the agent and being pushed down the pipeline.
  2. The Pipecat TTS service, managing the frames from the LLM service, dividing the stream on sentence boundaries, and making inference requests to the Magpie WebSocket server running in our local Docker container.
  3. The Magpie WebSocket server doing inference and sending back audio.

We wrote a custom WebSocket inference server for Magpie, so we control the Pipecat-to-Magpie protocol completely. We’re using llama-server code from the llama.cpp project for LLM inference. Traditional inference stacks aren’t really designed to do this specific kind of chunking, so our code sets a max tokens count (n_predict in llama.cpp), runs repeated small inference chunks, and does some of the buffer management client-side. This could be done more efficiently, using the llama.cpp primitives directly. Writing a perfectly optimized inference server for this interleaved design would be a fun weekend project, and is something that almost anyone with a little bit of programming experience and a willingness to go down some rabbit holes could work together with Claude Code to implement.
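
Under those assumptions, the chunked generation loop against llama-server looks roughly like the sketch below. The first chunk is kept very small so audio playout can start quickly, `cache_prompt` keeps the KV cache warm between requests, and the `synthesize` callback stands in for handing text to the Magpie service. The response-field handling (`content`, `stopped_eos`, `stopped_limit`) follows llama-server's completion API and may need adjusting for your version.

```python
import requests  # pip install requests

LLAMA_SERVER = "http://localhost:8080"  # llama-server from llama.cpp


def generate_interleaved(prompt: str, synthesize, first_chunk: int = 16, chunk: int = 48) -> str:
    """Generate the LLM response in small chunks, handing each chunk to TTS
    before asking for more tokens. `synthesize` stands in for the Magpie call."""
    text = ""
    n_predict = first_chunk  # tiny first chunk -> fastest possible first audio
    while True:
        resp = requests.post(
            f"{LLAMA_SERVER}/completion",
            json={
                "prompt": prompt + text,
                "n_predict": n_predict,
                "cache_prompt": True,  # reuse the KV cache across chunk requests
            },
            timeout=60,
        )
        data = resp.json()
        piece = data.get("content", "")
        text += piece
        if not piece:
            break  # defensive: nothing generated
        synthesize(piece)  # hand this chunk to the TTS server
        if data.get("stopped_eos") or not data.get("stopped_limit"):
            break  # the model finished on its own rather than hitting the token cap
        n_predict = chunk  # larger chunks once audio playout has started
    return text


if __name__ == "__main__":
    generate_interleaved("User: What's the weather like today?\nAssistant:", print)
```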

For enterprise-scale, production use, deploy this agent to the Modal GPU cloud. There are instructions in the GitHub README.md. Modal is a serverless GPU platform that makes it easy to deploy AI models for development or production use.

For local development, the GitHub repo has a Dockerfile for DGX Spark (arm64 + Blackwell GB10, CUDA 13.1) and RTX 5090 (x86_64 + Blackwell, CUDA 13.0).

If you’re interested in building voice agents, here are some resources you might find useful:
