TADA: Fast, Reliable Speech Generation Through Text-Acoustic Synchronization

原始链接: https://www.hume.ai/blog/opensource-tada

## TADA: A Breakthrough for Voice AI

The future of voice AI requires speech generation that is natural, fast, and reliable; current systems struggle with the inefficiency of how language models handle text and audio together. Hume AI introduces TADA (Text-Acoustic Dual Alignment), a new approach that synchronizes text and speech through a one-to-one tokenization scheme, resolving this core mismatch.

TADA generates speech **more than 5x faster** than comparable systems without sacrificing quality. Importantly, it **eliminates content hallucinations** (skipped words or fabricated content) by enforcing strict alignment between text and audio. Its lightweight design also enables **on-device deployment**, improving privacy and reducing latency.

Hume AI is **open-sourcing TADA** (code and pre-trained models are available on Hugging Face and GitHub) to accelerate innovation in voice AI. While limitations remain around very long content and multimodal generation, TADA shows strong potential for applications such as long-form narration, conversational AI, and reliable voice interfaces in sensitive industries.

This breakthrough promises more efficient, reliable, and accessible voice AI for researchers and developers.


Original Article

The future of voice AI hinges on sounding natural, fast, expressive, and free of quirks like hallucinated words or skipped content. Today's LLM-based TTS systems are forced to choose between speed, quality, and reliability because of a fundamental mismatch between how text and audio are represented inside language models.

TADA (Text-Acoustic Dual Alignment) resolves that mismatch with a novel tokenization schema that synchronizes text and speech one-to-one. The result: the fastest LLM-based TTS system available, with competitive voice quality, virtually zero content hallucinations, and a footprint light enough for on-device deployment.

Hume AI is open-sourcing TADA to accelerate progress toward efficient, reliable voice generation. Code and pre-trained models are available now.


Approach

For every second of spoken audio, the acoustic signal carries far more information than the corresponding text. A second of audio might be 2–3 text tokens but 12.5–25 acoustic frames. This mismatch means LLM-based TTS systems must manage sequences where audio tokens vastly outnumber text tokens — leading to longer context windows, higher memory consumption, slower inference, and more opportunities for the model to lose track of what it's supposed to say.
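The rate mismatch above can be made concrete with some back-of-envelope arithmetic. The rates below (3 text tokens/s, 25 acoustic frames/s) are taken from the ranges quoted in the text; exact values vary by tokenizer and audio codec.

```python
def sequence_lengths(duration_s: float,
                     text_tokens_per_s: float = 3.0,
                     acoustic_frames_per_s: float = 25.0) -> dict:
    """Compare how many LLM sequence positions text vs. audio consume
    for the same stretch of speech."""
    text = round(duration_s * text_tokens_per_s)
    audio = round(duration_s * acoustic_frames_per_s)
    return {
        "text_tokens": text,
        "audio_frames": audio,
        "audio_to_text_ratio": audio / text,
    }

# One minute of speech: 180 text tokens vs. 1500 acoustic frames,
# so audio outnumbers text by more than 8x in the context window.
print(sequence_lengths(60.0))
```

That ratio is what drives the longer context windows, higher memory use, and slower inference described above.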

Most existing systems address this by reducing audio frame rates or introducing intermediate "semantic" tokens between text and audio. Both approaches introduce their own tradeoffs: degraded expressiveness, added complexity, or both.

TADA takes a different path. Instead of compressing audio into fewer fixed-rate frames of discrete audio tokens, we align audio representations directly to text tokens — one continuous acoustic vector per text token. This creates a single, synchronized stream where text and speech move in lockstep through the language model.

For input audio, an encoder paired with an aligner extracts acoustic features from the audio segment corresponding to each text token. For output audio, the LLM's final hidden state serves as a conditioning vector for a flow-matching head, which generates acoustic features that are then decoded into audio and fed back into the model.

Since each LLM step corresponds to exactly one text token and one audio frame, TADA generates speech faster and with less computational effort. And because the architecture enforces a strict one-to-one mapping between text and audio, the model cannot skip or hallucinate content by construction.
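The lockstep decoding described above can be sketched as a simple loop. This is a hypothetical reconstruction based only on the description in this post; the function names (`lm_step`, `flow_matching_head`, `decode_frame`) are placeholders, not Hume's actual API.

```python
def synthesize(text_tokens, lm_step, flow_matching_head, decode_frame):
    """One LLM step per text token: consume the token plus the previous
    acoustic vector, emit exactly one new acoustic vector."""
    state = None
    prev_acoustic = None
    audio = []
    for tok in text_tokens:
        # Strict one-to-one alignment: the model advances one text token
        # per step, so it cannot skip or repeat content by construction.
        hidden, state = lm_step(tok, prev_acoustic, state)
        acoustic = flow_matching_head(hidden)  # hidden state conditions the head
        audio.append(decode_frame(acoustic))   # acoustic features -> waveform chunk
        prev_acoustic = acoustic               # fed back into the next step
    return audio
```

With real components, `lm_step` would be a forward pass of the Llama-based model and `decode_frame` the released audio decoder; here they are left abstract.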

Evaluation

We evaluated TADA on four axes: hallucination rate, naturalness, real-time factor, and speaker similarity.

Speed

TADA generates speech at a real-time factor (RTF) of 0.09, more than 5x faster than comparable LLM-based TTS systems. This is possible because TADA operates at just 2–3 frames (tokens) per second of audio, compared to 12.5–75 tokens per second in other approaches.
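For readers unfamiliar with the metric, RTF is simply generation time divided by audio duration, so values below 1 mean faster-than-playback synthesis:

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF < 1 means the system synthesizes speech faster than it plays back."""
    return generation_seconds / audio_seconds

# At an RTF of 0.09, a minute of speech takes roughly 5.4 seconds to generate.
rtf = real_time_factor(5.4, 60.0)
print(round(rtf, 2))  # 0.09
```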

Hallucination

Our model was trained on large-scale, in-the-wild data without post-training, yet achieves the same reliability as models trained on smaller curated datasets. We measured hallucination rate by flagging any sample with a character error rate (CER) above 0.15, a threshold that captures unintelligible speech, skipped text, and inserted content. Across the 1,000+ test samples from LibriTTS-R, TADA produced zero hallucinations.
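The flagging rule above is straightforward to implement. This sketch uses a plain dynamic-programming Levenshtein distance for self-containment; in practice a library such as jiwer is typically used to compute CER against ASR transcripts.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def is_hallucination(reference: str, hypothesis: str,
                     threshold: float = 0.15) -> bool:
    """Flag a sample whose CER against the reference exceeds the threshold."""
    cer = levenshtein(reference, hypothesis) / max(len(reference), 1)
    return cer > threshold

print(is_hallucination("hello world", "hello world"))    # False
print(is_hallucination("hello world", "hxllo wxrldzz"))  # True
```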

Voice Quality

In human evaluation on expressive, long-form speech (EARS dataset), TADA scored 4.18/5.0 on speaker similarity and 3.78/5.0 on naturalness, placing second overall — ahead of several systems trained on significantly more data.

Potential Applications

On-device deployment: TADA is lightweight enough to run on mobile phones and edge devices without requiring cloud inference. For device manufacturers and app developers building voice interfaces, this means lower latency, better privacy, and no API dependency.

Long-form and conversational speech: TADA's synchronous tokenization is dramatically more context-efficient than existing approaches. Where a conventional system exhausts a 2048-token context window in about 70 seconds of audio, TADA can accommodate roughly 700 seconds in the same budget. This opens the door to long-form narration, extended dialogue, and multi-turn voice interactions.
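The context-budget claim above checks out with simple division. The conventional rate used here (~29 audio tokens/s) is an assumption implied by the 70-second figure; TADA's ~3 tokens/s comes from the rates quoted earlier in the post.

```python
def seconds_per_window(context_tokens: int, tokens_per_second: float) -> float:
    """How much audio fits in a fixed context window at a given token rate."""
    return context_tokens / tokens_per_second

print(seconds_per_window(2048, 29.0))  # ~70 s for a conventional system
print(seconds_per_window(2048, 3.0))   # ~683 s (roughly 700) for TADA
```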

Production reliability: The absence of hallucinations in our tests suggests fewer edge cases to catch, fewer customer complaints, and less post-processing overhead in the product. This makes TADA well suited for deploying voice in regulated or sensitive environments like healthcare, finance, and education.

Limitations and Future Work

Long-form degradation: While the model supports more than 10 minutes of context, we noticed occasional cases of speaker drift during long generations. Our online rejection sampling strategy reduces this significantly, but it's not fully resolved. We suggest resetting the context as an intermediate workaround.

The modality gap: When the model generates text alongside speech, language quality drops relative to text-only mode. We introduce Speech Free Guidance (SFG), a technique that blends logits from text-only and text-speech inference modes to help close this gap, but more work is required.
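A plausible minimal sketch of the logit-blending idea behind SFG is shown below. The interpolation form and the guidance weight are assumptions on our part; the post does not give the exact formula.

```python
def sfg_logits(text_speech_logits, text_only_logits, guidance=0.3):
    """Blend next-token logits from the joint text-speech pass with logits
    from a text-only pass, pulling the distribution toward the (stronger)
    text-only language model. `guidance` is a hypothetical weight."""
    return [(1 - guidance) * joint + guidance * text
            for joint, text in zip(text_speech_logits, text_only_logits)]

# Equal-weight blend of two toy logit vectors.
print(sfg_logits([0.0, 2.0], [2.0, 0.0], guidance=0.5))  # [1.0, 1.0]
```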

Use-cases: The model is only pre-trained on speech continuation; further fine-tuning is required for assistant scenarios. Get in touch to inquire about Hume's extensive library of fine-tuning data.

Scale: The current release covers English and seven additional languages, so there's clear room to expand. We're training larger models with broader language coverage using Hume AI data.

We're releasing TADA because we believe this architecture opens a productive direction for the field, and we want to accelerate progress. We invite researchers and developers to build on this work — whether that means extending the tokenizer to new modalities, solving the long-context problem, or adapting the framework for new applications.

Get Started

TADA is available now under an open-source license. We're releasing 1B and 3B parameter Llama-based models and the full audio tokenizer and decoder.

1B (English): huggingface.co/HumeAI/tada-1b

3B (multilingual): huggingface.co/HumeAI/tada-3b-ml

Demo: huggingface.co/spaces/HumeAI/tada

GitHub: github.com/HumeAI/tada

arXiv: https://arxiv.org/abs/2602.23068

Hume builds voice AI research infrastructure for frontier labs and AI-first enterprises. If you're working on voice models and need high-quality training data, evaluation systems, or reinforcement learning infrastructure, get in touch at [email protected].
