(Comments)

Original link: https://news.ycombinator.com/item?id=43754124

Dia, a new open-weights, 1.6B-parameter dialogue generation model, was announced on Hacker News by its developers, Toby and Jay. Unlike traditional TTS models that synthesize speech turn by turn and stitch the clips together, Dia generates an entire conversation in a single pass. This approach aims to produce faster, more natural-sounding dialogue, and it supports audio prompts to keep voice and emotional style consistent. Inspired by NotebookLM's podcast feature, the developers built the model from scratch, drawing heavily on SoundStorm and Parakeet. They plan to release a technical report to share what they learned and to encourage further research. The project received positive feedback; users asked about the model's stability, particularly accent consistency and its handling of specialized vocabulary such as medical terminology. One user pointed out a naming conflict with GNOME Dia, an existing open-source diagramming application; the developers acknowledged this and said they would clarify the distinction. They also encouraged open-source contributions.


Original text
Hacker News
Show HN: Dia, an open-weights TTS model for generating realistic dialogue (github.com/nari-labs)
21 points by toebee 23 minutes ago | 5 comments

Why does it say "join waitlist" if it's already available?

Also, you don't need to explicitly create and activate a venv if you're using uv - it deals with that nonsense itself. Just `uv sync`.
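For reference, a minimal setup sketch along the lines the commenter suggests (the `app.py` entry point is an assumption for illustration; check the Dia README for the actual run command):

```shell
# uv creates and manages the project virtualenv itself, so there is no
# manual `python -m venv` / `source .venv/bin/activate` step.
git clone https://github.com/nari-labs/dia.git
cd dia
uv sync        # resolve dependencies into a project-local .venv
uv run app.py  # hypothetical entry point; runs inside that venv automatically
```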



Impressive project! We'd love to use something like this over at Delfa (https://delfa.ai). How does this hold up from the perspective of stability? I've spoken to various folks working on voice models, and one thing that has consistently kept Eleven Labs ahead of the pack, in my experience, is that their models mostly avoid (albeit are not immune to) accent shifts and distortions when confronted with unfamiliar medical terminology.

A high quality, affordable TTS model that can consistently nail medical terminology while maintaining an American accent has been frustratingly elusive.



Hey HN! We’re Toby and Jay, creators of Dia. Dia is a 1.6B-parameter open-weights model that generates dialogue directly from a transcript.

Unlike TTS models that generate each speaker turn and stitch them together, Dia generates the entire conversation in a single pass. This makes it faster, more natural, and easier to use for dialogue generation.
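To make the contrast concrete, here is a toy sketch (not Dia's code) of the per-turn stitch-together pipeline that single-pass generation replaces: each speaker turn is synthesized separately and then concatenated with a fixed silence gap, which is where unnatural pacing tends to creep in.

```python
def stitch_turns(turn_clips, sr=44100, gap_s=0.25):
    """Concatenate separately generated per-turn audio clips (lists of
    float samples), inserting a fixed silence gap between turns — the
    traditional TTS dialogue pipeline, not Dia's single-pass approach."""
    gap = [0.0] * int(sr * gap_s)
    out = []
    for i, clip in enumerate(turn_clips):
        out.extend(clip)
        if i < len(turn_clips) - 1:
            out.extend(gap)  # fixed pause: turn-taking rhythm is not learned
    return out

# Two fake one-second "turns" at 44.1 kHz.
turn_a = [0.0] * 44100
turn_b = [0.1] * 44100
audio = stitch_turns([turn_a, turn_b])
print(len(audio))  # 44100 + 11025 + 44100 = 99225 samples
```

A single-pass model instead generates the whole conversation jointly, so pacing and prosody across speaker turns come from the model rather than from a hard-coded gap.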

It also supports audio prompts — you can condition the output on a specific voice/emotion and it will continue in that style.

Demo page comparing it to ElevenLabs and Sesame-1B: https://yummy-fir-7a4.notion.site/dia

We started this project after falling in love with NotebookLM’s podcast feature. But over time, the voices and content started to feel repetitive. We tried to replicate the podcast-feel with APIs but it did not sound like human conversations.

So we decided to train a model ourselves. We had no prior experience with speech models and had to learn everything from scratch — from large-scale training to audio tokenization. It took us a bit over 3 months.

Our work is heavily inspired by SoundStorm and Parakeet. We plan to release a lightweight technical report to share what we learned and accelerate research.

We’d love to hear what you think! We are a tiny team, so open source contributions are extra-welcomed. Please feel free to check out the code, and share any thoughts or suggestions with us.



Just in case: another open-source project uses the same name: https://wiki.gnome.org/Apps/Dia/

https://gitlab.gnome.org/GNOME/dia



Thanks for the heads-up! We weren’t aware of the GNOME Dia project. Since we focus on speech AI, we’ll make sure to clarify that distinction.