Show HN: Dia, an open-weights TTS model for generating realistic dialogue

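Dia is a 1.6B-parameter open-weights text-to-dialogue model from Nari Labs. It generates realistic speech, and its output can be conditioned on audio to control emotion and tone. It generates dialogue directly from a transcript, including nonverbal cues such as laughter and coughing. Pretrained checkpoints and inference code are available on Hugging Face. A demo page compares Dia with ElevenLabs Studio and Sesame CSM-1B. To get started, clone the GitHub repository, set up a virtual environment, install the dependencies, and run the `app.py` script to launch a Gradio UI. The model requires a GPU (CUDA 12.6, PyTorch 2.0+) and about 10GB of VRAM. On an A4000 GPU it generates roughly 40 tokens per second. Dia is intended for research and educational use only. Misuse, including impersonating identities, producing deceptive content, and engaging in illegal activity, is strictly forbidden. Nari Labs is working on faster inference, CPU support, and quantization. Contributions are welcome; join the Discord server to take part in the discussion.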

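Dia, a new open-weights, 1.6B-parameter dialogue generation model, was posted on Hacker News by its developers, Toby and Jay. Unlike traditional TTS models that generate speech sentence by sentence and stitch the results together, Dia generates an entire conversation in one pass. This approach aims to produce faster, more natural-sounding dialogue, and it supports audio prompts to keep voice and emotional style consistent. Inspired by NotebookLM's podcast feature, the developers built the model from scratch, drawing heavily on SoundStorm and Parakeet. They plan to release a technical report sharing what they learned and encouraging further research. The project received positive feedback, with users asking about the model's stability, particularly accent consistency and its handling of specialized terminology such as medical terms. One user pointed out a naming conflict with GNOME Dia, an existing open-source diagramming application; the developers acknowledged this and said they would clarify the distinction. The developers encourage open-source contributions.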

Original text

Dia is a 1.6B parameter text to speech model created by Nari Labs.

Dia directly generates highly realistic dialogue from a transcript. You can condition the output on audio, enabling emotion and tone control. The model can also produce nonverbal communications like laughter, coughing, clearing throat, etc.

To accelerate research, we are providing access to pretrained model checkpoints and inference code. The model weights are hosted on Hugging Face.

We also provide a demo page comparing our model to ElevenLabs Studio and Sesame CSM-1B.

Join our Discord server for community support and access to new features.
Play with a larger version of Dia: generate fun conversations, remix content, and share with friends. 🔮 Join the waitlist for early access.

Run the following commands to open a Gradio UI that you can work in:

git clone https://github.com/nari-labs/dia.git
cd dia
python -m venv .venv
source .venv/bin/activate
pip install uv
uv run app.py

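import soundfile as sf

from dia.model import Dia

# Load the pretrained 1.6B checkpoint from Hugging Face.
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# [S1] and [S2] mark the speaker turns; a cue such as (laughs) asks for a nonverbal sound.
text = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."

# Generate audio for the whole dialogue in a single call.
output = model.generate(text)

# Save the result at a 44.1 kHz sample rate.
sf.write("simple.mp3", output, 44100)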
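When a script is stored as separate turns rather than one long string, the tagged transcript can be assembled programmatically. The following is a minimal sketch, not part of the dia package: `build_transcript` is a hypothetical helper, and it assumes that only the `[S1]` and `[S2]` tags seen in the example above are valid.

```python
from typing import Iterable, Tuple


def build_transcript(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker, line) pairs into a "[S1] ... [S2] ..." transcript.

    Hypothetical helper for illustration; it is not provided by dia.
    """
    parts = []
    for speaker, line in turns:
        # Assumption: only the two speaker tags shown in the example exist.
        if speaker not in (1, 2):
            raise ValueError(f"unsupported speaker id: {speaker}")
        parts.append(f"[S{speaker}] {line.strip()}")
    return " ".join(parts)


turns = [
    (1, "Dia is an open weights text to dialogue model."),
    (2, "You get full control over scripts and voices."),
    (1, "Wow. Amazing. (laughs)"),
]
text = build_transcript(turns)
print(text)
```

The resulting string has the same shape as the `text` value in the example and could be passed to `model.generate` in its place.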

A PyPI package and a working CLI tool will be available soon.

💻 Hardware and Inference Speed

Dia has been tested only on GPUs (PyTorch 2.0+, CUDA 12.6). CPU support is to be added soon. The initial run will take longer as the Descript Audio Codec also needs to be downloaded.

On enterprise GPUs, Dia can generate audio in real time. On older GPUs, inference will be slower. For reference, on an A4000 GPU, Dia generates roughly 40 tokens/s (86 tokens equal 1 second of audio). torch.compile will increase speeds on supported GPUs.
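To put the A4000 figure in perspective, the two numbers quoted above can be turned into a real-time factor. This is a small arithmetic sketch: the function names are illustrative, and only the constants 40 and 86 come from the text.

```python
# Figures quoted in the text above.
TOKENS_PER_AUDIO_SECOND = 86  # 86 tokens equal 1 second of audio
A4000_TOKENS_PER_SECOND = 40  # approximate generation rate on an A4000


def real_time_factor(tokens_per_second: float) -> float:
    """Seconds of audio produced per second of compute (1.0 or more is real time)."""
    return tokens_per_second / TOKENS_PER_AUDIO_SECOND


def generation_seconds(audio_seconds: float, tokens_per_second: float) -> float:
    """Estimated compute time needed to produce a clip of the given length."""
    return audio_seconds * TOKENS_PER_AUDIO_SECOND / tokens_per_second


rtf = real_time_factor(A4000_TOKENS_PER_SECOND)
print(f"real-time factor: {rtf:.2f}")
print(f"10 s clip: {generation_seconds(10, A4000_TOKENS_PER_SECOND)} s")
```

At 40 tokens/s the model produces a little under half a second of audio per second of compute, so by these figures a 10-second clip takes about 21.5 seconds on an A4000. Real-time generation would need at least 86 tokens/s.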

The full version of Dia requires around 10GB of VRAM to run. We will be adding a quantized version in the future.
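Before downloading the checkpoint, it can be useful to check whether the local GPU is likely to be large enough. The sketch below rests on two assumptions: "around 10GB" is read as 10 * 1024**3 bytes, and the card's total memory (not the memory currently free) is what gets compared. `has_enough_vram` and `check_local_gpu` are hypothetical helpers, while `torch.cuda.is_available` and `torch.cuda.get_device_properties` are standard PyTorch calls.

```python
from typing import Optional

REQUIRED_VRAM_GB = 10  # "around 10GB" from the text; the exact threshold is an assumption


def has_enough_vram(total_bytes: int, required_gb: float = REQUIRED_VRAM_GB) -> bool:
    """Return True if the given amount of memory meets the requirement."""
    return total_bytes >= required_gb * 1024**3


def check_local_gpu() -> Optional[bool]:
    """Check the first CUDA device; return None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # total_memory is the card's capacity in bytes, not the currently free amount.
    total = torch.cuda.get_device_properties(0).total_memory
    return has_enough_vram(total)


if __name__ == "__main__":
    print(check_local_gpu())
```

A result of `True` only means the card is large enough on paper; other processes using the GPU can still cause out-of-memory errors.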

If you don't have hardware available or if you want to play with bigger versions of our models, join the waitlist here.

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

This project offers a high-fidelity speech generation model intended solely for research and educational use. The following uses are strictly forbidden:

Identity Misuse: Do not produce audio resembling real individuals without permission.
Deceptive Content: Do not use this model to generate misleading content (e.g. fake news).
Illegal or Malicious Use: Do not use this model for activities that are illegal or intended to cause harm.

By using this model, you agree to uphold relevant legal standards and ethical responsibilities. We are not responsible for any misuse and firmly oppose any unethical usage of this technology.

Docker support.
Optimize inference speed.
Add quantization for memory efficiency.

We are a tiny team of 1 full-time and 1 part-time research engineers. Contributions are very welcome! Join our Discord server for discussions.
