Sesame CSM: A Conversational Speech Generation Model

Original link: https://github.com/SesameAILabs/csm

On March 13, 2025, Sesame AI Labs released the conversational speech model CSM-1B, now available on Hugging Face. CSM generates RVQ audio codes from text and audio inputs using a Llama backbone and the Mimi audio decoder. A demo is available as a Hugging Face Space. Running CSM requires a CUDA-compatible GPU, Python 3.10 (or newer), and possibly ffmpeg. After cloning the repository and installing the dependencies (including triton, or triton-windows on Windows), you can generate audio from text, and you can supply context segments for more natural-sounding conversation. The model is not fine-tuned to any specific voice, but it can produce a variety of voices. CSM is an audio generation model, not a general-purpose LLM; use a separate LLM for text generation. It has some capacity for non-English languages, but performance is limited. Responsible use is required: impersonating others, spreading misinformation, and any illegal activity are strictly prohibited. The model is intended for research and educational purposes only.


Original text:

    2025/03/13 - We are releasing the 1B CSM variant. The checkpoint is hosted on Hugging Face.


    CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.
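
    The split of labor can be pictured as a two-stage loop per audio frame: the backbone predicts the zeroth RVQ codebook, and the lightweight decoder fills in the remaining codebooks. The function below is a hypothetical schematic for intuition only, not the actual CSM code; the names, the greedy sampling, and the codebook count are all assumptions.

    # Hypothetical schematic of CSM-style two-stage RVQ generation.
    # `backbone` and `decoder` are stand-in callables, not the real modules.
    import torch
    
    NUM_CODEBOOKS = 32  # assumed; check Mimi's configuration for the real value
    
    def generate_frame(backbone, decoder, token_history):
        # Stage 1: the Llama-style backbone reads the interleaved text/audio
        # history and predicts logits for the zeroth codebook of this frame.
        hidden, logits0 = backbone(token_history)
        codes = [torch.argmax(logits0, dim=-1)]
        # Stage 2: the smaller decoder predicts each remaining codebook,
        # conditioned on the backbone state and the codes chosen so far.
        for level in range(1, NUM_CODEBOOKS):
            logits = decoder(hidden, torch.stack(codes, dim=-1), level)
            codes.append(torch.argmax(logits, dim=-1))
        # One frame of RVQ codes; Mimi decodes the full sequence to a waveform.
        return torch.stack(codes, dim=-1)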

    A fine-tuned variant of CSM powers the interactive voice demo shown in our blog post.

    A hosted Hugging Face space is also available for testing audio generation.

    Requirements:

    • A CUDA-compatible GPU
    • The code has been tested on CUDA 12.4 and 12.6, but it may also work on other versions
    • Similarly, Python 3.10 is recommended, but newer versions may be fine
    • For some audio operations, ffmpeg may be required
    • Access to the following Hugging Face models: Llama-3.2-1B and CSM-1B
    git clone git@github.com:SesameAILabs/csm.git
    cd csm
    python3.10 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
    
    # You will need access to CSM-1B and Llama-3.2-1B
    huggingface-cli login

    The triton package cannot be installed on Windows; use pip install triton-windows instead.

    Generate a sentence

    from generator import load_csm_1b
    import torchaudio
    import torch
    
    # Pick the best available device: Apple Silicon (MPS), then CUDA, then CPU.
    if torch.backends.mps.is_available():
        device = "mps"
    elif torch.cuda.is_available():
        device = "cuda"
    else:
        device = "cpu"
    
    generator = load_csm_1b(device=device)
    
    # Generate up to 10 seconds of audio for one utterance from speaker 0,
    # with no conversational context.
    audio = generator.generate(
        text="Hello from Sesame.",
        speaker=0,
        context=[],
        max_audio_length_ms=10_000,
    )
    
    # The result is a 1-D tensor; add a channel dimension before saving.
    torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)

    CSM sounds best when provided with context. You can prompt or provide context to the model by passing a Segment (a transcript, a speaker id, and the matching audio) for each prior utterance.

    from generator import Segment
    
    # Prior conversation: speaker ids, transcripts, and the matching recordings.
    speakers = [0, 1, 0, 0]
    transcripts = [
        "Hey how are you doing.",
        "Pretty good, pretty good.",
        "I'm great.",
        "So happy to be speaking to you.",
    ]
    audio_paths = [
        "utterance_0.wav",
        "utterance_1.wav",
        "utterance_2.wav",
        "utterance_3.wav",
    ]
    
    def load_audio(audio_path):
        # Load a clip and resample it to the generator's sample rate.
        audio_tensor, sample_rate = torchaudio.load(audio_path)
        audio_tensor = torchaudio.functional.resample(
            audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
        )
        return audio_tensor
    
    # One Segment per prior utterance: transcript, speaker id, and audio.
    segments = [
        Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
        for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
    ]
    audio = generator.generate(
        text="Me too, this is some cool stuff huh?",
        speaker=1,
        context=segments,
        max_audio_length_ms=10_000,
    )
    
    torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
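
    To carry a conversation across several turns, one pattern is to append each generated utterance back into the context before producing the next one. The loop below is a sketch built only from the Segment and generator.generate calls shown above; the turn texts and speaker ids are made up for illustration.

    # Sketch: grow the context turn by turn so each new utterance is
    # conditioned on everything said so far. Turns are illustrative.
    turns = [
        (0, "Anyway, what are you up to this weekend?"),
        (1, "Probably hiking, if the weather holds."),
        (0, "Nice, take some pictures for me."),
    ]
    
    context = list(segments)  # start from the conversation built above
    for i, (speaker, text) in enumerate(turns):
        audio = generator.generate(
            text=text,
            speaker=speaker,
            context=context,
            max_audio_length_ms=10_000,
        )
        # Feed the new utterance back in as context for the next turn.
        context.append(Segment(text=text, speaker=speaker, audio=audio))
        torchaudio.save(f"turn_{i}.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)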

    FAQ

    Does this model come with any voices?

    The model open-sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.
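
    Because the base model follows the voices it hears in context, you can approximate a consistent voice by prompting: pass a short reference recording and its transcript as a context Segment, then reuse the same speaker id (see the usage policy below; only use voices you have consent to clone). A minimal sketch, assuming a local clip named my_voice.wav and the load_audio helper from above:

    # Sketch: steer the base model toward a reference voice via a context
    # Segment. "my_voice.wav" and its transcript are assumptions.
    voice_prompt = Segment(
        text="This is roughly what my voice sounds like.",
        speaker=0,
        audio=load_audio("my_voice.wav"),
    )
    
    audio = generator.generate(
        text="And this sentence should come out in a similar voice.",
        speaker=0,  # reuse the prompt's speaker id
        context=[voice_prompt],
        max_audio_length_ms=10_000,
    )
    torchaudio.save("prompted.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)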

    Can I converse with the model?

    CSM is trained to be an audio generation model and not a general-purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.
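
    One way to wire this up is to let a text-only LLM write the reply and hand the string to CSM. A minimal sketch, assuming the Hugging Face transformers text-generation pipeline and the Llama-3.2-1B-Instruct model (any chat LLM would do):

    # Sketch: pair a text LLM with CSM. The model id, prompt, and output
    # parsing are assumptions; substitute whatever LLM you have access to.
    from transformers import pipeline
    
    chat = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")
    messages = [{"role": "user", "content": "Greet me in one short sentence."}]
    reply = chat(messages, max_new_tokens=50)[0]["generated_text"][-1]["content"]
    
    # CSM then speaks the LLM's reply.
    audio = generator.generate(
        text=reply,
        speaker=0,
        context=[],
        max_audio_length_ms=10_000,
    )
    torchaudio.save("reply.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)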

    Does it support other languages?

    The model has some capacity for non-English languages due to data contamination in the training data, but it likely won't do well.

    Misuse and abuse ⚠️

    This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we explicitly prohibit the following:

    • Impersonation or Fraud: Do not use this model to generate speech that mimics real individuals without their explicit consent.
    • Misinformation or Deception: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
    • Illegal or Harmful Activities: Do not use this model for any illegal, harmful, or malicious purposes.

    By using this model, you agree to comply with all applicable laws and ethical guidelines. We are not responsible for any misuse, and we strongly condemn unethical applications of this technology.


    Authors

    Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.
