Sopro TTS: A 169M model with zero-shot voice cloning that runs on the CPU

Original link: https://github.com/samuel-vitorino/sopro

## Sopro: A Lightweight Text-to-Speech Model

Sopro, Portuguese for "breath", is a 169-million-parameter text-to-speech (TTS) model created as a side project and trained on a single L40S GPU. Unlike many TTS models, Sopro uses dilated convolutions and cross-attention rather than a Transformer architecture.

Key features include streaming, zero-shot voice cloning (requiring 3-12 seconds of reference audio), and a 0.25 real-time factor on CPU (30 seconds of audio generated in 7.5 seconds). Installation is straightforward, but performance varies with the Torch version (2.6.0 is recommended).

While not state of the art, Sopro offers a functional TTS solution with tunable style-strength and stopping-behavior parameters. The developer notes that voice-cloning quality is sensitive to the audio input and recommends spelling out words rather than using abbreviations. A demo is available through a web interface and Docker. Future improvements may include a training-code release and expanded language support.

## Sopro TTS: Zero-Shot Voice Cloning on the CPU

Sopro TTS is a new 169-million-parameter text-to-speech model capable of zero-shot voice cloning, meaning it can imitate voices it was never specifically trained on. Developed by sammyyyyyyy and available on GitHub, the model runs on the CPU, so no powerful GPU is required.

Initial reactions were mixed. While some found the technology impressive, many users reported poor audio quality, describing it as noticeably worse than older computer voices or even "broken". The developer acknowledged that quality depends on the reference audio and encouraged experimenting with the parameters.

Discussion centered on potential uses, from assistive applications such as restoring a lost voice to tasks like automating audiobooks, as well as concerns about malicious uses such as scams and impersonation. The debate focused on whether the benefits outweigh the risks, and on whether "zero-shot" is an accurate term given that a voice sample is required.

Many users also mentioned alternatives such as Chatterbox-TTS-Server and ElevenLabs, along with open-source voice changers like RVC. The developer expressed willingness to improve the model if there is strong community interest.

## Original
[Demo video: sopro_readme.mp4]


Sopro (from the Portuguese word for “breath/blow”) is a lightweight English text-to-speech model I trained as a side project. Sopro is composed of dilated convs (à la WaveNet) and lightweight cross-attention layers, instead of the common Transformer architecture. Even though Sopro is not SOTA across most voices and situations, I still think it’s a cool project made with a very low budget (trained on a single L40S GPU), and it can be improved with better data.
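For intuition only, a block in this spirit might pair a causal dilated conv over audio features with a small cross-attention into text embeddings. This sketch is an assumption about the general shape of such a layer, not Sopro's actual code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedConvCrossAttnBlock(nn.Module):
    # Illustrative only: a WaveNet-style dilated causal conv followed by
    # lightweight cross-attention into text embeddings, in place of a
    # full self-attention Transformer layer.
    def __init__(self, dim, text_dim, dilation, heads=2):
        super().__init__()
        self.pad = 2 * dilation  # left padding for kernel_size=3 -> causal
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, dilation=dilation)
        self.attn = nn.MultiheadAttention(
            dim, heads, kdim=text_dim, vdim=text_dim, batch_first=True
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, text):
        # x: (batch, frames, dim) audio features; text: (batch, tokens, text_dim)
        h = F.pad(x.transpose(1, 2), (self.pad, 0))   # causal left-pad
        x = x + self.conv(h).transpose(1, 2)          # residual dilated conv
        attn_out, _ = self.attn(self.norm(x), text, text)
        return x + attn_out                           # residual cross-attention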

Some of the main features are:

  • 169M parameters
  • Streaming
  • Zero-shot voice cloning
  • 0.25 RTF on CPU (measured on an M3 base model), meaning it generates 30 seconds of audio in 7.5 seconds
  • 3-12 seconds of reference audio for voice cloning

I only pinned minimum dependency versions, so you can install the package without creating a separate env. However, some versions of Torch work better than others: on my M3 CPU, torch==2.6.0 (without torchvision) runs about 3× faster.
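To pin that Torch build explicitly, run this inside whatever env you install Sopro into (assuming the 2.6.0 wheel is available for your platform):

pip install torch==2.6.0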

(Optional)

conda create -n soprotts python=3.10
conda activate soprotts
git clone https://github.com/samuel-vitorino/sopro
cd sopro
pip install -e .

Then generate speech from the command line, cloning the voice in ref.wav:

soprotts \
  --text "Sopro is a lightweight 169 million parameter text-to-speech model. Some of the main features are streaming, zero-shot voice cloning, and 0.25 real-time factor on the CPU." \
  --ref_audio ref.wav \
  --out out.wav

You have the expected temperature and top_p parameters, alongside the following; an example invocation appears after the list:

  • --style_strength (controls the FiLM strength; increasing it can improve or reduce voice similarity; default 1.0)
  • --no_stop_head to disable early stopping
  • --stop_threshold and --stop_patience (number of consecutive frames that must be classified as final before stopping). For short sentences, the stop head may fail to trigger, in which case you can lower these values. Likewise, if the model stops before producing the full text, adjusting these parameters up can help.
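For example, to strengthen the style conditioning and make the stop head trigger more eagerly on a short sentence (the values here are illustrative assumptions, not tuned defaults):

soprotts \
  --text "A short test sentence." \
  --ref_audio ref.wav \
  --out out.wav \
  --style_strength 1.2 \
  --stop_threshold 0.3 \
  --stop_patience 2

The Python API mirrors the CLI. A non-streaming example: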
from sopro import SoproTTS

tts = SoproTTS.from_pretrained("samuel-vitorino/sopro", device="cpu")

wav = tts.synthesize(
    "Hello! This is a non-streaming Sopro TTS example.",
    ref_audio_path="ref.wav",
)

tts.save_wav("out.wav", wav)
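And the streaming variant, which yields audio chunks as they are generated: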
import torch
from sopro import SoproTTS

tts = SoproTTS.from_pretrained("samuel-vitorino/sopro", device="cpu")

chunks = []
for chunk in tts.stream(
    "Hello! This is a streaming Sopro TTS example.",
    ref_audio_path="ref.mp3",
):
    chunks.append(chunk.cpu())

wav = torch.cat(chunks, dim=-1)
tts.save_wav("out_stream.wav", wav)

Interactive streaming demo

[Screenshot of the interactive streaming demo]

After you install the sopro package:

pip install -r demo/requirements.txt
uvicorn demo.server:app --host 0.0.0.0 --port 8000

Or with docker:

docker build -t sopro-demo .
docker run --rm -p 8000:8000 sopro-demo

Navigate to http://localhost:8000 in your browser.


  • Sopro can be inconsistent, so mess around with the parameters until you get a decent sample.
  • Voice cloning is highly dependent on mic quality, ambient noise, etc. On more OOD voices it might fail to match the voice well.
  • Prefer spelled-out words instead of abbreviations and symbols, e.g. “1 + 2” → “1 plus 2”. That said, Sopro can generally read abbreviations like “CPU”, “TTS”, etc.
  • The streaming version is not bit-exact compared to the non-streaming version. For best quality, prioritize the non-streaming version.
  • If you use torchaudio to read or write audio, ffmpeg may be required. I recommend just using soundfile (see the sketch after this list).
  • I will publish the training code once I have time to organize it.
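A minimal soundfile round-trip (standard WAV in and out, no ffmpeg needed):

import soundfile as sf

# Read a reference clip: returns a numpy array plus its sample rate
data, sr = sf.read("ref.wav")

# Write audio back out; the same call works for a synthesized waveform
sf.write("out.wav", data, sr)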

Due to budget constraints, the dataset used for training was pre-tokenized and the raw audio was discarded (it took up a lot of space). Later in training, I could have used the raw audio to improve the speaker embedding / voice similarity, because some nuances of voice are lost when you compress it with a neural codec into a discrete space.

I didn't spend much time optimizing further, but there is still some room for improvement; for example, caching conv states (see the sketch below).
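To make "caching conv states" concrete, here is a minimal sketch (an assumed illustration, not Sopro's actual code) of a dilated causal conv that keeps its left context between streaming calls, so each new chunk is processed without recomputing past frames:

import torch
import torch.nn as nn

class CachedCausalConv1d(nn.Module):
    # Sketch: caches the last (kernel_size - 1) * dilation input frames
    # between calls, so streaming chunks need no recomputation.
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.cache = None  # left context: (batch, channels, pad)

    def forward(self, x):  # x: (batch, channels, new_frames)
        if self.cache is None:
            # First chunk: zero left context == standard causal left-padding
            self.cache = x.new_zeros(x.shape[0], x.shape[1], self.pad)
        x = torch.cat([self.cache, x], dim=-1)
        self.cache = x[..., -self.pad:].detach()
        return self.conv(x)  # output has exactly new_frames frames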

Currently, generation is limited to ~32 seconds (400 frames, i.e. ~12.5 frames per second). You can raise the limit, but the model generally hallucinates beyond that.

AI was used mainly for creating the web demo, organizing my messy code into this repo, ablations, and brainstorming.

I would love to support more languages and continue improving the model. If you like this project, consider buying me a coffee so I can buy more compute: https://buymeacoffee.com/samuelvitorino


