Neutts-air – Open-source, on-device TTS

Original link: https://github.com/neuphonic/neutts-air

## NeuTTS Air: On-Device, Realistic Voice AI

Neuphonic's NeuTTS Air is a text-to-speech (TTS) model that brings state-of-the-art voice AI directly to your device (phone, laptop, even a Raspberry Pi) without relying on web APIs. Built on a 0.5B-parameter LLM backbone, it delivers natural-sounding speech, real-time performance, and built-in safety features such as watermarking. Key features include best-in-class realism for its size, instant voice cloning from as little as 3 seconds of audio, and GGML-format builds optimized for on-device use. It supports English, uses the NeuCodec neural audio codec to achieve high-quality audio at low bitrates, and handles roughly 30 seconds of audio within a 2048-token context window. NeuTTS Air is available on HuggingFace; it requires dependencies such as `espeak`, with optional `llama-cpp-python` or `onnxruntime` for best performance. The project emphasizes responsible use and provides clear guidelines for preparing reference audio for cloning and for minimizing latency.

## NeuTTS Air: Open-Source TTS, Promise and Limitations

NeuTTS Air is a new open-source text-to-speech (TTS) project drawing attention for its ability to run on-device, even on constrained hardware such as Raspberry Pis and phones, thanks to its small LLM (Qwen) backbone. Users report successful voice cloning but run into limits on long generations: clips cut off well before the advertised 30-second context window. Performance can also be slow, with a 4-second clip reportedly taking over 16 seconds to generate on powerful hardware. The discussion highlights how difficult it is to compare the rapid stream of new TTS models, and raises concerns that some are repackagings of existing work. While the demos are impressive, users stress the need to evaluate output quality independently. A key dependency, *espeak*, raises licensing questions (GPL-3.0) for commercial use. Despite these issues, there is excitement about the potential of realistic TTS without API costs, and hope for a viable open-source TTS app on Android, with SherpaTTS mentioned as a current option.

HuggingFace 🤗: Model, Q8 GGUF, Q4 GGUF, Spaces

neutts-demo.mp4

Created by Neuphonic - building faster, smaller, on-device voice AI

State-of-the-art Voice AI has been locked behind web APIs for too long. NeuTTS Air is the world’s first super-realistic, on-device, TTS speech language model with instant voice cloning. Built off a 0.5B LLM backbone, NeuTTS Air brings natural-sounding speech, real-time performance, built-in security and speaker cloning to your local device - unlocking a new category of embedded voice agents, assistants, toys, and compliance-safe apps.

  • 🗣Best-in-class realism for its size - produces natural, ultra-realistic voices that sound human
  • 📱Optimised for on-device deployment - provided in GGML format, ready to run on phones, laptops, or even Raspberry Pis
  • 👫Instant voice cloning - create your own speaker with as little as 3 seconds of audio
  • 🚄Simple LM + codec architecture built off a 0.5B backbone - the sweet spot between speed, size, and quality for real-world applications

NeuTTS Air is built off Qwen 0.5B - a lightweight yet capable language model optimised for text understanding and generation - as well as a powerful combination of technologies designed for efficiency and quality:

  • Supported Languages: English
  • Audio Codec: NeuCodec - our 50hz neural audio codec that achieves exceptional audio quality at low bitrates using a single codebook
  • Context Window: 2048 tokens, enough for processing ~30 seconds of audio (including prompt duration)
  • Format: Available in GGML format for efficient on-device inference
  • Responsibility: Watermarked outputs
  • Inference Speed: Real-time generation on mid-range devices
  • Power Consumption: Optimised for mobile and embedded devices
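The context-window figure above can be sanity-checked with back-of-envelope arithmetic (a sketch, assuming one codec token per 50 Hz frame and ignoring text-token overhead):

```python
CODEC_FRAME_RATE_HZ = 50   # NeuCodec emits 50 codec tokens per second (single codebook)
CONTEXT_TOKENS = 2048      # total LM context window

# Suppose the reference prompt contributes ~10 s of audio; the remainder
# of the window is available for generated speech.
prompt_seconds = 10
prompt_tokens = prompt_seconds * CODEC_FRAME_RATE_HZ        # 500 tokens
remaining_tokens = CONTEXT_TOKENS - prompt_tokens           # 1548 tokens
max_generated_seconds = remaining_tokens / CODEC_FRAME_RATE_HZ
print(max_generated_seconds)  # about 31 s, consistent with the "~30 seconds" figure
```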
  1. Clone Git Repo

    git clone https://github.com/neuphonic/neutts-air.git
  2. Install espeak (required dependency)

    Please refer to the following link for instructions on how to install espeak:

    https://github.com/espeak-ng/espeak-ng/blob/master/docs/guide.md

    # Mac OS
    brew install espeak
    
    # Ubuntu/Debian
    sudo apt install espeak

    Mac users may need to put the following lines at the top of the neutts.py file.

    from phonemizer.backend.espeak.wrapper import EspeakWrapper
    _ESPEAK_LIBRARY = '/opt/homebrew/Cellar/espeak/1.48.04_1/lib/libespeak.1.1.48.dylib'  # adjust to your installed library path
    EspeakWrapper.set_library(_ESPEAK_LIBRARY)

    Windows users may need to set the following environment variables (see bootphon/phonemizer#163):

    $env:PHONEMIZER_ESPEAK_LIBRARY = "c:\Program Files\eSpeak NG\libespeak-ng.dll"
    $env:PHONEMIZER_ESPEAK_PATH = "c:\Program Files\eSpeak NG"
    setx PHONEMIZER_ESPEAK_LIBRARY "c:\Program Files\eSpeak NG\libespeak-ng.dll"
    setx PHONEMIZER_ESPEAK_PATH "c:\Program Files\eSpeak NG"
  3. Install Python dependencies

    The requirements file includes the dependencies needed to run the model with PyTorch. When using an ONNX decoder or a GGML model, some dependencies (such as PyTorch) are no longer required.

    Inference is tested on Python >= 3.11.

    pip install -r requirements.txt
    
  4. (Optional) Install llama-cpp-python to use the GGUF models.

    pip install llama-cpp-python
    

    To run llama-cpp-python with GPU (CUDA, MPS) support, please refer to: https://pypi.org/project/llama-cpp-python/

  5. (Optional) Install onnxruntime to use the ONNX codec decoder:

    pip install onnxruntime

Run the basic example script to synthesize speech:

python -m examples.basic_example \
  --input_text "My name is Dave, and um, I'm from London" \
  --ref_audio samples/dave.wav \
  --ref_text samples/dave.txt

To specify a particular model repo for the backbone or codec, add the --backbone argument. Available backbones are listed in the NeuTTS Air Hugging Face collection.
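For example, the basic script can be pointed at the quantized Q4 GGUF backbone ('neutts-air-q4-gguf', which requires llama-cpp-python to be installed):

```shell
python -m examples.basic_example \
  --input_text "My name is Dave, and um, I'm from London" \
  --ref_audio samples/dave.wav \
  --ref_text samples/dave.txt \
  --backbone neuphonic/neutts-air-q4-gguf
```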

Several examples are available, including a Jupyter notebook in the examples folder.

from neuttsair.neutts import NeuTTSAir
import soundfile as sf

tts = NeuTTSAir(
   backbone_repo="neuphonic/neutts-air",  # or 'neutts-air-q4-gguf' with llama-cpp-python installed
   backbone_device="cpu",
   codec_repo="neuphonic/neucodec",
   codec_device="cpu"
)

input_text = "My name is Dave, and um, I'm from London."

# The reference audio and its transcript define the voice to clone.
ref_audio_path = "samples/dave.wav"
with open("samples/dave.txt", "r") as f:
    ref_text = f.read().strip()

# Encode the reference once; the codes can be reused across generations.
ref_codes = tts.encode_reference(ref_audio_path)

wav = tts.infer(input_text, ref_codes, ref_text)
sf.write("test.wav", wav, 24000)

Preparing References for Cloning

NeuTTS Air requires two inputs:

  1. A reference audio sample (.wav file)
  2. A text string

The model then synthesises the text as speech in the style of the reference audio. This is what enables NeuTTS Air’s instant voice cloning capability.

You can find some ready-to-use samples in the examples folder:

  • samples/dave.wav
  • samples/jo.wav

Guidelines for Best Results

For optimal performance, reference audio samples should be:

  1. Mono channel
  2. 16-44 kHz sample rate
  3. 3–15 seconds in length
  4. Saved as a .wav file
  5. Clean — minimal to no background noise
  6. Natural, continuous speech — like a monologue or conversation, with few pauses, so the model can capture tone effectively
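The format-related items in this checklist are easy to verify programmatically. A minimal sketch using only the standard-library `wave` module (so it covers plain PCM .wav files; `check_reference` is a hypothetical helper, not part of the library):

```python
import wave

def check_reference(path):
    """Return a list of guideline violations for a reference .wav file."""
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        duration = w.getnframes() / rate

    issues = []
    if channels != 1:
        issues.append(f"expected mono, got {channels} channels")
    if not 16_000 <= rate <= 44_000:
        issues.append(f"sample rate {rate} Hz outside 16-44 kHz")
    if not 3.0 <= duration <= 15.0:
        issues.append(f"duration {duration:.1f} s outside 3-15 s")
    return issues
```

An empty list means the file meets the format guidelines; noise level and naturalness of the speech still need a human ear.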

Guidelines for Minimizing Latency

For optimal performance on-device:

  1. Use the GGUF model backbones
  2. Pre-encode references
  3. Use the onnx codec decoder
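Point 2 (pre-encoding references) amounts to caching the output of `encode_reference` so the cost is paid once per speaker rather than once per generation. A hypothetical disk-cache helper, assuming the encoded codes are picklable (`encode_fn` stands in for `NeuTTSAir.encode_reference`):

```python
import pickle
from pathlib import Path

def cached_codes(ref_audio_path, encode_fn, cache_dir="ref_cache"):
    """Encode a reference once and reuse the saved codes on later runs."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    key = cache / (Path(ref_audio_path).stem + ".pkl")
    if key.exists():
        return pickle.loads(key.read_bytes())
    codes = encode_fn(ref_audio_path)  # the expensive step, run at most once
    key.write_bytes(pickle.dumps(codes))
    return codes
```

With the real model this would be called as `cached_codes("samples/dave.wav", tts.encode_reference)`.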

Take a look at the examples README to get started.

Every audio file generated by NeuTTS Air includes a Perth (Perceptual Threshold) watermark.

Don't use this model to do bad things… please.

To contribute to this project, install the pre-commit hooks (assuming the standard pre-commit tool):

    pip install pre-commit

Then:

    pre-commit install
