TTS Still Sucks

Original link: https://duarteocarmo.com/blog/tts-still-sucks

The author runs a podcast that is generated automatically from blog posts using open-source text-to-speech (TTS) models, a self-imposed challenge. After using F5-TTS, they went looking for an upgrade and consulted the Artificial Analysis TTS leaderboard. Although Kokoro ranks first, it lacks voice cloning. Testing Fish Audio's S1-mini was disappointing: the open-source version is limited, reflecting a common strategy of drawing users toward the paid, more capable model. Chatterbox ended up being the best viable choice, but like other open-source TTS models it has limitations: short character limits (roughly 1,000-2,000), which cause hallucinations or speed problems on longer texts. The pipeline uses an LLM to generate the transcript and summary, chunks the text, runs Chatterbox in parallel Modal containers, and stitches the audio together. Improvements include Spotify availability and clickable show notes. Despite the progress, the author believes open-source TTS still lags behind proprietary systems in reliability and control, requiring workarounds like feeding in one sentence per line. The whole pipeline is open source and available on GitHub.

## Text-to-Speech Still Has Room to Improve

A recent article on the state of text-to-speech (TTS) sparked a Hacker News discussion, which suggests that while progress is being made, truly convincing TTS remains elusive. Current open-weight models are considered "okay" but lag behind closed, proprietary models such as Suno AI and Sora, which are approaching human-level quality, though even those still struggle with conversational flow. Some commenters noted that Google had already achieved high-quality TTS in 2017 but deliberately limited its release because of potential abuse (scams, disinformation). This highlights a key issue: the risk of misuse is holding back truly advanced, general-purpose TTS. While "superhuman" TTS is technically feasible, it is more likely to come from small independent projects, or to appear only after legal frameworks settle questions of liability for abuse. Users who tried local TTS solutions such as Microsoft VibeVoice found the quality inconsistent and the resource demands heavy. Ultimately, the consensus is that TTS is not yet solved, and paid services currently deliver the best results.

## Original Article

Or at least the open versions of it do. I have this very stupid rule. A couple of years ago I decided to turn this blog into a podcast. At the time, I made up a stupid rule: whatever model I use to clone my voice and generate article transcripts needs to be an open model.

Why? Because - as you might have figured by now - I like to make my life hard. The last version of the podcast generation engine was running on F5-TTS. It was fine. I still got some funny messages from people showing me the model completely hallucinating or squeaking here and there. But a year later - I was pretty sure there would be something incredibly better out there.

Now I’m not so sure.

The first step was to look for the best TTS models out there. Thankfully, Artificial Analysis now publishes a leaderboard with the "best" text-to-speech models. After filtering by my stupid rule of open models, we get the ranking below.

[Image: TTS leaderboard rankings]

At the top of the leaderboard is Kokoro. Kokoro is an amazing model! Especially for a modest 82 million (!) parameters and a mere 360 MB (!). However, like many models on this leaderboard, I can't use it, since it doesn't support voice cloning.

I started by looking at some of the stuff from Fish Audio. Their codebase now seems to support their new S1-mini model. When testing it, most of the emotion markers did not work, or were only available in their closed version. Neither did the breaks and long pauses. Also, the chunking parameter is completely unused throughout the codebase, so I'm not sure why it's there. It's a common business model nowadays: announce a state-of-the-art open model just to attract attention to your real, incredibly powerful gated model, the one you have to pay for.

My second-best option on the list was Chatterbox. This wave of TTS models comes with major limitations. They're all restricted to short character counts - around 1,000–2,000 characters, sometimes even less. Ask them to generate anything longer, and the voice starts hallucinating or speeds up uncontrollably.

[Audio samples in the original post compare XTTS-v2, F5-TTS, and Chatterbox (latest version).]
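Those character limits force a chunking step before synthesis. Here is a minimal sketch of that kind of chunker, not taken from the actual pipeline: the ~1,000-character budget and the naive sentence-splitting regex are assumptions.

```python
import re

# Assumed per-call budget, based on the ~1,000-2,000 character limits above.
CHAR_LIMIT = 1000


def chunk_text(text: str, limit: int = CHAR_LIMIT) -> list[str]:
    """Pack whole sentences into chunks that stay under the character limit."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # An over-long single sentence still becomes its own chunk.
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```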

The transcript generation process is straightforward. First, text gets extracted from my RSS feed and pre-processed by an LLM to make it more "readable". The LLM generates a transcript, a short summary, and a list of links for the show notes. We then chunk the transcript and fire that off to a bunch of parallel Modal containers where we run the Chatterbox TTS model. Once we get everything back, we stitch the wav files together, and voilà! The episode is ready. The hosting is an S3 bucket. Really, that’s what you are paying your podcast host for - it’s a lucrative business!
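To make the fan-out step concrete, here is a rough sketch of what a Modal setup like the one described above could look like. It is not the author's code: the container image contents, the GPU choice, and the `synthesize_chunk` placeholder (which stands in for the real Chatterbox voice-cloning call) are all assumptions; only the map-then-stitch shape mirrors the description.

```python
import io

import modal
from pydub import AudioSegment  # used locally to stitch the returned WAVs

app = modal.App("blog-to-podcast")

# Assumed container image; the real one installs Chatterbox and its dependencies.
image = modal.Image.debian_slim().pip_install("chatterbox-tts", "torchaudio")


@app.function(image=image, gpu="A10G", timeout=600)
def synthesize_chunk(chunk: str) -> bytes:
    """Placeholder for the actual Chatterbox voice-cloning call.

    In the real pipeline this would load the model, condition it on a
    reference recording of the author's voice, synthesize `chunk`, and
    return the resulting WAV bytes.
    """
    raise NotImplementedError("plug the Chatterbox generation call in here")


@app.local_entrypoint()
def main() -> None:
    # In the real pipeline these come from the LLM transcript plus the chunker.
    chunks = ["First chunk of the transcript...", "Second chunk..."]

    # .map() fans the chunks out to parallel containers and yields results
    # in input order, so the stitched episode stays in sequence.
    episode = AudioSegment.empty()
    for wav_bytes in synthesize_chunk.map(chunks):
        episode += AudioSegment.from_wav(io.BytesIO(wav_bytes))

    episode.export("episode.wav", format="wav")
```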

I also made some improvements to the podcast generation side of things. First of all, the podcast is now also available on Spotify. Additionally, I fixed the show notes so they now include nice clickable links in almost every podcast player. Looking at you, Apple, and your CDATA requirements!
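For the show-notes fix, the relevant detail is that some players (Apple Podcasts among them, per the CDATA remark above) only render HTML links when the description is wrapped in a CDATA block. A hedged sketch of building such an element, with a made-up helper and link list for illustration:

```python
from xml.sax.saxutils import escape


def render_show_notes(links: list[tuple[str, str]]) -> str:
    """Build an RSS <description> with clickable show-note links.

    Wrapping the HTML in a CDATA block keeps picky players (notably
    Apple Podcasts) from escaping the anchor tags into plain text.
    """
    items = "".join(
        f'<li><a href="{url}">{escape(title)}</a></li>' for title, url in links
    )
    return f"<description><![CDATA[<ul>{items}</ul>]]></description>"


print(render_show_notes([("TTS Still Sucks", "https://duarteocarmo.com/blog/tts-still-sucks")]))
```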

Some thoughts on the Chatterbox model. It's definitely better than F5-TTS. There are, however, some common annoyances with almost every open-source voice-cloning model. The first is the limited duration of the generated speech: anything over 1,000 characters starts hallucinating. The second is the lack of control. Some models have emotion tags, others have <pause> indicators, but almost every single one of them has been massively unreliable. To the point where I now split my text into one sentence per line and ship that off to the TTS model to make things a bit more reliable.
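As a rough illustration of that last workaround (again, not the author's exact code), the sentence-per-line preprocessing can be as simple as a naive regex split:

```python
import re


def one_sentence_per_line(text: str) -> str:
    """Rewrite a transcript so each sentence sits on its own line.

    A naive split on ., ! or ? followed by whitespace; feeding the model
    these short, line-sized inputs seems to keep it from drifting.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return "\n".join(s for s in sentences if s)
```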

So yes, on the one hand, TTS has come a long way. But compared to proprietary systems, open TTS still sucks.

Note: The RSS-to-podcast pipeline is open source and available in this GitHub repo if you want to re-use it.
