If you're not a screen reader user yourself, you might be surprised to learn that the text to speech technology used by most blind people hasn't changed in the last 30 years. While text to speech has taken the sighted world by storm in everything from personal assistants to GPS to telephone systems, the voices used by blind folks have remained mostly static. This is largely intentional. The needs of a blind text to speech user are vastly different from those of a sighted user. While sighted users prefer voices that are natural, conversational, and as human-like as possible, blind users tend to prefer voices that are fast, clear, predictable, and efficient. This results in a preference among blind users for voices that sound somewhat robotic but can be understood at high rates of speed, often upwards of 800 to 900 words per minute. For comparison, the speaking rate of an average person hovers around 200 to 250 words per minute.
Unfortunately, this difference in needs has resulted in blind people being left out of the explosion of text to speech advancement, and has caused many problems. First, the voice preferred by the majority of Western English-speaking blind users, called Eloquence, was last updated in 2003. While it is so overwhelmingly popular that even Apple was eventually pressured to add it to the iPhone, Mac, Apple TV, and Apple Watch, even they were forced to use an emulation layer. As Eloquence is a 32-bit voice last compiled in 2003, it cannot run in modern software without some sort of emulation or bridge. If the source code to Eloquence still exists, even large companies like Apple haven't managed to find it or compile it. As the NVDA screen reader moves from being a 32-bit application to a 64-bit one, keeping Eloquence running with it has been a challenge that I and many other community members have spent a lot of time and effort solving. The Eloquence libraries also have many known security issues, and anyone using them today is forced to understand and program around those issues, as Eloquence itself can never be updated or fixed. These stopgap solutions are ultimately untenable, and are likely to take us only so far. A better solution is urgently needed.
The second problem this has caused is for those who speak languages other than English. As most modern text to speech voices are created by and for sighted users, blind users often find that the voices available in less popular languages are inefficient, overly conversational, slow, and otherwise unsatisfactory. eSpeak NG is an open-source text to speech system that attempts to support hundreds of languages while meeting the needs of blind users, but it brings a different set of problems to the table. First, many of the languages it supports were added based on pronunciation rules taken from Wikipedia articles, without involving speakers of the language. Second, eSpeak NG is based directly on Speak, a text to speech system written by Jonathan Duddington in 1995 for Acorn computers running RISC OS, meaning that eSpeak users today continue to live with many of the design decisions made back in 1995 for an operating system that has all but disappeared. Third, looking at the eSpeak NG repository, it seems to have only one or two active maintainers. While this is obviously better than the zero active maintainers of Eloquence, it could still become a problem in the future.
These are the reasons that I'm always interested in advancements in text to speech, and am actively keeping my ears open for something that takes advantage of modern technology, while continuing to suit the needs of screen reader users like myself.
Over the holiday break, I decided to take a look at two modern AI-based text to speech systems and see if they could be added to NVDA. I chose these two models because they advertise themselves as fast, responsive, and able to run without a GPU. The first was Supertonic, and the second was Kitten TTS. As both models require 64-bit Python, I wrote the addons for the 64-bit alpha of NVDA. However, other than making development easier, this had little effect on the results.
Unfortunately, doing this work uncovered a number of issues that I believe are common to all of the modern AI-based text to speech systems, and that make them unsuitable for use in screen readers. The first issue is dependency bloat. In order to bundle these systems as NVDA addons, developers are required to include a vast multitude of large and complex Python packages: around 103 for Kitten TTS, and just over 30 for Supertonic. As the standard building and packaging methods for NVDA addons do not support specifying and building requirements, these dependencies need to be manually copied over and checked into any GitHub repositories, and cannot be automatically updated. Loading all of these dependencies directly into NVDA also causes the screen reader to load more slowly and use more system resources, and opens NVDA users up to any security issue in any of these libraries. As a screen reader needs access to the entire system, this is far from ideal.
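To make the packaging problem concrete, here's a rough sketch of the vendoring pattern such an addon ends up using. The folder layout and package names below are my own illustration rather than code from either addon, but the shape is the same: every dependency ships inside the addon and gets loaded straight into NVDA's process.

```python
# Illustrative sketch only: the folder name and package names here are
# hypothetical, but this is the general vendoring pattern an NVDA addon is
# pushed into when it cannot declare requirements in its build process.
import os
import sys

# The addon ships a "lib" folder containing hand-copied snapshots of every
# package the model needs (numpy, onnxruntime, and so on).
_LIB_DIR = os.path.join(os.path.dirname(__file__), "lib")

# Prepending that folder to sys.path loads every vendored package into NVDA's
# own process, so the screen reader pays their startup and memory cost and
# inherits any security issue they contain -- and nothing updates them.
if _LIB_DIR not in sys.path:
    sys.path.insert(0, _LIB_DIR)

# From here on, imports like `import onnxruntime` resolve to the vendored copies.
```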
The second issue is accuracy. These modern systems are developed to sound human, natural, and conversational. Unfortunately, this seems to come at the expense of accuracy. In my testing, both models had a tendency to skip words, read numbers incorrectly, chop off short utterances, and ignore prosody hints from text punctuation. Kitten TTS is slightly better here, as it uses a deterministic phonemizer (the same one used by eSpeak, actually) to determine the correct way to pronounce words, leaving only the generation of the speech itself up to AI. But nevertheless, Kitten TTS is still far from perfectly accurate. When it comes to use in a screen reader, skipping words or reading numbers incorrectly is unacceptable.
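To show what a deterministic front end buys you, here's a minimal sketch using the phonemizer package with its eSpeak backend, which is the general approach Kitten TTS takes; the exact options Kitten TTS passes are an assumption on my part.

```python
# Minimal sketch of deterministic grapheme-to-phoneme conversion with the
# "phonemizer" package and its eSpeak backend. The specific options below are
# my assumption, not Kitten TTS's verified configuration.
from phonemizer import phonemize

text = "Read 1,024 rows from the table."

# eSpeak's rules are fixed, so the same input always produces the same
# phoneme string: no sampling, and no chance of a silently dropped word.
phonemes = phonemize(
    text,
    language="en-us",
    backend="espeak",
    strip=True,
    preserve_punctuation=True,
    with_stress=True,
)
print(phonemes)

# Only the audio generated *from* these phonemes is left to the neural model,
# which is why Kitten TTS mispronounces less than fully end-to-end systems,
# while still not being accurate enough for screen reader use.
```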
The third issue is speed. Supertonic has the edge here, but even it is far too slow. Unlike older text to speech systems, Supertonic and Kitten TTS cannot begin generating speech until they have an entire chunk of text. Supertonic is the faster of the two because it can stream the resulting audio as it becomes available, whereas Kitten TTS cannot start speaking until all of the audio for the chunk has been generated. But for use in a screen reader, a text to speech system needs to begin producing audio as quickly as possible, rather than waiting for an entire phrase or sentence. Screen reader users jump through text quickly and frequently interrupt the speech, and thus require the text to speech system to be able to quickly discard and restart speech.
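The timing constraint is easier to see in code. The sketch below doesn't use NVDA's API or either model; it's a hypothetical simulation of the two properties a screen reader needs from a synthesizer: audio that starts flowing within milliseconds, and a cancel that throws away everything pending.

```python
# Hypothetical simulation (no NVDA or model code): a screen reader needs audio
# within milliseconds of calling speak(), and it needs cancel() to discard all
# pending audio instantly when the user presses a key.
import queue
import threading
import time

audio_queue: queue.Queue = queue.Queue()
cancel_event = threading.Event()

def generate_speech(text: str) -> None:
    """Pretend to synthesize text, emitting small audio chunks as they're ready."""
    for word in text.split():
        if cancel_event.is_set():
            break  # the user moved on; stop wasting time on this utterance
        time.sleep(0.01)  # stand-in for per-chunk synthesis time
        audio_queue.put(b"\x00" * 1600)  # fake 50 ms of 16 kHz mono PCM
    audio_queue.put(None)  # end-of-utterance marker

# An incremental synthesizer behaves like this thread: the first chunk arrives
# almost immediately. A model that must finish the whole utterance first delays
# that first chunk by the entire generation time instead.
threading.Thread(target=generate_speech, args=("the quick brown fox",)).start()

start = time.monotonic()
audio_queue.get()  # block until the first chunk of audio exists
print(f"time to first audio: {(time.monotonic() - start) * 1000:.0f} ms")

# Interruption: the user arrows to the next line, so everything pending is dropped.
cancel_event.set()
while not audio_queue.empty():
    audio_queue.get_nowait()
```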
The fourth and final issue is control. Older text to speech systems make it easy to change the pitch, speed, volume, breathiness, roughness, head size, and other parameters of the voice. This allows screen reader users to customize the voice to our exact needs, as well as offering the ability to change the characteristics of the voice in real time based on the formatting or other attributes of the text. AI text to speech models, being trained on data from a particular set of speakers, cannot offer this customization. Instead, they inherit the speaking speed, pitch, volume, and other characteristics that were present in the training data. Kitten TTS and Supertonic both offer basic speed control, but it is highly variable from voice to voice and utterance to utterance. This leads to a loss of functionality that many blind users depend on.
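For comparison, here's a hypothetical illustration (not NVDA's actual driver API, and not tied to any specific synthesizer) of the sort of parameter surface a classic voice exposes, with every field adjustable instantly and independently.

```python
# Hypothetical illustration only -- not NVDA's driver API and not any specific
# synthesizer. Classic voices expose a set of independent knobs like these,
# all of which can change in real time, mid-document.
from dataclasses import dataclass, replace

@dataclass
class ClassicVoiceParams:
    rate_wpm: int = 400   # words per minute; usable well past 800
    pitch: int = 65       # baseline pitch, 0-100
    inflection: int = 50  # how far pitch moves with punctuation and questions
    volume: int = 90      # 0-100
    head_size: int = 50   # formant scaling; changes the apparent size of the speaker
    roughness: int = 0    # adds a harsher vocal texture
    breathiness: int = 0  # adds aspiration noise

# A screen reader can flip these per event: raise pitch for capital letters,
# drop the rate back down when the user stops skimming, and so on.
skimming = ClassicVoiceParams(rate_wpm=850, inflection=30)
capital_letter = replace(skimming, pitch=80)
print(skimming, capital_letter, sep="\n")

# With a neural voice, most of these knobs simply don't exist: pitch, timbre,
# and inflection are baked into the training data, and the one "speed" control
# the two models expose stretches utterances by an unpredictable amount.
```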
If you'd like to experience these issues for yourself, feel free to follow the links above to my GitHub repositories. They offer ready-to-install addons that work with the 64-bit NVDA alphas.
I'm picking on Kitten TTS and Supertonic not because they're particularly bad offenders, but because they represent the current state of the art in AI text to speech when it comes to speed and size. Other models, like Kokoro, exhibit all of the same issues, only more so.
So what's the way forward for blind screen reader users? Sadly, I don't know. Modern text to speech research has little to no overlap with our requirements. Using Eloquence, the system that many blind people find best, is becoming increasingly untenable. eSpeak uses an odd architecture originally designed for computers in 1995, and has few maintainers. Blastbay Studios has done some interesting work to create a text to speech voice using modern design and technology that meets the requirements of blind users, but it's a closed-source product with a single maintainer, and it also suffers from a lack of pronunciation accuracy. In an ideal world, someone would re-implement Eloquence as a set of open-source libraries. However, doing so would require expertise in linguistics, digital signal processing, and audiology, as well as excellent programming abilities. My suspicion is that modernizing the text to speech stack preferred by blind power users is an effort that would require several million dollars of funding at minimum. Instead, we'll probably wind up having to settle for text to speech voices that are "good enough", while being nowhere near as fast and efficient as what we have currently. Personally, I intend to keep Eloquence limping along for as long as I can, until the layers of required emulation and bridges make real-time use impossible. Perhaps at that point AI will be good enough that it can be prompted to create a text to speech system that's up to our standards. Or, more hopefully, articles like this one may bring attention to the issues, and bring our community together to recognize the problems and find solutions.