If you're not a screen reader user yourself, you might be surprised to learn that the text to speech technology used by most blind people hasn't changed in the last 30 years. While text to speech has taken the sighted world by storm in everything from personal assistants to GPS to telephone systems, the voices used by blind folks have remained mostly static. This is largely intentional. The needs of a blind text to speech user are vastly different from those of a sighted user. While sighted users prefer voices that are natural, conversational, and as human-like as possible, blind users tend to prefer voices that are fast, clear, predictable, and efficient. This results in a preference among blind users for voices that sound somewhat robotic but can be understood at high rates of speed, often upwards of 800 to 900 words per minute. For comparison, the speaking rate of an average person hovers around 200 to 250 words per minute.
Unfortunately, this difference in needs has resulted in blind people being left out of the explosion of text to speech advancement, and has caused many problems. First, the voice preferred by the majority of Western English-speaking blind users, called Eloquence, was last updated in 2003. While it is so overwhelmingly popular that even Apple was eventually pressured to add it to the iPhone, Mac, Apple TV, and Apple Watch, even they were forced to use an emulation layer. As Eloquence is a 32-bit voice last compiled in 2003, it cannot run in modern software without some sort of emulation or bridge. If the source code to Eloquence still exists, even large companies like Apple haven't managed to find it or compile it. As the NVDA screen reader moves from being a 32-bit application to a 64-bit one, keeping Eloquence running with it has been a challenge that I and many other community members have spent a lot of time and effort solving. The Eloquence libraries also have many known security issues, and anyone using them today is forced to understand and program around those issues, as Eloquence itself can never be updated or fixed. These stopgap solutions are ultimately untenable, and are likely to take us only so far. A better solution is urgently needed.
The second problem this has caused is for those who speak languages other than English. As most modern text to speech voices are created by and for sighted users, blind users often find that the voices available in less popular languages are inefficient, overly conversational, slow, and otherwise unsatisfactory. eSpeak NG is an open-source text to speech system that attempts to support hundreds of languages while meeting the needs of blind users, but it brings a different set of problems to the table. First, many of the languages it supports were added based on pronunciation rules taken from Wikipedia articles, without involving speakers of the language. Second, eSpeak NG is based directly on Speak, a text to speech system written by Jonathan Duddington in 1995 for Acorn computers running RISC OS, meaning that eSpeak users today continue to live with many of the design decisions made back in 1995 for an operating system that has all but disappeared. Third, looking at the eSpeak NG repository, it seems to have only one or two active maintainers. While this is obviously better than the zero active maintainers of Eloquence, it could still become a problem in the future.
These are the reasons that I'm always interested in advancements in text to speech, and am actively keeping my ears open for something that takes advantage of modern technology, while continuing to suit the needs of screen reader users like myself.
Over the holiday break, I decided to take a look at two modern AI-based text to speech systems and see if they could be added to NVDA. I chose these two models because they advertise themselves as fast, responsive, and able to run without a GPU. The first was Supertonic, and the second was Kitten TTS. As both models require 64-bit Python, I wrote the addons for the 64-bit alpha of NVDA. However, other than making development easier, this had little effect on the results.
Unfortunately, doing this work uncovered a number of issues that I believe are common to all of the modern AI-based text to speech systems, and that make them unsuitable for use in screen readers. The first issue is dependency bloat. In order to bundle these systems as NVDA addons, developers are required to include a vast multitude of large and complex Python packages: around 103 for Kitten TTS, and just over 30 for Supertonic. As the standard building and packaging methods for NVDA addons do not support specifying and building requirements, these dependencies need to be manually copied over and checked into any GitHub repositories, and cannot be automatically updated. Loading all of these dependencies directly into NVDA also causes the screen reader to load more slowly and use more system resources, and opens NVDA users up to any security issue in any of these libraries. As a screen reader needs access to the entire system, this is far from ideal.
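To make the packaging problem concrete, here's a rough sketch of the vendoring pattern such an addon ends up using. The folder layout and package names below are my own illustration rather than code from either addon, but the shape is the same: every dependency ships inside the addon and gets loaded straight into NVDA's process.

```python
# Illustrative sketch only: the folder name and package names here are
# hypothetical, but this is the general vendoring pattern an NVDA addon is
# pushed into when it cannot declare requirements in its build process.
import os
import sys

# The addon ships a "lib" folder containing hand-copied snapshots of every
# package the model needs (numpy, onnxruntime, and so on).
_LIB_DIR = os.path.join(os.path.dirname(__file__), "lib")

# Prepending that folder to sys.path loads every vendored package into NVDA's
# own process, so the screen reader pays their startup and memory cost and
# inherits any security issue they contain -- and nothing updates them.
if _LIB_DIR not in sys.path:
    sys.path.insert(0, _LIB_DIR)

# From here on, imports like `import onnxruntime` resolve to the vendored copies.
```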
The second issue is accuracy. These modern systems are developed to sound human, natural, and conversational. Unfortunately, this seems to come at the expense of accuracy. In my testing, both models had a tendency to skip words, read numbers incorrectly, chop off short utterances, and ignore prosody hints from text punctuation. Kitten TTS is slightly better here, as it uses a deterministic phonemizer (the same one used by eSpeak, actually) to determine the correct way to pronounce words, leaving only the generation of the speech itself up to AI. But nevertheless, Kitten TTS is still far from perfectly accurate. When it comes to use in a screen reader, skipping words or reading numbers incorrectly is unacceptable.
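To show what a deterministic front end buys you, here's a minimal sketch using the phonemizer package with its eSpeak backend, which is the general approach Kitten TTS takes; the exact options Kitten TTS passes are an assumption on my part.

```python
# Minimal sketch of deterministic grapheme-to-phoneme conversion with the
# "phonemizer" package and its eSpeak backend. The specific options below are
# my assumption, not Kitten TTS's verified configuration.
from phonemizer import phonemize

text = "Read 1,024 rows from the table."

# eSpeak's rules are fixed, so the same input always produces the same
# phoneme string: no sampling, and no chance of a silently dropped word.
phonemes = phonemize(
    text,
    language="en-us",
    backend="espeak",
    strip=True,
    preserve_punctuation=True,
    with_stress=True,
)
print(phonemes)

# Only the audio generated *from* these phonemes is left to the neural model,
# which is why Kitten TTS mispronounces less than fully end-to-end systems,
# while still not being accurate enough for screen reader use.
```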
The third issue is speed. Supertonic has the edge here, but even it is far too slow. Unlike older text to speech systems, Supertonic and Kitten TTS cannot begin generating speech until they have an entire chunk of text. Supertonic is the faster of the two because it can stream the resulting audio as it becomes available, whereas Kitten TTS cannot start speaking until all of the audio for the chunk has been generated. But for use in a screen reader, a text to speech system needs to begin producing audio as quickly as possible, rather than waiting for an entire phrase or sentence. Screen reader users jump through text quickly and frequently interrupt the speech, and thus require the text to speech system to be able to quickly discard and restart speech.
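The timing constraint is easier to see in code. The sketch below doesn't use NVDA's API or either model; it's a hypothetical simulation of the two properties a screen reader needs from a synthesizer: audio that starts flowing within milliseconds, and a cancel that throws away everything pending.

```python
# Hypothetical simulation (no NVDA or model code): a screen reader needs audio
# within milliseconds of calling speak(), and it needs cancel() to discard all
# pending audio instantly when the user presses a key.
import queue
import threading
import time

audio_queue: queue.Queue = queue.Queue()
cancel_event = threading.Event()

def generate_speech(text: str) -> None:
    """Pretend to synthesize text, emitting small audio chunks as they're ready."""
    for word in text.split():
        if cancel_event.is_set():
            break  # the user moved on; stop wasting time on this utterance
        time.sleep(0.01)  # stand-in for per-chunk synthesis time
        audio_queue.put(b"\x00" * 1600)  # fake 50 ms of 16 kHz mono PCM
    audio_queue.put(None)  # end-of-utterance marker

# An incremental synthesizer behaves like this thread: the first chunk arrives
# almost immediately. A model that must finish the whole utterance first delays
# that first chunk by the entire generation time instead.
threading.Thread(target=generate_speech, args=("the quick brown fox",)).start()

start = time.monotonic()
audio_queue.get()  # block until the first chunk of audio exists
print(f"time to first audio: {(time.monotonic() - start) * 1000:.0f} ms")

# Interruption: the user arrows to the next line, so everything pending is dropped.
cancel_event.set()
while not audio_queue.empty():
    audio_queue.get_nowait()
```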
The fourth and final issue is control. Older text to speech systems make it easy to change the pitch, speed, volume, breathiness, roughness, head size, and other parameters of the voice. This allows screen reader users to customize the voice to our exact needs, as well as offering the ability to change the characteristics of the voice in real time based on the formatting or other attributes of the text. AI text to speech models, being trained on data from a particular set of speakers, cannot offer this customization. Instead, they inherit the speaking speed, pitch, volume, and other characteristics that were present in the training data. Kitten TTS and Supertonic both offer basic speed control, but it is highly variable from voice to voice and utterance to utterance. This leads to a loss of functionality that many blind users depend on.
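For comparison, here's a hypothetical illustration (not NVDA's actual driver API, and not tied to any specific synthesizer) of the sort of parameter surface a classic voice exposes, with every field adjustable instantly and independently.

```python
# Hypothetical illustration only -- not NVDA's driver API and not any specific
# synthesizer. Classic voices expose a set of independent knobs like these,
# all of which can change in real time, mid-document.
from dataclasses import dataclass, replace

@dataclass
class ClassicVoiceParams:
    rate_wpm: int = 400   # words per minute; usable well past 800
    pitch: int = 65       # baseline pitch, 0-100
    inflection: int = 50  # how far pitch moves with punctuation and questions
    volume: int = 90      # 0-100
    head_size: int = 50   # formant scaling; changes the apparent size of the speaker
    roughness: int = 0    # adds a harsher vocal texture
    breathiness: int = 0  # adds aspiration noise

# A screen reader can flip these per event: raise pitch for capital letters,
# drop the rate back down when the user stops skimming, and so on.
skimming = ClassicVoiceParams(rate_wpm=850, inflection=30)
capital_letter = replace(skimming, pitch=80)
print(skimming, capital_letter, sep="\n")

# With a neural voice, most of these knobs simply don't exist: pitch, timbre,
# and inflection are baked into the training data, and the one "speed" control
# the two models expose stretches utterances by an unpredictable amount.
```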
If you'd like to experience these issues for yourself, feel free to follow the links above to my GitHub repositories. They offer ready-to-install addons that work with the 64-bit NVDA alphas.
I'm picking on Kitten TTS and Supertonic not because they're particularly bad offenders, but because they represent the current state of the art in AI text to speech when it comes to speed and size. Other models, like Kokoro, exhibit all of the same issues, only more so.
So what's the way forward for blind screen reader users? Sadly, I don't know. Modern text to speech research has little to no overlap with our requirements. Using Eloquence, the system that many blind people find best, is becoming increasingly untenable. eSpeak uses an odd architecture originally designed for computers in 1995, and has few maintainers. Blastbay Studios has done some interesting work to create a text to speech voice using modern design and technology that meets the requirements of blind users, but it's a closed-source product with a single maintainer, and it also suffers from a lack of pronunciation accuracy. In an ideal world, someone would re-implement Eloquence as a set of open-source libraries. However, doing so would require expertise in linguistics, digital signal processing, and audiology, as well as excellent programming abilities. My suspicion is that modernizing the text to speech stack preferred by blind power users is an effort that would require several million dollars of funding at minimum. Instead, we'll probably wind up having to settle for text to speech voices that are "good enough", while being nowhere near as fast and efficient as what we have currently. Personally, I intend to keep Eloquence limping along for as long as I can, until the layers of required emulation and bridges make real-time use impossible. Perhaps at that point AI will be good enough that it can be prompted to create a text to speech system that's up to our standards. Or, more hopefully, articles like this one may bring attention to the issues, and bring our community together to recognize the problems and find solutions.