SaySynth: A Brief History of Speaking Machines

Original link: https://brian.abelson.live/log/2025/12/20/saysynth-composition-codes.html

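SaySynth is a creative project that turns macOS’s long-standing `say` command (a text-to-speech framework) into a musical synthesizer. By tapping a hidden, low-level phoneme domain-specific language (DSL), the tool lets users control pitch and duration at a granular level, treating speech as raw audio material. The project is set against the history of speaking machines, which the author sorts into mechanical, formant-based, concatenative, and generative systems. Two themes recur throughout that history: singing as a benchmark for humanness, and the persistent feminization of synthetic voices, which masks the invisible labor behind them. The author argues that modern AI, in prioritizing flawless naturalness, often strips away the “weirdness” that defines creative expression. Where commercial text-to-speech (TTS) strives for a seamless, efficient simulation of a human, SaySynth embraces “the sound of failure.” By pushing the tool beyond its intended design, the author shows that the texture of a machine’s limitations is often more expressive than the polished output of modern algorithms. Ultimately, SaySynth is a protest against the standardization of the human voice, finding beauty in the glitches that capitalism tries to erase.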


These are expanded notes from a talk I gave at composition.codes on December 21, 2025. Slides here. Video here.

SaySynth is a synthesizer I built on top of macOS’s text-to-speech framework — more popularly known as the say command. But to explain why I built it and why I think it matters, I want to take a detour through the history of speaking machines more broadly.

A Typology of Speaking Machines

There are roughly four kinds of speaking machines that have existed over time:

Mechanical — Literally physical: bellows forcing air through a reed, with different knobs, valves, and whistles shaping different formants and phonemes. The human operator is part of the instrument.

Formant/Rule-Based — More like a synthesizer: an oscillator and a comb filter simulating the resonant shape of the vocal tract. The system models the acoustics of speech without recording any actual speech.

Sample-Based (Concatenative) — From something as crude as a toy with a phonograph inside, all the way to sophisticated “diphone” synthesizers that splice together recordings of every possible phoneme transition. GPS voices and automated customer service phone lines of the ’90s and 2000s were built this way.

Generative (Neural/AI) — What most people think of today. These are basically sample-based systems taken to an extreme: instead of recordings of phoneme pairs, you’re dealing with individual digital samples predicted by a neural network, sample by sample.
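To make the formant/rule-based category above a little more concrete, here is a minimal sketch in Python of the source-filter idea: a buzzy oscillator stands in for the vocal folds, and a few resonant filters stand in for the vocal tract. This is not how SaySynth or macOS’s say command works internally. It uses a parallel bank of two-pole resonators rather than the comb filter mentioned above, and the function names, sample rate, and formant values (rough textbook figures for an “ah” vowel) are all assumptions chosen for illustration.

```python
import math
import struct
import wave

SAMPLE_RATE = 16_000  # assumed sample rate, chosen for illustration


def sawtooth(freq_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Naive sawtooth oscillator in [-1, 1): a crude stand-in for the glottal source."""
    n = int(duration_s * sample_rate)
    return [2.0 * ((i * freq_hz / sample_rate) % 1.0) - 1.0 for i in range(n)]


def resonator(signal, center_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Two-pole resonant filter that emphasizes one formant region."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, always < 1
    a1 = 2.0 * r * math.cos(2.0 * math.pi * center_hz / sample_rate)
    a2 = -r * r
    gain = 1.0 - r  # rough level scaling, not an exact normalization
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = gain * x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synth_vowel(formants, pitch_hz=110.0, duration_s=0.5):
    """Sum the source filtered through each formant resonator, then peak-normalize."""
    source = sawtooth(pitch_hz, duration_s)
    mixed = [0.0] * len(source)
    for center_hz, bandwidth_hz in formants:
        band = resonator(source, center_hz, bandwidth_hz)
        mixed = [m + b for m, b in zip(mixed, band)]
    peak = max(abs(s) for s in mixed) or 1.0
    return [s / peak for s in mixed]


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono 16-bit PCM so the result can be auditioned."""
    frames = struct.pack("<%dh" % len(samples), *(int(s * 32767) for s in samples))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(sample_rate)
        f.writeframes(frames)


# Approximate (center, bandwidth) pairs in Hz for an "ah" vowel: illustrative values only.
AH_FORMANTS = [(730.0, 90.0), (1090.0, 110.0), (2440.0, 170.0)]
samples = synth_vowel(AH_FORMANTS)
```

Calling `write_wav("ah.wav", samples)` saves the result for listening. Swapping in different formant pairs changes the vowel color, and changing `pitch_hz` changes the note, which is the sense in which such a system models the acoustics of speech without recording any.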