Removing 'um' from a recording is harder than it sounds


Original link: https://doug.sh/posts/erm-a-local-cli-that-strips-ums-uhs-and-erms-from-speech/

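**erm** is a tool that automatically removes spoken filler words (um, uh, er) from voice recordings. The simple approach, transcribing with Whisper and cutting out the flagged tokens, tends to produce audible clicks and unnatural-sounding edits; *erm* addresses this with a multi-pass pipeline. To keep the result seamless, it combines several techniques:

* **Advanced detection:** Beyond the standard transcript, it scans the audio for fillers that Whisper dropped, fillers glued onto an adjacent word, and unnaturally elongated sounds.
* **Audio smoothing:** To prevent clicks, it aligns cut points with zero crossings of the waveform and uses variable-length crossfades.
* **Noise-floor matching:** It loops a sample of ambient room noise under the output so the background stays consistent and does not shift audibly at each cut.
* **Smart processing:** A "hybrid" mode runs detection on the original audio while cutting from a denoised copy, so no acoustic cues are lost.

*erm* is designed to preserve what the speaker actually said: it leaves repeated words and hesitation phrases alone. It runs locally for privacy and can be installed with `uvx` or `pip`.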

Hacker News discussion (doug.sh, submitted by dougcalobrisi, 11 points, 2 comments):

rindalir: This is fascinating! I'm going to try it on a clip from Jurassic Park.

dougcalobrisi: This post is mostly about how surprisingly hard it is to cleanly cut filler words out of speech. Removing an "um" is not a find-and-replace operation: Whisper's timestamps can be off by a few hundred milliseconds, and cutting right there clips syllables or leaves a stutter. So I built erm, which takes Whisper's guess as a starting point, finds where each word actually starts and ends in the audio, aligns the cut points to silence so there are no clicks, and stitches the result with ffmpeg. https://github.com/dougcalobrisi/erm

Original text

Linguists have a word for the ums, uhs, ers, and elongated versions (ummmm, uhhhhh) that pad spoken English: disfluencies.

I don’t record a lot of voice audio, but a few friends do, and they tell me editing those out by hand is miserable. So I built erm to do it.

The whole interface for the common case is one command: erm input.wav. It writes a cleaned .wav and a JSON cut list next to the input. This post walks through how it works, because the obvious approach doesn’t sound very good and most of the code is the stuff that fixes that.

You’d expect the job to be: transcribe with word-level timestamps, find tokens like um and uh, cut those ranges with ffmpeg.
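As a sketch, the obvious version is only a few lines of Python. The `Word` record, the `FILLERS` set, and both function names are made up for illustration; they are not erm's code, and the timestamps would come from whatever transcriber you use.

```python
from dataclasses import dataclass

FILLERS = {"um", "uh", "er", "erm"}  # assumed list, for illustration only


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def naive_cuts(words):
    """Return (start, end) ranges for every word whose text is a filler."""
    cuts = []
    for w in words:
        token = w.text.strip().lower().strip(".,!?")
        if token in FILLERS:
            cuts.append((w.start, w.end))
    return cuts


def keep_ranges(cuts, total):
    """Invert the cut list into the ranges of audio to keep."""
    keeps, cursor = [], 0.0
    for start, end in sorted(cuts):
        if start > cursor:
            keeps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total:
        keeps.append((cursor, total))
    return keeps


words = [Word(" So", 0.0, 0.3), Word(" um,", 0.3, 0.8), Word(" hello", 0.8, 1.2)]
cuts = naive_cuts(words)
print(cuts)                    # [(0.3, 0.8)]
print(keep_ranges(cuts, 1.2))  # [(0.0, 0.3), (0.8, 1.2)]
```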

That gets you maybe 60% of the way, and the result sounds worse than the original. Three reasons:

Whisper quietly leaves a lot of fillers out of the transcript, so there’s no um token to match in the first place.
Slicing audio at an arbitrary point in time produces a tiny step in the waveform. Your ear hears it as a click.
Even when the splice itself is clean, the background hiss before and after the cut doesn’t quite match, so you hear a faint shift at every edit.

Most of erm is the work of fixing those three things.

Whisper is OpenAI’s open-source speech-to-text model. You hand it audio, it hands you back a transcript, and with the right flag it’ll also tell you the start and end timestamp of every word. It runs locally, which is what makes a tool like this possible without sending your recordings anywhere.

erm uses faster-whisper, a reimplementation that’s several times faster than the reference one and uses less memory. Same model weights, same output, just a better runtime. The default is the medium.en model, which is a good speed/accuracy balance. You can override with --model if you want small.en (faster), but I’d actually reach for large-v3. It’s noticeably better at picking up fillers and worth the extra compute.

First, run Whisper. erm asks for word-level timestamps and gives it a small instruction up front telling it not to clean up the transcript. Whisper, left alone, will edit out fillers because most of its training transcripts are clean prose. Any word that comes back as a known filler (um, uh, er, etc.) is flagged for cutting. Elongated versions like ummmm get matched against the um stem on the fly.
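A minimal sketch of this first pass, under a few assumptions: the stem list and the prompt wording are invented here (the post only says "um, uh, er, etc." and "a small instruction"), and collapsing repeated letters is one simple way to match elongated fillers, not necessarily how erm does it. `WhisperModel.transcribe` with `word_timestamps` and `initial_prompt` is the faster-whisper API; running `filler_words` needs that package and a model download.

```python
import re

FILLER_STEMS = {"um", "uh", "er", "erm"}  # assumed stems, for illustration
PROMPT = "Um, uh, er... keep every filler exactly as spoken."  # hypothetical wording


def is_filler(token):
    """True if the token, with repeated letters collapsed, is a known stem."""
    letters = re.sub(r"[^a-z]", "", token.lower())
    collapsed = re.sub(r"(.)\1+", r"\1", letters)  # "ummmm" -> "um"
    return collapsed in FILLER_STEMS


def filler_words(path, model_name="medium.en"):
    """Transcribe with word timestamps; return (text, start, end) for fillers."""
    from faster_whisper import WhisperModel  # third-party, imported lazily

    model = WhisperModel(model_name)
    segments, _info = model.transcribe(
        path, word_timestamps=True, initial_prompt=PROMPT
    )
    return [
        (w.word, w.start, w.end)
        for seg in segments
        for w in seg.words
        if is_filler(w.word)
    ]
```

One side effect of collapsing letters this way: the verb "err" also reduces to "er", so a real implementation would want some context check before cutting it.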

Whisper still misses things, so three more passes look at the audio directly:

Gap fillers. If there’s an unusually long pause between two transcribed words (more than 350ms by default), erm checks whether somebody is actually making a sound during that “pause.” If a chunk of voice is sitting inside what Whisper marked as silence, that’s a filler Whisper deleted entirely. It really does just drop them. No token at all, just a hole in the transcript where an um used to be.
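The gap check could be sketched roughly like this. It assumes mono samples normalized to the range -1..1 and word times as `(start, end)` tuples; the energy threshold (`level`) and the minimum voiced length are guesses for illustration, and only the 350ms gap comes from the text above.

```python
import math


def frame_rms(samples, sr, frame_s=0.02):
    """RMS level of consecutive non-overlapping frames."""
    n = int(sr * frame_s)
    return [
        math.sqrt(sum(x * x for x in samples[i:i + n]) / n)
        for i in range(0, len(samples) - n + 1, n)
    ]


def gap_fillers(samples, sr, words, min_gap=0.35, level=0.05, min_voiced=0.10):
    """Find loud stretches inside long gaps between (start, end) word times."""
    found = []
    frame_s = 0.02
    for (_, prev_end), (next_start, _) in zip(words, words[1:]):
        if next_start - prev_end <= min_gap:
            continue  # ordinary pause between words, nothing to inspect
        lo, hi = int(prev_end * sr), int(next_start * sr)
        loud = [r > level for r in frame_rms(samples[lo:hi], sr, frame_s)]
        run_start = None
        for i, is_loud in enumerate(loud + [False]):  # sentinel closes a final run
            if is_loud and run_start is None:
                run_start = i
            elif not is_loud and run_start is not None:
                if (i - run_start) * frame_s >= min_voiced:
                    found.append(
                        (prev_end + run_start * frame_s, prev_end + i * frame_s)
                    )
                run_start = None
    return found
```

Plain energy cannot tell a voice from a door slam, so this is only the skeleton of the idea; a real detector needs a proper voicing test.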

Fillers hiding inside a word. Whisper sometimes glues a filler onto an adjacent word, so "in, uhhhhh" comes back as a single in token. erm looks at long single-token words, splits them at brief dips in the audio, figures out which chunk is the actual word (based on how long that word should reasonably take to say), and treats the rest as filler.
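A rough sketch of that split, working on per-frame loudness values for a single token. The dip threshold and the "seconds per character" estimate of how long a word should take are stand-ins invented for this example; the post does not say how erm estimates either.

```python
def split_at_dips(rms, dip_level=0.05):
    """Group frame indices into loud chunks separated by low-energy frames."""
    chunks, start = [], None
    for i, r in enumerate(list(rms) + [0.0]):  # trailing 0.0 closes the last chunk
        if r > dip_level and start is None:
            start = i
        elif r <= dip_level and start is not None:
            chunks.append((start, i))
            start = None
    return chunks


def filler_chunks(text, rms, frame_s=0.02, sec_per_char=0.08):
    """Keep the chunk closest to the word's expected length; return the rest."""
    chunks = split_at_dips(rms)
    if len(chunks) < 2:
        return []  # no dip, so nothing to split
    expected = len(text.strip()) * sec_per_char
    word = min(chunks, key=lambda c: abs((c[1] - c[0]) * frame_s - expected))
    # Times are relative to the start of the token.
    return [(a * frame_s, b * frame_s) for a, b in chunks if (a, b) != word]
```

For a token "in" whose loudness is 8 loud frames, a 2-frame dip, then 40 loud frames, the first chunk (0.16s) matches the expected length and the long second chunk is returned as the filler.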

Words that are much too long. If a word lasts way longer than its text could plausibly take to pronounce, the tail end is suspicious. erm scans the tail for voiced sound, and optionally double-checks with a pitch test: does the suspicious chunk sound like someone holding a vowel (uhhhhh), or like someone just speaking slowly? A held vowel has a steady, simple acoustic shape; real speech is constantly changing as you move between sounds. The pitch test keeps the tool from trimming slow talkers.
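To make the two checks concrete, here is a deliberately crude sketch. The duration rule (characters times a constant, with a slack factor) is an assumption, and the steadiness test counts zero crossings per frame as a cheap proxy for pitch; a real pitch tracker would be more reliable, and nothing here is taken from erm's code.

```python
import statistics


def is_overlong(text, start, end, sec_per_char=0.08, slack=2.5):
    """True if the word lasts far longer than its spelling suggests."""
    return (end - start) > slack * len(text.strip()) * sec_per_char


def zero_crossings(frame):
    """Count sign changes between neighbouring samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def looks_held(samples, sr, frame_s=0.03, max_spread=0.15):
    """Crude steadiness test: crossing counts per frame barely vary."""
    n = int(sr * frame_s)
    counts = [
        zero_crossings(samples[i:i + n])
        for i in range(0, len(samples) - n + 1, n)
    ]
    if len(counts) < 2 or statistics.mean(counts) == 0:
        return False
    return statistics.pstdev(counts) / statistics.mean(counts) <= max_spread
```

A steady tone produces nearly the same count in every frame, so the spread is small; anything whose frequency content keeps moving produces counts that wander, which is the behavior the pitch test is after.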

All four passes (the Whisper one and the three audio ones) produce candidate cuts independently, and the lists get merged before the next step.
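Merging the candidate lists is a standard interval merge. A small sketch, with the function name and the sample numbers made up for illustration:

```python
def merge_cuts(*passes):
    """Merge (start, end) lists from several passes, fusing overlaps."""
    merged = []
    for start, end in sorted(cut for cuts in passes for cut in cuts):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous cut: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


whisper = [(1.0, 1.4), (5.0, 5.3)]
gaps = [(1.3, 1.6)]
inside = []
overlong = [(9.0, 9.5)]
print(merge_cuts(whisper, gaps, inside, overlong))
# [(1.0, 1.6), (5.0, 5.3), (9.0, 9.5)]
```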

A cut at exactly t = 1.234s lands wherever the waveform happens to be at that instant, almost never at zero. Stitching two arbitrary points together leaves a step in the waveform, and that step is the click you hear.

Two small fixes, in order. First, each cut endpoint is allowed to slide a tiny bit (up to 60ms) to land in the quietest spot nearby. If there’s a momentary lull in the audio just before or after the original cut point, slide there. The slide is bounded so it can’t cross into a neighboring word, otherwise you’d chew off real speech. Second, from that quiet spot, the endpoint snaps to the nearest moment when the waveform is exactly crossing zero. Two zero points stitched together produce a continuous waveform with no step, and no click.
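Both steps can be sketched in a few lines. This assumes mono samples in a Python list; `lo` and `hi` stand for the neighboring word boundaries in seconds, and the 5ms window used to measure "quiet" is a guess. Only the 60ms slide limit comes from the text.

```python
def quietest_point(samples, sr, t, lo, hi, max_slide=0.06, win=0.005):
    """Slide t by at most max_slide, staying inside [lo, hi], to the quietest window."""
    w = max(1, int(win * sr))
    first = max(int(max(lo, t - max_slide) * sr), 0)
    last = min(int(min(hi, t + max_slide) * sr), len(samples) - w)
    if last < first:
        return int(t * sr)  # no room to slide
    best = min(
        range(first, last + 1),
        key=lambda i: sum(x * x for x in samples[i:i + w]),
    )
    return best + w // 2  # sample index at the centre of that window


def snap_to_zero(samples, idx):
    """Return the nearest index where the waveform changes sign or touches zero."""
    for d in range(len(samples)):
        for i in (idx - d, idx + d):
            if 0 < i < len(samples) and samples[i - 1] * samples[i] <= 0:
                return i
    return idx
```

Calling `snap_to_zero(samples, quietest_point(...))` for each endpoint gives a pair of sample indices that both sit on a zero crossing, which is what makes the joined waveform continuous.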

After all that, very short surviving fragments get cleaned up: if two adjacent cuts would leave a sliver of audio shorter than about 120ms between them, the sliver gets merged into one bigger cut. A fragment that small can’t survive the smoothing on either side anyway and just sounds like a blip.
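The sliver cleanup looks like the earlier merge, except that two cuts fuse when the audio left between them is too short rather than when they overlap. A sketch, with the 120ms figure taken from the text and the function name invented:

```python
def absorb_slivers(cuts, min_keep=0.12):
    """Fuse adjacent cuts when the audio between them is under min_keep seconds."""
    out = []
    for start, end in sorted(cuts):
        if out and start - out[-1][1] < min_keep:
            out[-1] = (out[-1][0], max(out[-1][1], end))
        else:
            out.append((start, end))
    return out


print(absorb_slivers([(1.0, 1.4), (1.45, 1.9), (3.0, 3.2)]))
# [(1.0, 1.9), (3.0, 3.2)]
```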

ffmpeg does the actual stitching using a crossfade. Instead of butting the two pieces of audio together, it briefly overlaps them and fades one out as the other fades in. That smooths over any remaining mismatch.
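To show what a crossfade does to the samples, here is a plain-Python version with linear fades. It illustrates the idea only; erm hands this step to ffmpeg, and the fade curve ffmpeg applies may differ.

```python
def crossfade(a, b, overlap):
    """Join a and b, overlapping `overlap` samples with linear fades."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return list(a) + list(b)
    head, tail = a[:len(a) - overlap], b[overlap:]
    mixed = []
    for i in range(overlap):
        g = (i + 1) / (overlap + 1)  # gain of b rises while gain of a falls
        mixed.append(a[len(a) - overlap + i] * (1 - g) + b[i] * g)
    return list(head) + mixed + list(tail)
```

The result is `overlap` samples shorter than the two pieces laid end to end, because the overlapped region is shared.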

The trick is picking how long to overlap. A fixed length (most tutorials say 80ms or so) sounds wrong both ways: short cuts get smeared together, long cuts still pop. erm scales the length to the size of the cut: a tiny clip of uh gets a short crossfade, a long ummmmm gets a longer one. There’s a floor and ceiling (50ms to 120ms), and the crossfade is never allowed to reach back across the start of a real word, which would muddy the speech on either side.
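One way to express that rule, as a sketch: scale with the cut, clamp to the 50ms to 120ms range, then cap by the space available around the splice. The 0.2 ratio is an invented number, and `room_before` / `room_after` (seconds of non-speech audio on each side of the splice) are hypothetical parameters, not erm's.

```python
def crossfade_seconds(cut_len, room_before, room_after,
                      ratio=0.2, floor=0.05, ceiling=0.12):
    """Scale the fade with the cut, clamp it, then cap it by nearby speech."""
    fade = min(max(cut_len * ratio, floor), ceiling)
    # Never let the fade reach into a real word, even if that goes below the floor.
    return min(fade, room_before, room_after)


print(crossfade_seconds(0.15, 1.0, 1.0))  # 0.05 (short "uh" hits the floor)
print(crossfade_seconds(2.0, 1.0, 1.0))   # 0.12 (long "ummmmm" hits the ceiling)
print(crossfade_seconds(2.0, 0.08, 1.0))  # 0.08 (capped by the word before)
```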

Even after all of the above, the background hiss of the recording (the ambient sound of the room when nobody’s talking) doesn’t perfectly match across cuts. Every room has a slightly different “silence,” and stitching two near-silences together still produces a faint shift you can hear.

The fix is dumb but it works. Find a quiet stretch in the original recording (a real piece of “this room when nobody’s talking”) and loop it underneath the entire output at low volume. Now the background is identical everywhere, because it’s the same loop everywhere. Any small mismatch at each splice gets covered up by the steady tone sitting on top.
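A bare-bones sketch of the room-tone trick, assuming mono samples in Python lists. The half-second loop length, the 50ms search step, and the 0.5 gain are all guesses made for the example.

```python
def quietest_window(samples, sr, seconds=0.5, hop=0.05):
    """Return the start index of the lowest-energy window of the given length."""
    n, step = int(seconds * sr), max(1, int(hop * sr))
    starts = range(0, len(samples) - n + 1, step)
    return min(starts, key=lambda i: sum(x * x for x in samples[i:i + n]))


def add_room_tone(output, source, sr, gain=0.5, seconds=0.5):
    """Loop the quietest stretch of `source` under every sample of `output`."""
    n = int(seconds * sr)
    start = quietest_window(source, sr, seconds)
    tone = source[start:start + n]
    return [x + gain * tone[i % n] for i, x in enumerate(output)]
```

Looping a raw slice like this can tick at the loop point if the two ends do not line up, so a real implementation would fade the loop into itself.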

By default the quiet stretch is found automatically. You can also point it at a specific time range if you know a good one.

ffmpeg has a built-in noise reducer, and you can run it on the audio at various points in the pipeline. The catch: denoising smooths out the very details (volume bumps and pitch wiggles) that the detectors rely on to find fillers. So it matters when you do it.

erm has four modes:

| Mode | Detection looks at | The output is cut from |
|---|---|---|
| `none` | the original | the original |
| `pre` | a denoised copy | the denoised copy |
| `post` | the original | the original; denoised at the end |
| `hybrid` | the original | a denoised copy |

hybrid is the default, and the one you want: detection runs on the original audio (so it can see all the cues), but the actual cuts come from a clean, denoised copy (so the splices sound nice).

pre looks sensible but is the worst option, because running the detectors on denoised audio hides the very things they’re looking for.
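The four modes boil down to two choices (which audio the detectors see, which audio gets cut) plus an optional final denoise. A sketch of that mapping; the `MODES` table mirrors the one above, while `plan` and its step descriptions are illustrative, not erm's internals.

```python
MODES = {
    # mode: (detect on, cut from, denoise after cutting)
    "none":   ("original", "original", False),
    "pre":    ("denoised", "denoised", False),
    "post":   ("original", "original", True),
    "hybrid": ("original", "denoised", False),
}


def plan(mode="hybrid"):
    """List the pipeline steps a given denoise mode implies, in order."""
    detect_on, cut_from, denoise_after = MODES[mode]
    steps = []
    if "denoised" in (detect_on, cut_from):
        steps.append("denoise a copy of the input")
    steps.append(f"detect fillers on the {detect_on} audio")
    steps.append(f"cut and splice the {cut_from} audio")
    if denoise_after:
        steps.append("denoise the spliced output")
    return steps


for step in plan("hybrid"):
    print(step)
```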

Audio renders can break in subtle ways, so there’s a validate subcommand:

uvx erm validate input.wav cleaned.wav --cuts cuts.json

It runs three checks:

The output file actually opens.
The output is shorter than the input by roughly the total length of the cuts (within a small margin).
When you transcribe the cleaned file back to text, no fillers come back.

That last one is the useful one. It’s end-to-end: it tells you the tool actually did what it claimed.
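The first two checks are easy to sketch with Python's standard `wave` module. This assumes the cuts are already loaded as `(start, end)` pairs in seconds (the layout of erm's JSON cut list is not described here) and uses a 0.25s margin picked for illustration; crossfades shorten the output a little, which is why some margin is needed. The third check needs a transcription pass and is left out.

```python
import wave


def wav_seconds(path):
    """Duration of a WAV file; raises if the file cannot be opened."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def duration_ok(input_path, output_path, cuts, margin=0.25):
    """Check the output is shorter than the input by about the total cut length."""
    removed = sum(end - start for start, end in cuts)
    expected = wav_seconds(input_path) - removed
    return abs(wav_seconds(output_path) - expected) <= margin
```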

It leaves "like," "you know," and "I mean" alone. Those sound like fillers but they’re doing real work in the sentence, and cutting them automatically would change what someone said. The rule erm follows: only remove things that are sound, not language.

It also doesn’t touch repeated words, false starts, or long thinking pauses. Those aren’t noise on top of the speech; they are the speech, just messier than the speaker would like. Cleaning them up is an editorial decision about which take to keep, and erm doesn’t have an opinion about that.

The quickest way is with uv, which fetches and runs the tool in one step without a permanent install:

uvx erm input.wav --dry-run     # see what would be cut
uvx erm input.wav               # render

If you’d rather install it the usual way:

pip install erm                 # or: pipx install erm
erm input.wav

You’ll also need ffmpeg and ffprobe on your PATH (brew install ffmpeg on macOS).

The code is at github.com/dougcalobrisi/erm. Audio stays local. If you record voice notes or podcasts and every other word is um, give it a try.
