Show HN: I trained a 9M speech model to fix my Mandarin tones

Original link: https://simedw.com/2026/01/31/ear-pronunication-via-ctc/

## Mandarin Pronunciation Training with a Custom Model

The author struggled with Mandarin pronunciation, especially tones, so they built a Computer-Assisted Pronunciation Training (CAPT) system to get personalised feedback. Existing commercial options weren't ideal, so they developed a solution that runs entirely in the browser. The system uses a Conformer encoder trained with Connectionist Temporal Classification (CTC) loss, a specialised automatic speech recognition (ASR) approach that focuses on *how* something is said, not just *what* is said. Rather than auto-correcting mistakes, it pinpoints precise pronunciation errors by analysing the audio frame by frame. A key feature is tokenising the initials and finals *together with* the tone, which lets the model locate tonal inaccuracies exactly. The model was trained on roughly 300 hours of speech and remained surprisingly effective even after a drastic size reduction (down to 9M parameters with almost no loss in accuracy). One key fix involved filtering out silent frames during alignment so that pauses are not misread as incorrect syllables. The resulting tool gives strict, unforgiving feedback, which proved very useful for the author's learning. Native speakers and children currently need to enunciate more clearly, and future improvements will focus on incorporating more diverse datasets, such as conversational speech. An online demo is available; the roughly 13 MB download runs directly in your browser.

Hacker News: Show HN: I trained a 9M speech model to fix my Mandarin tones (simedw.com), 11 points by simedw 39 minutes ago | 1 comment

simedw: I built this because tones are ruining my spoken Mandarin, and I can't reliably hear my own mistakes. It's a 9M Conformer-CTC model, trained on ~300 hours (AISHELL + Primewords), quantised to INT8 (11 MB), running 100% in the browser via ONNX Runtime Web. Each syllable's pronunciation + tone is scored with Viterbi forced alignment. Try it here: https://simedw.com/projects/ear/

jellojello 4 minutes ago: This is awesome. If you feel like opening up another language to make it easier to learn, Persian is a really overlooked one. My wife and her family speak it, but it's hard to find great language courses (it's also known as Farsi/Dari).

Original article

TL;DR: Mandarin pronunciation has been hard for me, so I took ~300 hours of transcribed speech and trained a small CTC model to grade my pronunciation. You can try it here.

In my previous post about Langseed, I introduced a platform for defining words using only vocabulary I had already mastered. My vocabulary has grown since then, but unfortunately, people still struggle to understand what I'm saying.

Part of the problem is tones. They're fairly foreign to me, and I'm bad at hearing my own mistakes, which is deeply frustrating when you don’t have a teacher.

First attempt: pitch visualisation

My initial plan was to build a pitch visualiser: split incoming audio into small chunks, run an FFT, extract the dominant pitch over time, and map it using an energy-based heuristic, loosely inspired by Praat.
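The post doesn't include code for this stage, but a minimal NumPy sketch of that heuristic (the energy threshold and pitch range here are made up for illustration) might look like:

import numpy as np

def dominant_pitch(chunk: np.ndarray, sample_rate: int = 16000) -> float | None:
    """Return the strongest frequency in one mono float audio chunk, or None if it's too quiet."""
    # Simple energy gate: skip near-silent chunks.
    if np.sqrt(np.mean(chunk ** 2)) < 0.01:
        return None
    windowed = chunk * np.hanning(len(chunk))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(chunk), d=1.0 / sample_rate)
    # Restrict to a plausible voice pitch range (~60-400 Hz).
    mask = (freqs >= 60) & (freqs <= 400)
    if not mask.any():
        return None
    return float(freqs[mask][np.argmax(spectrum[mask])])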

But this approach quickly became brittle. There were endless special cases: background noise, coarticulation, speaker variation, voicing transitions, and so on.

And if there’s one thing we’ve learned over the last decade, it’s the bitter lesson: when you have enough data and compute, learned representations usually beat carefully hand-tuned systems.

So instead, I decided to build a deep learning–based Computer-Assisted Pronunciation Training (CAPT) system that could run entirely on-device. There are already commercial APIs that do this, but hey, where’s the fun in that?

Architecture

I treated this as a specialised Automatic Speech Recognition (ASR) task. Instead of just transcribing text, the model needs to be pedantic about how something was said.

I settled on a Conformer encoder trained with CTC (Connectionist Temporal Classification) loss.

Why Conformer?

Speech is weird: you need to catch both local and global patterns:

  1. Local interactions
    The difference between a retroflex zh and an alveolar z happens in a split second. CNNs are excellent at capturing these short-range spectral features.

  2. Global interactions
    Mandarin tones are relative (a "high" pitch for me might be low for a child) and context-dependent (tone sandhi). Transformers excel at modeling this longer-range context.

Conformers combine both: convolution for local detail, attention for global structure.
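The post doesn't share model code. As a rough sketch of this shape of encoder, here is how it might be wired up with torchaudio's Conformer; the hyper-parameters are illustrative guesses, not the author's, and the vocabulary size matches the pinyin+tone token set described later:

import torch
import torchaudio

class ConformerCTC(torch.nn.Module):
    def __init__(self, vocab_size: int, n_mels: int = 80, d_model: int = 144):
        super().__init__()
        # Project log-mel features into the model dimension.
        self.proj = torch.nn.Linear(n_mels, d_model)
        # Convolution + self-attention blocks: local detail and global context.
        self.encoder = torchaudio.models.Conformer(
            input_dim=d_model,
            num_heads=4,
            ffn_dim=4 * d_model,
            num_layers=12,
            depthwise_conv_kernel_size=31,
        )
        self.head = torch.nn.Linear(d_model, vocab_size)

    def forward(self, feats: torch.Tensor, lengths: torch.Tensor):
        # feats: (batch, time, n_mels) log-mel features; lengths: frames per utterance.
        x, out_lengths = self.encoder(self.proj(feats), lengths)
        # Per-frame log-probabilities over the token vocabulary.
        # (Transpose to (time, batch, vocab) before feeding torch.nn.CTCLoss.)
        return self.head(x).log_softmax(dim=-1), out_lengths

model = ConformerCTC(vocab_size=1256)  # 1,254 pinyin+tone tokens + <unk> + <blank>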

Why CTC?

Most modern ASR models (e.g. Whisper) are sequence-to-sequence: they turn audio into the most likely text. The downside is they'll happily auto-correct you.

That’s a feature for transcription, but it’s a bug for language learning. If my tone is wrong, I don’t want the model to guess what I meant. I want it to tell me what I actually said.

CTC works differently. It outputs a probability distribution for every frame of audio (roughly every 40 ms). To handle alignment, it introduces a special <blank> token.

If the audio is "hello", the raw output might look like:

h h h <blank> e e <blank> l l l l <blank> l l o o o

Collapsing repeats and removing blanks gives hello. This forces the model to deal with what I actually said, frame by frame.
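A tiny decoder implementing that collapse rule:

def ctc_collapse(frame_tokens: list[str], blank: str = "<blank>") -> list[str]:
    """Collapse repeated frame predictions and drop blanks, per the CTC rule."""
    out = []
    prev = None
    for tok in frame_tokens:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out

frames = "h h h <blank> e e <blank> l l l l <blank> l l o o o".split()
print("".join(ctc_collapse(frames)))  # -> "hello"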

Forced alignment: knowing when you said it

CTC tells us what was said, but not exactly when.

For a 3-second clip, the model might output a matrix with ~150 time steps (columns), each containing probabilities over all tokens (rows). Most of that matrix is just <blank>.

If the user reads "Nǐ hǎo" (ni3, hao3), we expect two regions of high probability: one for ni3, one for hao3.

We need to find a single, optimal path through this matrix that:

  • Starts at the beginning
  • Ends at the end
  • Passes through ni3 then hao3, in order
  • Maximises total probability

This is exactly what the Viterbi algorithm computes, using dynamic programming.
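The post doesn't include its alignment code; a minimal NumPy sketch of CTC Viterbi forced alignment over the blank-interleaved target sequence could look like this:

import numpy as np

def ctc_viterbi_align(log_probs: np.ndarray, targets: list[int], blank: int = 0):
    """
    Find the most likely frame-level path through a CTC output matrix that
    spells out `targets` in order.

    log_probs: (T, V) per-frame log-probabilities over the vocabulary.
    targets:   token ids in order, e.g. [id_of("ni3"), id_of("hao3")].
    Returns a list of length T mapping each frame to an index into the
    expanded sequence [blank, t1, blank, t2, ..., blank].
    """
    T, _ = log_probs.shape
    ext = [blank]
    for tok in targets:
        ext += [tok, blank]
    S = len(ext)

    NEG = -1e30
    dp = np.full((T, S), NEG)
    back = np.zeros((T, S), dtype=np.int64)

    # A valid path starts in the first blank or the first real token.
    dp[0, 0] = log_probs[0, ext[0]]
    dp[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Stay in the same state, or advance from the previous one.
            cands = [(dp[t - 1, s], s)]
            if s >= 1:
                cands.append((dp[t - 1, s - 1], s - 1))
            # Skip over a blank only between two *different* non-blank tokens.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((dp[t - 1, s - 2], s - 2))
            best, argbest = max(cands)
            dp[t, s] = best + log_probs[t, ext[s]]
            back[t, s] = argbest

    # A valid path ends in the last blank or the last real token.
    s = S - 1 if dp[T - 1, S - 1] >= dp[T - 1, S - 2] else S - 2
    path = [0] * T
    for t in range(T - 1, -1, -1):
        path[t] = s
        s = back[t, s]
    return path, ext

Frames whose state index s is odd belong to targets[(s - 1) // 2], which gives each expected syllable a start and end frame.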

Tokenisation: Pinyin + tone as first-class tokens

Most Mandarin ASR systems output Hanzi. That hides pronunciation errors, because the writing system encodes meaning rather than pronunciation.

Instead, I created a token for every Pinyin syllable + tone:

  • zhong1 is one token
  • zhong4 is a completely different token

If I say the wrong tone, the model explicitly predicts the wrong token ID.

I also normalised the neutral tone by forcing it to be tone 5 (ma5). This resulted in a vocabulary of 1,254 tokens, plus <unk> and <blank>.
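As an illustration of how transcripts could be mapped onto such tokens, here is a sketch using the pypinyin library (the post doesn't say which tool the author used); neutral_tone_with_five mirrors the tone-5 normalisation:

from pypinyin import lazy_pinyin, Style

def to_tokens(hanzi: str) -> list[str]:
    """Map a Hanzi transcript to pinyin+tone tokens, e.g. '中' -> 'zhong1'."""
    # TONE3 appends the tone digit; neutral_tone_with_five writes the
    # neutral tone as 5, matching the ma5 normalisation described above.
    return lazy_pinyin(hanzi, style=Style.TONE3, neutral_tone_with_five=True)

print(to_tokens("你好"))  # ['ni3', 'hao3']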

Training

I combined the AISHELL-1 and Primewords datasets (~300 hours total), augmented by SpecAugment (time/frequency masking). On 4× NVIDIA GeForce RTX 4090s, training took about 8 hours. Instead of obsessing over loss, I mostly focused on these metrics:

  1. TER (Token Error Rate): overall accuracy.
  2. Tone Accuracy: accuracy over tones 1-5.
  3. Confusion Groups: errors between difficult initial pairs like zh/ch/sh vs z/c/s.
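The post doesn't define these metrics formally. Roughly, TER is token-level edit distance divided by reference length, and tone accuracy compares the tone digit of each syllable; a simplified sketch (the tone check here naively assumes positionally aligned sequences):

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Token-level Levenshtein distance."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        row = [i]
        for j, h in enumerate(hyp, 1):
            row.append(min(prev_row[j] + 1,              # deletion
                           row[j - 1] + 1,               # insertion
                           prev_row[j - 1] + (r != h)))  # substitution
        prev_row = row
    return prev_row[-1]

def ter(ref: list[str], hyp: list[str]) -> float:
    return edit_distance(ref, hyp) / max(len(ref), 1)

def tone_accuracy(ref: list[str], hyp: list[str]) -> float:
    """Share of syllables whose tone digit (last character, 1-5) matches."""
    pairs = list(zip(ref, hyp))
    return sum(r[-1] == h[-1] for r, h in pairs) / max(len(pairs), 1)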

[Figure: validation TER during training] [Figure: validation tone confusion matrix during training]

Honey, I shrank the model

I started with a "medium" model (~75M parameters). It worked well, but I wanted something that could run in a browser or on a phone without killing the battery.

So I kept shrinking it, and I was honestly surprised by how little accuracy I lost:

Parameters   TER     Tone accuracy
75M          4.83%   98.47%
35M          5.16%   98.36%
9M           5.27%   98.29%

The 9M-parameter model was barely worse. This strongly suggests the task is data-bound, not compute-bound.

The FP32 model was ~37 MB. After INT8 quantisation, it shrank to ~11 MB with a negligible accuracy drop (+0.0003 TER). Small enough to load instantly via onnxruntime-web.
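The post doesn't show the quantisation step; with ONNX Runtime's Python tooling, dynamic INT8 quantisation is roughly the following (file names are placeholders, not the project's actual artefacts):

from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="ear_fp32.onnx",
    model_output="ear_int8.onnx",
    weight_type=QuantType.QInt8,  # weights stored as INT8: ~37 MB FP32 -> ~11 MB
)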

Alignment bug: silence ruins everything

To highlight mistakes, we need forced alignment. But I hit a nasty bug with leading silence.

I recorded myself saying "我喜欢…" and paused for a second before speaking. The model confidently told me my first syllable was wrong. Confidence score: 0.0.

Why?

The alignment assigned the silent frames to wo3. When I averaged probabilities over that span, the overwhelming <blank> probability completely drowned out wo3.

The fix

I decoupled UI spans (what gets highlighted) from scoring frames (what contributes to confidence).

We simply ignore frames where the model is confident it’s seeing silence:

import torch

def _filter_nonblank_frames(span_logp: torch.Tensor, blank_id: int = 0, thr: float = 0.7) -> torch.Tensor:
    """
    Only keep frames where the probability of <blank> is below a threshold.
    If we filter everything (total silence), we fall back to scoring the whole span.
    """
    # span_logp: (frames, vocab) log-probabilities for one syllable's aligned span.
    p_blank = span_logp[:, blank_id].exp()
    keep = p_blank < thr
    if keep.any():
        return span_logp[keep]
    return span_logp  # Fallback: score the whole span

This single change moved my confidence score for the first syllable from 0.0 → 0.99.
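For context, a per-syllable confidence in the spirit of the description above could then be computed like this (a sketch building on _filter_nonblank_frames; the project's exact scoring may differ):

def syllable_confidence(span_logp: torch.Tensor, token_id: int, blank_id: int = 0) -> float:
    """Average probability of the expected token over the non-silent frames of its span."""
    frames = _filter_nonblank_frames(span_logp, blank_id=blank_id)
    return frames[:, token_id].exp().mean().item()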

Conclusion

I can already feel my pronunciation improving while beta testing this. It’s strict and unforgiving, exactly what I needed.

Native speakers, interestingly, complained that they had to over-enunciate to get marked correct. That’s likely a domain-shift issue: AISHELL is mostly read speech, while casual speech is faster and more slurred. Kids do poorly too: their pitch is higher, and they're basically absent from the training data. Adding conversational datasets like Common Voice feels like the obvious next step.

You can try the live demo here. It runs entirely in your browser. The download is ~13MB, still smaller than most websites today.
