StutterZero: Speech Conversion for Stuttering Transcription and Correction

Original link: https://arxiv.org/abs/2510.18938

## StutterZero & StutterFormer:用于口吃问题的端到端语音转换 这项研究介绍了StutterZero和StutterFormer,两种新颖的端到端人工智能模型,旨在直接将口吃语音转换为流畅语音,*同时*进行语音转录。当前的方法在准确处理不流利语音方面存在困难,通常依赖于复杂的多阶段流程。这些新模型绕过了这种复杂性。 StutterZero利用卷积-双向LSTM架构,而StutterFormer采用双流Transformer。两者均在合成和真实的口吃语音数据上进行训练。 在未见过说话者上的评估表明,与现有的Whisper-Medium模型相比,有了显著的改进:StutterZero实现了24%的转录错误减少(WER)和31%的语义相似度提升,而StutterFormer则进一步将这些结果提升至28%和34%。 这项工作展示了直接、端到端口吃校正的潜力,为更具包容性的语音技术、改进的语音治疗工具和易于访问的人工智能系统铺平了道路。


Original paper

StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction, by Qianheng Xu
Abstract: Over 70 million people worldwide experience stuttering, yet most automatic speech systems misinterpret disfluent utterances or fail to transcribe them accurately. Existing methods for stutter correction rely on handcrafted feature extraction or multi-stage automatic speech recognition (ASR) and text-to-speech (TTS) pipelines, which separate transcription from audio reconstruction and often amplify distortions. This work introduces StutterZero and StutterFormer, the first end-to-end waveform-to-waveform models that directly convert stuttered speech into fluent speech while jointly predicting its transcription. StutterZero employs a convolutional-bidirectional LSTM encoder-decoder with attention, whereas StutterFormer integrates a dual-stream Transformer with shared acoustic-linguistic representations. Both architectures are trained on paired stuttered-fluent data synthesized from the SEP-28K and LibriStutter corpora and evaluated on unseen speakers from the FluencyBank dataset. Across all benchmarks, StutterZero had a 24% decrease in Word Error Rate (WER) and a 31% improvement in semantic similarity (BERTScore) compared to the leading Whisper-Medium model. StutterFormer achieved better results, with a 28% decrease in WER and a 34% improvement in BERTScore. The results validate the feasibility of direct end-to-end stutter-to-fluent speech conversion, offering new opportunities for inclusive human-computer interaction, speech therapy, and accessibility-oriented AI systems.
From: Qianheng Xu
[v1] Tue, 21 Oct 2025 17:54:36 UTC (9,663 KB)
[v2] Wed, 5 Nov 2025 00:00:48 UTC (9,657 KB)
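The abstract reports gains in Word Error Rate and BERTScore. As a hedged sketch of how such metrics are commonly computed (the example sentences and the choice of the jiwer and bert-score libraries are illustrative, not taken from the paper):

```python
# Illustrative evaluation sketch: WER via jiwer, semantic similarity via
# BERTScore. Inputs are toy strings, not data from the paper.
from jiwer import wer
from bert_score import score

references = ["the quick brown fox jumps over the lazy dog"]
hypotheses = ["the quick brown fox jumps over a lazy dog"]

word_error_rate = wer(references, hypotheses)        # lower is better
P, R, F1 = score(hypotheses, references, lang="en")  # higher F1 is better

print(f"WER: {word_error_rate:.3f}")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```

The paper's reported percentages are relative changes in these scores on FluencyBank speakers versus Whisper-Medium, not absolute values.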