Chrome扩展程序根据说话速度调整视频速度。

Chrome扩展程序根据说话速度调整视频速度。
Chrome extension adjusts video speed based on how fast the speaker is talking

原始链接: https://github.com/ywong137/speech-speed

## 动态视频速度调整 Chrome 扩展 - 摘要这个 Chrome 扩展程序智能地调整视频播放速度，以规范语速，从而在不牺牲理解的情况下加快内容消费速度。它通过使用 Web Audio API 分析页面上最大的视频元素的音频轨道来实现。该扩展程序的核心在于其音节速率检测算法。在隔离关键元音频率（300-3000Hz）后，它分析能量包络调制——特别是过零率——来估计说话者说话的速度。以前使用峰值计数和 AudioWorklets 的迭代由于语音特征和浏览器安全策略而面临限制。检测到的速率随后用于计算目标播放速度，旨在达到每秒 9 个音节的稳定速率（用户可调整）。速度变化被平滑处理，以避免突兀的过渡。用户界面提供对目标速率、最小/最大速度的控制，并显示实时数据，例如当前速度、估计的音节速率和能量水平。局限性包括与 DRM 保护的内容不兼容以及音乐或多个说话者可能出现不准确的情况。未来的改进可能包括语音活动检测和针对每个说话者的自适应调整。

一个Chrome扩展程序旨在根据说话者的语速自动调整视频播放速度，从而可能为观众节省时间。该扩展程序发布在Hacker News上，引发了不同的反应。虽然被赞赏为一个巧妙的想法，但人们也对安装浏览器扩展程序相关的安全风险表示担忧，因为过去曾发生过恶意接管事件。一些用户报告了功能问题，例如扩展程序错误计算语速或根本无法检测到语音。另一些人则表示希望有工具可以简单地*跳过*视频的介绍部分，从而推荐了“Sponsorblock”扩展程序，该程序通过众包来获取跳过介绍和广告的时间戳。还建议使用Gemini等工具来总结视频。一位用户甚至提议创建一个扩展程序，以自动将视频快进到特定时间戳。

原文

A Chrome extension that dynamically adjusts video playback speed based on how fast the speaker is talking. Slow speakers get sped up more; fast speakers get sped up less. The goal is to normalize all speech to a comfortable listening rate so you can consume video content faster without it becoming unintelligible during rapid passages.

The extension's content script finds the largest <video> element on the page and taps its audio output using HTMLMediaElement.captureStream(). This returns a MediaStream without interrupting normal playback. The stream is fed into a Web Audio API graph:

video.captureStream()
  → MediaStreamSource
    → BiquadFilter (highpass 300 Hz)
      → BiquadFilter (lowpass 3000 Hz)
        → AnalyserNode (polled at ~33 Hz)

The 300-3000 Hz bandpass isolates the vowel formant region where syllable energy is concentrated, rejecting low-frequency rumble and high-frequency noise.

The core algorithm measures syllable rate by analyzing modulation in the energy envelope. It went through three iterations:

v1 (abandoned): Threshold peak-counting. Computed a smoothed RMS envelope and counted peaks above an adaptive threshold. Failed because fast continuous speech maintains high energy without deep dips between syllables, so the detector saw one long above-threshold region instead of individual syllables. This systematically undercounted fast speech.

v2 (abandoned): AudioWorklet-based detection. The same peak-counting algorithm running in an AudioWorklet for off-main-thread processing. Blocked by YouTube's Content Security Policy, which rejects blob: URLs for script loading.

v3 (current): Energy-envelope modulation analysis. Instead of looking for individual peaks, this approach isolates the syllable-rate modulation of the energy envelope using a high-pass filter, then counts zero-crossings of the filtered signal.

The algorithm:

Compute RMS energy over each 2048-sample window (~21 ms at 96 kHz) via AnalyserNode.getFloatTimeDomainData().
Smooth envelope with a first-order IIR low-pass (alpha=0.3) for silence detection and display.
High-pass filter the raw RMS to remove DC and slow-varying level, leaving only syllable-rate modulation (2-10 Hz). This is a first-order IIR high-pass:
```
filtered[n] = alpha * (filtered[n-1] + rms[n] - rms[n-1])
```
With alpha=0.9 at 33 Hz polling, the cutoff is ~0.5 Hz. Everything above 0.5 Hz passes through, including the 2-10 Hz syllable modulation.
Count positive-going zero-crossings of the filtered signal. Each time the filtered energy rises from below zero to above zero, one syllable nucleus (vowel) has been detected. A minimum interval of 70 ms between crossings caps the maximum detectable rate at ~14 syl/s.
Compute syllable rate as crossings per second over a 4-second sliding window.

This approach works because even during continuous fast speech, the energy envelope oscillates between syllable nuclei (vowels, high energy) and inter-syllabic transitions (consonants, lower energy). The high-pass filter extracts this oscillation regardless of absolute level, and zero-crossing counting is inherently robust to amplitude variations.

The detected syllable rate is converted to a playback speed:

naturalRate = measuredRate / currentPlaybackSpeed
targetSpeed = targetSyllableRate / naturalRate
targetSpeed = clamp(targetSpeed, minSpeed, maxSpeed)
currentSpeed += smoothingAlpha * (targetSpeed - currentSpeed)

The measuredRate / currentPlaybackSpeed correction accounts for the fact that captureStream() reflects the current playback rate: if the video plays at 2x, the audio arrives at 2x speed, so measured syllables come twice as fast. Dividing by the current speed recovers the speaker's natural rate.

With the default target of 9 syl/s:

Speaker pace	Natural rate	Playback speed
Very slow	~2.5 syl/s	3.50x (capped)
Slow	~3.0 syl/s	3.00x
Normal	~4.5 syl/s	2.00x
Fast	~6.0 syl/s	1.50x
Very fast	~8.0 syl/s	1.12x

During silence (low energy or low syllable rate for more than 3 seconds), the speed gradually drifts back toward 1x.

Speed changes use an exponential moving average (alpha=0.25, updated ~33 times/sec) to prevent jarring jumps. This gives a time constant of roughly 1 second: fast enough to adapt when speakers change, slow enough that brief pauses don't cause stuttery speed oscillation.

Clone or download this repository
Open chrome://extensions in Chrome
Enable Developer mode (top right)
Click Load unpacked and select the speech-speed directory
Navigate to any page with a video, click the extension icon, and toggle ON

File	Purpose
`manifest.json`	Manifest V3 extension config
`content.js`	Content script: audio capture, syllable detection, speed control, diagnostic overlay
`popup.html/js/css`	Extension popup: on/off toggle and settings sliders
`syllable-worklet.js`	(Legacy, unused) AudioWorklet from v2 approach

The popup exposes three sliders:

Target rate (4-14 syl/s, default 9): the desired effective syllable rate you hear. Higher = more speedup across the board. Lower = gentler.
Min speed (1-2x, default 1.0): floor for playback rate. Set to 1.0 to never slow below normal.
Max speed (1.5-5x, default 3.5): ceiling for playback rate. Protects against extreme speedup on very slow speakers.

Settings persist in chrome.storage.local.

When enabled, a dark panel appears in the top-right corner of the page showing:

Speed (large green number): current playback multiplier
syl/s (large blue number): estimated natural syllable rate of the speaker
State: SPEAKING (green) or SILENCE with duration (amber)
Energy bar: real-time audio energy level
History log: scrolling list of rate-to-speed mappings, one per second, most recent at bottom (bold). Lets you see at a glance how the speed changes as different speakers talk.
Stats: zero-crossing count, poll tick count, pipeline stage

All tunable parameters are in content.js. Here are the ones most worth adjusting:

Detector parameters (in the `detector` object)

Parameter	Default	What it does	How to tune
`hpAlpha`	0.9	High-pass filter coefficient. Higher = lower cutoff = passes more low-frequency modulation.	If detecting too many spurious syllables in music/ambient noise, lower toward 0.8. If missing syllables in very slow speech (<2 syl/s), raise toward 0.95.
`minCrossingInterval`	70 ms	Minimum time between detected syllables. Caps max detectable rate at ~14 syl/s.	Raise to 100 ms if getting false positives from music. Lower to 50 ms if very fast speech (>10 syl/s natural) is being undercounted.
`minEnergy`	0.003	Absolute energy floor for silence gating. Below this, no syllables are counted.	Raise if background noise or music is triggering false detections. Lower if quiet speech is being missed. Inspect the "Energy" readout in the overlay.
`envelopeAlpha`	0.3	Smoothing for the envelope used in silence detection.	Rarely needs changing. Higher = more responsive silence detection, lower = more sluggish.
`windowSize`	4000 ms	Sliding window for rate computation.	Shorter (2000 ms) = faster adaptation to speaker changes, but noisier rate estimate. Longer (6000 ms) = smoother but slower to react.

Speed control parameters (in `DEFAULTS`)

Parameter	Default	What it does	How to tune
`targetRate`	9 syl/s	Desired effective syllable rate. The speed adjusts so you hear syllables at roughly this rate.	This is the primary dial. 7 = moderate speedup. 9 = aggressive. 12 = very aggressive. Normal English speech is ~4-5 syl/s.
`smoothing`	0.25	EMA alpha for speed changes.	Lower (0.1) = slower, more gradual transitions. Higher (0.4) = snappier but potentially jittery.
`silenceHoldSec`	3	Seconds of silence before speed starts drifting toward 1x.	Raise if speed is resetting during natural pauses in speech. Lower if speed hangs too long after someone stops talking.
`minSpeed` / `maxSpeed`	1.0 / 3.5	Speed clamps.	Adjust based on your tolerance. Some people can track 4-5x; others top out at 2x.

The highpass (300 Hz) and lowpass (3000 Hz) in the audio pipeline isolate the vowel formant region. If you're seeing poor detection on content with unusual audio characteristics:

Lots of bass bleed triggering false syllables: raise highpass to 400-500 Hz
Missing high-pitched speakers: raise lowpass to 4000 Hz
Sibilance ("s", "sh") being counted as syllables: lower lowpass to 2500 Hz

The setInterval in setupAudio runs at 30 ms (~33 Hz). This is well above the Nyquist rate for syllable detection (max ~14 syl/s = 28 Hz Nyquist). Lowering to 50 ms saves CPU at the cost of reduced maximum detectable rate. Raising to 15 ms provides finer time resolution but minimal practical benefit.

DRM-protected content (Netflix, Disney+, etc.): captureStream() is blocked on encrypted media. The extension will not work on these sites.
Music and sound effects: The detector may count rhythmic beats as syllables. The bandpass filter and silence gating help, but aren't perfect. A Voice Activity Detection model (e.g., Silero VAD at ~2 MB) could improve this.
Multiple simultaneous speakers: The detector measures aggregate syllable rate across all speakers. This is a reasonable degradation since overlapping speech is hard to parse at any speed.
Non-speech audio content: Podcasts with heavy background music or sound design may see inaccurate rate detection.
SPA navigation: On YouTube and similar single-page apps, the extension re-attaches when the video element changes. There may be a brief 1-2 second gap during navigation.

Silero VAD integration: Adding the ~2 MB Silero VAD model via onnxruntime-web could gate the detector to only count syllables during confirmed speech, ignoring music and ambient sound.
Autocorrelation-based rate estimation: Instead of counting zero-crossings, compute the autocorrelation of the energy envelope to find the dominant modulation frequency. This would be more robust to noise and provide smoother rate estimates.
Per-speaker adaptation: If combined with speaker diarization, the extension could maintain separate rate estimates for each speaker and switch speeds instantly when the active speaker changes.
Keyboard shortcut toggle: Add chrome.commands for quick enable/disable without opening the popup.

MIT