Rust implementation of Mistral's Voxtral Mini 4B Realtime runs in your browser

Original link: https://github.com/TrevorS/voxtral-mini-realtime-rs

## Voxtral Mini 4B: Real-Time Speech Recognition in the Browser

This project implements Mistral's Voxtral Mini 4B realtime speech-recognition model in pure Rust on the Burn ML framework. A key result is that the quantized model (Q4 GGUF, 2.5 GB) runs *entirely client-side* in a web browser via WASM and WebGPU. A hosted demo is available on HuggingFace Spaces.

The system turns audio (16 kHz mono) into text through a mel spectrogram, a causal encoder, and an autoregressive decoder. Browser execution required overcoming memory limits (sharded model weights, optimized embeddings) and GPU constraints. One key fix increases the audio left-padding to work around a quantization sensitivity that appears when speech starts immediately.

Users can download the model weights, transcribe audio files from the command line, or build a WASM package for browser deployment. The project includes comprehensive tests, GPU acceleration (via `wgpu`), and HuggingFace Hub integration. Future work covers benchmarking accuracy and inference speed.



HuggingFace Live Demo

Streaming speech recognition running natively and in the browser. A pure Rust implementation of Mistral's Voxtral Mini 4B Realtime model using the Burn ML framework.

The Q4 GGUF quantized path (2.5 GB) runs entirely client-side in a browser tab via WASM + WebGPU. Try it live.

```bash
# Download model weights (~9 GB)
uv run --with huggingface_hub \
  hf download mistralai/Voxtral-Mini-4B-Realtime-2602 --local-dir models/voxtral

# Transcribe an audio file (f32 SafeTensors path)
cargo run --release --features "wgpu,cli,hub" --bin voxtral-transcribe -- \
  --audio audio.wav --model models/voxtral

# Or use the Q4 quantized path (~2.5 GB)
cargo run --release --features "wgpu,cli,hub" --bin voxtral-transcribe -- \
  --audio audio.wav --gguf models/voxtral-q4.gguf --tokenizer models/voxtral/tekken.json
```
```bash
# Build WASM package
wasm-pack build --target web --no-default-features --features wasm

# Generate self-signed cert (WebGPU requires secure context)
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:prime256v1 \
  -keyout /tmp/voxtral-key.pem -out /tmp/voxtral-cert.pem \
  -days 7 -nodes -subj "/CN=localhost"

# Start dev server
bun serve.mjs
```

Open https://localhost:8443, accept the certificate, and click Load from Server to download the model shards. Record from your microphone or upload a WAV file to transcribe.

Hosted demo on HuggingFace Spaces if you want to skip local setup.

```
Audio (16kHz mono)
  -> Mel spectrogram [B, 128, T]
    -> Causal encoder (32 layers, 1280 dim, sliding window 750)
      -> Conv 4x downsample -> Reshape [B, T/16, 5120]
        -> Adapter [B, T/16, 3072]
          -> Autoregressive decoder (26 layers, 3072 dim, GQA 32Q/8KV)
            -> Token IDs -> Text
```
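
The T/16 reduction falls out of two steps in the diagram: the 4x conv downsample, followed by a reshape that packs four adjacent 1280-dim encoder frames into a single 5120-dim vector (4 * 1280 = 5120). A small Rust sketch of the shape arithmetic, using only the dimensions shown above (the function is illustrative, not project API):

```rust
/// Walks the tensor shapes from the pipeline diagram.
/// Dimensions come from the diagram; batch is fixed at 1 for clarity.
fn trace_shapes(mel_frames: usize) {
    let b = 1;
    println!("mel:     [{b}, 128, {mel_frames}]");
    println!("encoder: [{b}, {mel_frames}, 1280]"); // causal encoder keeps T
    let t_down = mel_frames / 4; // conv 4x downsample
    let t_dec = t_down / 4; // reshape: 4 frames of 1280 -> one 5120-dim vector
    println!("reshape: [{b}, {t_dec}, 5120]"); // T/16 overall
    println!("adapter: [{b}, {t_dec}, 3072]");
    // the decoder then runs autoregressively over these 3072-dim embeddings
}
```
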
| | F32 (native) | Q4 GGUF (native + browser) |
|---|---|---|
| Weights | SafeTensors (~9 GB) | GGUF Q4_0 (~2.5 GB) |
| Linear ops | Burn tensor matmul | Custom WGSL shader (fused dequant + matmul) |
| Embeddings | f32 tensor (1.5 GiB) | Q4 on GPU (216 MB) + CPU bytes for lookups |
| Browser | No | Yes (WASM + WebGPU) |
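
For context on the "fused dequant + matmul" row: GGUF Q4_0 packs weights into blocks of 32 values, each block a little-endian f16 scale followed by 16 bytes of 4-bit quants stored with an offset of 8. A CPU-side sketch of the per-block dequantization the shader performs before multiplying; this follows the standard Q4_0 layout, not the project's WGSL, and assumes the `half` crate for f16:

```rust
/// Dequantizes one standard GGUF Q4_0 block (18 bytes -> 32 f32 weights).
/// Low nibbles hold elements 0..16, high nibbles elements 16..32.
fn dequantize_q4_0_block(block: &[u8; 18], out: &mut [f32; 32]) {
    let scale = half::f16::from_le_bytes([block[0], block[1]]).to_f32();
    for (j, &byte) in block[2..].iter().enumerate() {
        // quants are stored offset by 8, so values land in [-8, 7]
        out[j] = ((byte & 0x0F) as i32 - 8) as f32 * scale;
        out[j + 16] = ((byte >> 4) as i32 - 8) as f32 * scale;
    }
}
```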

The upstream mistral-common library left-pads audio with 32 silence tokens (at 12.5 Hz). After the mel/conv/reshape pipeline, this covers only 16 of the 38 decoder prefix positions with silence — the remaining 22 contain actual audio. The f32 model handles this fine, but Q4_0 quantization makes the decoder sensitive to speech content in the prefix: audio that starts immediately with speech (mic recordings, clips with no leading silence) produces all-pad tokens instead of text.

The left padding is increased to 76 tokens, which maps to exactly 38 decoder tokens of silence and covers the full streaming prefix. See src/audio/pad.rs for details.
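
A sketch of that arithmetic, assuming the 2:1 audio-token-to-decoder-position ratio described above (the real logic lives in src/audio/pad.rs; these names and constants are illustrative):

```rust
const SAMPLE_RATE: usize = 16_000; // Hz, mono input
const TOKEN_RATE_HZ: f64 = 12.5; // audio tokens per second
const DECODER_PREFIX_POSITIONS: usize = 38;

/// Number of silence samples to left-pad so the entire decoder prefix
/// sees silence: 38 positions * 2 audio tokens each = 76 tokens.
fn left_pad_samples() -> usize {
    let pad_tokens = DECODER_PREFIX_POSITIONS * 2; // 76
    let samples_per_token = (SAMPLE_RATE as f64 / TOKEN_RATE_HZ) as usize; // 1280
    pad_tokens * samples_per_token // 97_280 samples, about 6.1 s
}
```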

Running a 4B model in a browser tab required solving five hard constraints:

  1. 2 GB allocation limit — ShardedCursor reads across multiple Vec<u8> buffers (sketched after this list)
  2. 4 GB address space — Two-phase loading: parse weights, drop reader, then finalize
  3. 1.5 GiB embedding table — Q4 embeddings on GPU + CPU-side row lookups
  4. No sync GPU readback — All tensor reads use into_data_async().await
  5. 256 workgroup invocation limit — Patched cubecl-wgpu to cap reduce kernel workgroups
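
A minimal sketch of the ShardedCursor idea from constraint 1: present several sub-2 GB buffers as one contiguous `Read + Seek` stream, so the GGUF parser never needs a single allocation spanning the whole file. Simplified for illustration; the project's actual API may differ:

```rust
use std::io::{Read, Result, Seek, SeekFrom};

/// A seekable byte stream backed by multiple smaller buffers.
struct ShardedCursor {
    shards: Vec<Vec<u8>>,
    pos: u64,
}

impl ShardedCursor {
    fn total_len(&self) -> u64 {
        self.shards.iter().map(|s| s.len() as u64).sum()
    }
}

impl Read for ShardedCursor {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        // Find the shard containing `pos`, then copy from it.
        let mut offset = self.pos;
        for shard in &self.shards {
            let len = shard.len() as u64;
            if offset < len {
                let start = offset as usize;
                let n = buf.len().min(shard.len() - start);
                buf[..n].copy_from_slice(&shard[start..start + n]);
                self.pos += n as u64;
                return Ok(n); // a short read at a shard boundary is fine
            }
            offset -= len;
        }
        Ok(0) // past the end
    }
}

impl Seek for ShardedCursor {
    fn seek(&mut self, from: SeekFrom) -> Result<u64> {
        self.pos = match from {
            SeekFrom::Start(p) => p,
            SeekFrom::End(d) => (self.total_len() as i64 + d) as u64,
            SeekFrom::Current(d) => (self.pos as i64 + d) as u64,
        };
        Ok(self.pos)
    }
}
```
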
```bash
# Native (default features: wgpu + native-tokenizer)
cargo build --release

# With all features
cargo build --release --features "wgpu,cli,hub"

# WASM
wasm-pack build --target web --no-default-features --features wasm
```
| Feature | Description |
|---|---|
| `wgpu` (default) | GPU backend via Burn/CubeCL (WebGPU, Vulkan, Metal) |
| `native-tokenizer` (default) | Tekken tokenizer (C deps, not WASM-compatible) |
| `wasm` | Browser support: wasm-bindgen, WebGPU device init, JS bindings |
| `cli` | CLI binary with clap + indicatif |
| `hub` | HuggingFace Hub model downloads |
```bash
# Unit + integration tests (requires GPU for full suite)
cargo test --features "wgpu,cli,hub"

# Lint
cargo clippy --features "wgpu,cli,hub" -- -D warnings
cargo clippy --no-default-features --features wasm --target wasm32-unknown-unknown -- -D warnings

# E2E browser test (requires Playwright + model shards)
bunx playwright test tests/e2e_browser.spec.ts
```

GPU-dependent tests (model layer shapes, Q4 matmul, WGSL shader correctness) are skipped in CI since GitHub Actions runners lack a GPU adapter. These tests run locally on any machine with Vulkan, Metal, or WebGPU support.

Q4 GGUF Sharding (for browser)

The GGUF file must be split into shards of 512 MB or less to stay under the browser's ArrayBuffer limit:

```bash
split -b 512m models/voxtral-q4.gguf models/voxtral-q4-shards/shard-
```

The dev server and E2E test discover shards automatically from models/voxtral-q4-shards/.
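
For illustration, a hypothetical Rust helper that does the same discovery: split's two-letter suffixes (shard-aa, shard-ab, ...) sort lexicographically, so reading the files back in sorted order reassembles the original byte stream, ready for something like the ShardedCursor sketched earlier:

```rust
use std::{fs, io, path::Path};

/// Loads GGUF shards in lexicographic order; `split` suffixes sort correctly.
fn load_shards(dir: &Path) -> io::Result<Vec<Vec<u8>>> {
    let mut paths: Vec<_> = fs::read_dir(dir)?
        .collect::<io::Result<Vec<_>>>()?
        .into_iter()
        .map(|entry| entry.path())
        .filter(|p| {
            p.file_name()
                .and_then(|n| n.to_str())
                .is_some_and(|n| n.starts_with("shard-"))
        })
        .collect();
    paths.sort();
    paths.iter().map(fs::read).collect()
}
```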

Coming soon: accuracy (WER) and inference speed benchmarks across native and browser targets.

```
src/
  audio/          # Mel spectrogram, chunking, resampling, padding
  models/         # F32 model: encoder, decoder, adapter, attention, RoPE, KV cache
  gguf/           # Q4 GGUF: reader, loader, model, tensor, WGSL shader, tests
  web/            # WASM bindings: VoxtralQ4, initWgpuDevice, async decode loop
  tokenizer/      # Tekken tokenizer wrapper (native only)
  bin/transcribe  # CLI binary

web/              # Browser demo: index.html, worker.js, voxtral-client.js
tests/            # Integration tests + Playwright E2E spec
scripts/          # Dev scripts: reference implementations, weight inspection, E2E helpers
patches/          # cubecl-wgpu workgroup size fix for WebGPU
```

Apache-2.0
