Parakeet.cpp – Parakeet ASR inference in pure C++ with Metal GPU acceleration

Original link: https://github.com/Frikallo/parakeet.cpp

## Parakeet: a fast, offline, and streaming speech recognition toolkit in C++

Parakeet is a high-performance speech recognition toolkit built entirely in C++ on the lightweight `axiom` tensor library, with automatic Metal GPU acceleration on Apple Silicon. It sidesteps dependencies like Python and ONNX and achieves significant speedups: up to 96x over CPU inference with the 110M model.

**Key features:**

* **Multiple models:** offline (TDT-CTC 110M/600M) and streaming (EOU 120M, Nemotron 600M) ASR models, plus a Sortformer model for speaker diarization.
* **Fast inference:** ~27ms encoder inference for 10s of audio (110M model) on Apple Silicon.
* **Easy to use:** a simple API behind a single header (`#include <parakeet/parakeet.hpp>`).
* **Flexibility:** decoder selection (CTC, TDT), word-level timestamps, and full pipeline control.
* **GPU acceleration:** optimized performance via `axiom`'s Metal graph compiler.

**Getting started:** requires C++20; build with `make`. Pretrained models are available on Hugging Face and can be converted with the provided scripts. The toolkit supports a range of command-line options for model selection, GPU usage, and output formatting.

From the Hacker News submission (by noahkay13):

> I built a C++ inference engine for NVIDIA's Parakeet speech recognition models using my tensor library, Axiom (https://github.com/Frikallo/axiom). It can:
>
> - run 7 model families: offline transcription (CTC, RNNT, TDT, TDT-CTC), streaming (EOU, Nemotron), and speaker diarization (Sortformer)
> - produce word-level timestamps
> - stream transcription from microphone input
> - perform speaker diarization, detecting up to 4 speakers

Original article

Fast speech recognition with NVIDIA's Parakeet models in pure C++.

Built on axiom — a lightweight tensor library with automatic Metal GPU acceleration. No ONNX runtime, no Python runtime, no heavyweight dependencies. Just C++ and one tensor library that outruns PyTorch MPS.

~27ms encoder inference on Apple Silicon GPU for 10s audio (110M model) — 96x faster than CPU.

| Model | Class | Size | Type | Description |
|---|---|---|---|---|
| tdt-ctc-110m | ParakeetTDTCTC | 110M | Offline | English, dual CTC/TDT decoder heads |
| tdt-600m | ParakeetTDT | 600M | Offline | Multilingual, TDT decoder |
| eou-120m | ParakeetEOU | 120M | Streaming | English, RNNT with end-of-utterance detection |
| nemotron-600m | ParakeetNemotron | 600M | Streaming | Multilingual, configurable latency (80ms–1120ms) |
| sortformer | Sortformer | 117M | Streaming | Speaker diarization (up to 4 speakers) |

All ASR models share the same audio pipeline: 16kHz mono WAV → 80-bin Mel spectrogram → FastConformer encoder.

```cpp
#include <parakeet/parakeet.hpp>

parakeet::Transcriber t("model.safetensors", "vocab.txt");
t.to_gpu();  // optional — Metal acceleration

auto result = t.transcribe("audio.wav");
std::cout << result.text << std::endl;
```

Choose decoder at call site:

```cpp
auto result = t.transcribe("audio.wav", parakeet::Decoder::CTC);  // fast greedy
auto result = t.transcribe("audio.wav", parakeet::Decoder::TDT);  // better accuracy (default)
```

Word-level timestamps:

```cpp
auto result = t.transcribe("audio.wav", parakeet::Decoder::TDT, /*timestamps=*/true);
for (const auto &w : result.word_timestamps) {
    std::cout << "[" << w.start << "s - " << w.end << "s] " << w.word << std::endl;
}
```

Offline Transcription (TDT-CTC 110M)

```cpp
parakeet::Transcriber t("model.safetensors", "vocab.txt");
t.to_gpu();
auto result = t.transcribe("audio.wav");
```

Offline Transcription (TDT 600M Multilingual)

```cpp
parakeet::TDTTranscriber t("model.safetensors", "vocab.txt",
                            parakeet::make_tdt_600m_config());
auto result = t.transcribe("audio.wav");
```

Streaming Transcription (EOU 120M)

```cpp
parakeet::StreamingTranscriber t("model.safetensors", "vocab.txt",
                                  parakeet::make_eou_120m_config());

// Feed audio chunks (e.g., from microphone)
while (auto chunk = get_audio_chunk()) {
    auto text = t.transcribe_chunk(chunk);
    if (!text.empty()) std::cout << text << std::flush;
}
std::cout << t.get_text() << std::endl;
```

Streaming Transcription (Nemotron 600M)

```cpp
// Latency modes: 0=80ms, 1=160ms, 6=560ms, 13=1120ms
auto cfg = parakeet::make_nemotron_600m_config(/*latency_frames=*/1);
parakeet::NemotronTranscriber t("model.safetensors", "vocab.txt", cfg);

while (auto chunk = get_audio_chunk()) {
    auto text = t.transcribe_chunk(chunk);
    if (!text.empty()) std::cout << text << std::flush;
}
```

Speaker Diarization (Sortformer 117M)

Identify who spoke when — detects up to 4 speakers with per-frame activity probabilities:

```cpp
parakeet::Sortformer model(parakeet::make_sortformer_117m_config());
model.load_state_dict(axiom::io::safetensors::load("sortformer.safetensors"));

auto wav = parakeet::read_wav("meeting.wav");
auto features = parakeet::preprocess_audio(wav.samples, {.normalize = false});
auto segments = model.diarize(features);

for (const auto &seg : segments) {
    std::cout << "Speaker " << seg.speaker_id
              << ": [" << seg.start << "s - " << seg.end << "s]" << std::endl;
}
// Speaker 0: [0.56s - 2.96s]
// Speaker 0: [3.36s - 4.40s]
// Speaker 1: [4.80s - 6.24s]
```

Streaming diarization with arrival-order speaker tracking:

```cpp
parakeet::Sortformer model(parakeet::make_sortformer_117m_config());
model.load_state_dict(axiom::io::safetensors::load("sortformer.safetensors"));

parakeet::EncoderCache enc_cache;
parakeet::AOSCCache aosc_cache(4);  // max 4 speakers

while (auto chunk = get_audio_chunk()) {
    auto features = parakeet::preprocess_audio(chunk, {.normalize = false});
    auto segments = model.diarize_chunk(features, enc_cache, aosc_cache);
    for (const auto &seg : segments) {
        std::cout << "Speaker " << seg.speaker_id
                  << ": [" << seg.start << "s - " << seg.end << "s]" << std::endl;
    }
}
```

For full control over the pipeline:

CTC (English, punctuation & capitalization):

```cpp
auto cfg = parakeet::make_110m_config();
parakeet::ParakeetTDTCTC model(cfg);
model.load_state_dict(axiom::io::safetensors::load("model.safetensors"));

auto wav = parakeet::read_wav("audio.wav");
auto features = parakeet::preprocess_audio(wav.samples);
auto encoder_out = model.encoder()(features);

auto log_probs = model.ctc_decoder()(encoder_out);
auto tokens = parakeet::ctc_greedy_decode(log_probs);

parakeet::Tokenizer tokenizer;
tokenizer.load("vocab.txt");
std::cout << tokenizer.decode(tokens[0]) << std::endl;
```

TDT (Token-and-Duration Transducer):

```cpp
auto encoder_out = model.encoder()(features);
auto tokens = parakeet::tdt_greedy_decode(model, encoder_out, cfg.durations);
std::cout << tokenizer.decode(tokens[0]) << std::endl;
```

Timestamps (CTC or TDT):

```cpp
// CTC timestamps
auto ts = parakeet::ctc_greedy_decode_with_timestamps(log_probs);

// TDT timestamps
auto ts = parakeet::tdt_greedy_decode_with_timestamps(model, encoder_out, cfg.durations);

// Group into word-level timestamps
auto words = parakeet::group_timestamps(ts[0], tokenizer.pieces());
```

GPU acceleration (Metal):

```cpp
model.to(axiom::Device::GPU);
auto features_gpu = features.gpu();
auto encoder_out = model.encoder()(features_gpu);

// Decode on CPU
auto tokens = parakeet::ctc_greedy_decode(
    model.ctc_decoder()(encoder_out).cpu()
);
```

```
Usage: parakeet <model.safetensors> <audio.wav> [options]

Model types:
  --model TYPE     Model type (default: tdt-ctc-110m)
                   Types: tdt-ctc-110m, tdt-600m, eou-120m,
                          nemotron-600m, sortformer

Decoder options:
  --ctc            Use CTC decoder (default: TDT)
  --tdt            Use TDT decoder

Other options:
  --vocab PATH     SentencePiece vocab file
  --gpu            Run on Metal GPU
  --timestamps     Show word-level timestamps
  --streaming      Use streaming mode (eou/nemotron models)
  --latency N      Right context frames for nemotron (0/1/6/13)
  --features PATH  Load pre-computed features from .npy file
```

Examples:

```sh
# Basic transcription (TDT decoder, default)
./build/parakeet model.safetensors audio.wav --vocab vocab.txt

# CTC decoder
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --ctc

# GPU acceleration
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --gpu

# Word-level timestamps
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --timestamps

# 600M multilingual TDT model
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model tdt-600m

# Streaming with EOU
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model eou-120m

# Nemotron streaming with configurable latency
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model nemotron-600m --latency 6

# Speaker diarization
./build/parakeet sortformer.safetensors meeting.wav --model sortformer
# Speaker 0: [0.56s - 2.96s]
# Speaker 0: [3.36s - 4.40s]
# Speaker 1: [4.80s - 6.24s]
```

Requires C++20. Axiom is the only dependency (included as a submodule).

```sh
git clone --recursive https://github.com/Frikallo/parakeet.cpp
cd parakeet.cpp
make build
```

Download a NeMo checkpoint from NVIDIA and convert to safetensors:

```sh
# Download from HuggingFace (requires pip install huggingface_hub)
huggingface-cli download nvidia/parakeet-tdt_ctc-110m --include "*.nemo" --local-dir .

# Convert to safetensors
pip install safetensors torch
python scripts/convert_nemo.py parakeet-tdt_ctc-110m.nemo -o model.safetensors
```

The converter supports all model types via the --model flag:

```sh
# 110M TDT-CTC (default)
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model 110m-tdt-ctc

# 600M multilingual TDT
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model 600m-tdt

# 120M EOU streaming
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model eou-120m

# 600M Nemotron streaming
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model nemotron-600m

# 117M Sortformer diarization
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model sortformer
```

Also supports raw .ckpt files and inspection:

```sh
python scripts/convert_nemo.py model_weights.ckpt -o model.safetensors
python scripts/convert_nemo.py --dump model.nemo  # inspect checkpoint keys
```

Grab the SentencePiece vocab from the same HuggingFace repo. The file is inside the .nemo archive, or download directly:

```sh
# Extract from .nemo
tar xf parakeet-tdt_ctc-110m.nemo ./tokenizer.model
# or use the vocab.txt from the HF files page
```

Built on a shared FastConformer encoder (Conv2d 8x subsampling → N Conformer blocks with relative positional attention):

| Model | Class | Decoder | Use case |
|---|---|---|---|
| CTC | ParakeetCTC | Greedy argmax | Fast, English-only |
| RNNT | ParakeetRNNT | Autoregressive LSTM | Streaming capable |
| TDT | ParakeetTDT | LSTM + duration prediction | Better accuracy than RNNT |
| TDT-CTC | ParakeetTDTCTC | Both TDT and CTC heads | Switch decoder at inference |

Built on a cache-aware streaming FastConformer encoder with causal convolutions and bounded-context attention:

| Model | Class | Decoder | Use case |
|---|---|---|---|
| EOU | ParakeetEOU | Streaming RNNT | End-of-utterance detection |
| Nemotron | ParakeetNemotron | Streaming TDT | Configurable latency streaming |

| Model | Class | Architecture | Use case |
|---|---|---|---|
| Sortformer | Sortformer | NEST encoder → Transformer → sigmoid | Speaker diarization (up to 4 speakers) |

Measured on Apple M3 16GB with simulated audio input (Tensor::randn). Times are per-encoder-forward-pass (Sortformer: full forward pass).

Encoder throughput — 10s audio:

| Model | Params | CPU (ms) | GPU (ms) | GPU Speedup |
|---|---|---|---|---|
| 110m (TDT-CTC) | 110M | 2,581 | 27 | 96x |
| tdt-600m | 600M | 10,779 | 520 | 21x |
| rnnt-600m | 600M | 10,648 | 1,468 | 7x |
| sortformer | 117M | 3,195 | 479 | 7x |

110m GPU scaling across audio lengths:

| Audio | CPU (ms) | GPU (ms) | RTF | Throughput |
|---|---|---|---|---|
| 1s | 262 | 24 | 0.024 | 41x |
| 5s | 1,222 | 26 | 0.005 | 190x |
| 10s | 2,581 | 27 | 0.003 | 370x |
| 30s | 10,061 | 32 | 0.001 | 935x |
| 60s | 26,559 | 72 | 0.001 | 833x |

GPU acceleration powered by axiom's Metal graph compiler which fuses the full encoder into optimized MPSGraph operations.

```sh
# Full suite
make bench ARGS="--110m=models/model.safetensors --tdt-600m=models/tdt.safetensors"

# Single model
make bench-single ARGS="--110m=models/model.safetensors --benchmark_filter=110m"

# Markdown table output
./build/parakeet_bench --110m=models/model.safetensors --markdown

# Skip GPU benchmarks
./build/parakeet_bench --110m=models/model.safetensors --no-gpu
```

Available model flags: --110m, --tdt-600m, --rnnt-600m, --sortformer. All Google Benchmark flags (--benchmark_filter, --benchmark_format=json, --benchmark_repetitions=N) are passed through.

  • Audio: 16kHz mono WAV (16-bit PCM or 32-bit float)
  • Offline models have ~4-5 minute audio length limits; split longer files or use streaming models
  • Blank token ID is 1024 (110M) or 8192 (600M)
  • GPU acceleration requires Apple Silicon with Metal support
  • Timestamps use frame-level alignment: frame * 0.08s (8x subsampling × 160 hop / 16kHz)
  • Sortformer diarization uses unnormalized features (normalize = false) — this differs from ASR models

MIT
