Fast speech recognition with NVIDIA's Parakeet models in pure C++.
Built on axiom — a lightweight tensor library with automatic Metal GPU acceleration. No ONNX runtime, no Python runtime, no heavyweight dependencies. Just C++ and one tensor library that outruns PyTorch MPS.
~27ms encoder inference on Apple Silicon GPU for 10s audio (110M model) — 96x faster than CPU.
| Model | Class | Size | Type | Description |
|---|---|---|---|---|
| tdt-ctc-110m | ParakeetTDTCTC | 110M | Offline | English, dual CTC/TDT decoder heads |
| tdt-600m | ParakeetTDT | 600M | Offline | Multilingual, TDT decoder |
| eou-120m | ParakeetEOU | 120M | Streaming | English, RNNT with end-of-utterance detection |
| nemotron-600m | ParakeetNemotron | 600M | Streaming | Multilingual, configurable latency (80ms–1120ms) |
| sortformer | Sortformer | 117M | Streaming | Speaker diarization (up to 4 speakers) |

All ASR models share the same audio pipeline: 16kHz mono WAV → 80-bin Mel spectrogram → FastConformer encoder.
#include <parakeet/parakeet.hpp>
parakeet::Transcriber t("model.safetensors", "vocab.txt");
t.to_gpu(); // optional — Metal acceleration
auto result = t.transcribe("audio.wav");
std::cout << result.text << std::endl;

Choose decoder at call site:

auto result = t.transcribe("audio.wav", parakeet::Decoder::CTC); // fast greedy
auto result = t.transcribe("audio.wav", parakeet::Decoder::TDT); // better accuracy (default)

Word-level timestamps:

auto result = t.transcribe("audio.wav", parakeet::Decoder::TDT, /*timestamps=*/true);
for (const auto &w : result.word_timestamps) {
    std::cout << "[" << w.start << "s - " << w.end << "s] " << w.word << std::endl;
}

parakeet::Transcriber t("model.safetensors", "vocab.txt");
t.to_gpu();
auto result = t.transcribe("audio.wav");

parakeet::TDTTranscriber t("model.safetensors", "vocab.txt",
                           parakeet::make_tdt_600m_config());
auto result = t.transcribe("audio.wav");

parakeet::StreamingTranscriber t("model.safetensors", "vocab.txt",
                                 parakeet::make_eou_120m_config());
// Feed audio chunks (e.g., from microphone)
while (auto chunk = get_audio_chunk()) {
    auto text = t.transcribe_chunk(chunk);
    if (!text.empty()) std::cout << text << std::flush;
}
std::cout << t.get_text() << std::endl;

// Latency modes: 0=80ms, 1=160ms, 6=560ms, 13=1120ms
auto cfg = parakeet::make_nemotron_600m_config(/*latency_frames=*/1);
parakeet::NemotronTranscriber t("model.safetensors", "vocab.txt", cfg);
while (auto chunk = get_audio_chunk()) {
    auto text = t.transcribe_chunk(chunk);
    if (!text.empty()) std::cout << text << std::flush;
}

Identify who spoke when. Sortformer detects up to 4 speakers with per-frame activity probabilities:

parakeet::Sortformer model(parakeet::make_sortformer_117m_config());
model.load_state_dict(axiom::io::safetensors::load("sortformer.safetensors"));
auto wav = parakeet::read_wav("meeting.wav");
auto features = parakeet::preprocess_audio(wav.samples, {.normalize = false});
auto segments = model.diarize(features);
for (const auto &seg : segments) {
    std::cout << "Speaker " << seg.speaker_id
              << ": [" << seg.start << "s - " << seg.end << "s]" << std::endl;
}
// Speaker 0: [0.56s - 2.96s]
// Speaker 0: [3.36s - 4.40s]
// Speaker 1: [4.80s - 6.24s]

Streaming diarization with arrival-order speaker tracking:

parakeet::Sortformer model(parakeet::make_sortformer_117m_config());
model.load_state_dict(axiom::io::safetensors::load("sortformer.safetensors"));
parakeet::EncoderCache enc_cache;
parakeet::AOSCCache aosc_cache(4); // max 4 speakers
while (auto chunk = get_audio_chunk()) {
    auto features = parakeet::preprocess_audio(chunk, {.normalize = false});
    auto segments = model.diarize_chunk(features, enc_cache, aosc_cache);
    for (const auto &seg : segments) {
        std::cout << "Speaker " << seg.speaker_id
                  << ": [" << seg.start << "s - " << seg.end << "s]" << std::endl;
    }
}

For full control over the pipeline:

CTC (English, punctuation & capitalization):
auto cfg = parakeet::make_110m_config();
parakeet::ParakeetTDTCTC model(cfg);
model.load_state_dict(axiom::io::safetensors::load("model.safetensors"));
auto wav = parakeet::read_wav("audio.wav");
auto features = parakeet::preprocess_audio(wav.samples);
auto encoder_out = model.encoder()(features);
auto log_probs = model.ctc_decoder()(encoder_out);
auto tokens = parakeet::ctc_greedy_decode(log_probs);
parakeet::Tokenizer tokenizer;
tokenizer.load("vocab.txt");
std::cout << tokenizer.decode(tokens[0]) << std::endl;

TDT (Token-and-Duration Transducer):

auto encoder_out = model.encoder()(features);
auto tokens = parakeet::tdt_greedy_decode(model, encoder_out, cfg.durations);
std::cout << tokenizer.decode(tokens[0]) << std::endl;

Timestamps (CTC or TDT):

// CTC timestamps
auto ts = parakeet::ctc_greedy_decode_with_timestamps(log_probs);
// TDT timestamps
auto ts = parakeet::tdt_greedy_decode_with_timestamps(model, encoder_out, cfg.durations);
// Group into word-level timestamps
auto words = parakeet::group_timestamps(ts[0], tokenizer.pieces());

GPU acceleration (Metal):

model.to(axiom::Device::GPU);
auto features_gpu = features.gpu();
auto encoder_out = model.encoder()(features_gpu);
// Decode on CPU
auto tokens = parakeet::ctc_greedy_decode(
    model.ctc_decoder()(encoder_out).cpu()
);

Usage: parakeet <model.safetensors> <audio.wav> [options]

Model types:
  --model TYPE     Model type (default: tdt-ctc-110m)
                   Types: tdt-ctc-110m, tdt-600m, eou-120m,
                          nemotron-600m, sortformer

Decoder options:
  --ctc            Use CTC decoder (default: TDT)
  --tdt            Use TDT decoder

Other options:
  --vocab PATH     SentencePiece vocab file
  --gpu            Run on Metal GPU
  --timestamps     Show word-level timestamps
  --streaming      Use streaming mode (eou/nemotron models)
  --latency N      Right context frames for nemotron (0/1/6/13)
  --features PATH  Load pre-computed features from .npy file

Examples:
# Basic transcription (TDT decoder, default)
./build/parakeet model.safetensors audio.wav --vocab vocab.txt
# CTC decoder
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --ctc
# GPU acceleration
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --gpu
# Word-level timestamps
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --timestamps
# 600M multilingual TDT model
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model tdt-600m
# Streaming with EOU
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model eou-120m
# Nemotron streaming with configurable latency
./build/parakeet model.safetensors audio.wav --vocab vocab.txt --model nemotron-600m --latency 6
# Speaker diarization
./build/parakeet sortformer.safetensors meeting.wav --model sortformer
# Speaker 0: [0.56s - 2.96s]
# Speaker 0: [3.36s - 4.40s]
# Speaker 1: [4.80s - 6.24s]

Requires C++20. Axiom is the only dependency (included as a submodule).

git clone --recursive https://github.com/noahkay13/parakeet.cpp
cd parakeet.cpp
make build

Download a NeMo checkpoint from NVIDIA and convert to safetensors:

# Download from HuggingFace (requires pip install huggingface_hub)
huggingface-cli download nvidia/parakeet-tdt_ctc-110m --include "*.nemo" --local-dir .
# Convert to safetensors
pip install safetensors torch
python scripts/convert_nemo.py parakeet-tdt_ctc-110m.nemo -o model.safetensors

The converter supports all model types via the --model flag:

# 110M TDT-CTC (default)
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model 110m-tdt-ctc
# 600M multilingual TDT
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model 600m-tdt
# 120M EOU streaming
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model eou-120m
# 600M Nemotron streaming
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model nemotron-600m
# 117M Sortformer diarization
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model sortformer

The converter also supports raw .ckpt files and checkpoint inspection:

python scripts/convert_nemo.py model_weights.ckpt -o model.safetensors
python scripts/convert_nemo.py --dump model.nemo  # inspect checkpoint keys

Grab the SentencePiece vocab from the same HuggingFace repo. The file is inside the .nemo archive, or download directly:

# Extract from .nemo
tar xf parakeet-tdt_ctc-110m.nemo ./tokenizer.model
# or use the vocab.txt from the HF files page

Built on a shared FastConformer encoder (Conv2d 8x subsampling → N Conformer blocks with relative positional attention):

| Model | Class | Decoder | Use case |
|---|---|---|---|
| CTC | ParakeetCTC | Greedy argmax | Fast, English-only |
| RNNT | ParakeetRNNT | Autoregressive LSTM | Streaming capable |
| TDT | ParakeetTDT | LSTM + duration prediction | Better accuracy than RNNT |
| TDT-CTC | ParakeetTDTCTC | Both TDT and CTC heads | Switch decoder at inference |

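The "greedy argmax" decoding listed for the CTC head can be illustrated without the library. The following is a minimal sketch, assuming the log-probabilities are stored row-major as `[num_frames x vocab_size]` in a plain `std::vector<float>`. The name `greedy_ctc` is hypothetical and is not parakeet's `ctc_greedy_decode`; it only shows the usual CTC rule of taking the per-frame argmax, collapsing repeats, and dropping blanks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical standalone helper (not part of parakeet.cpp): greedy CTC
// decoding over a row-major [num_frames x vocab_size] matrix of log-probs.
std::vector<int> greedy_ctc(const std::vector<float> &log_probs,
                            std::size_t vocab_size, int blank_id) {
    assert(vocab_size > 0 && log_probs.size() % vocab_size == 0);
    const std::size_t num_frames = log_probs.size() / vocab_size;

    std::vector<int> tokens;
    int prev = -1; // sentinel: no previous frame yet
    for (std::size_t t = 0; t < num_frames; ++t) {
        // Argmax over the vocabulary for this frame.
        std::size_t best = 0;
        for (std::size_t v = 1; v < vocab_size; ++v) {
            if (log_probs[t * vocab_size + v] > log_probs[t * vocab_size + best]) {
                best = v;
            }
        }
        const int id = static_cast<int>(best);
        // Emit a token only when the label changes and is not the blank.
        if (id != prev && id != blank_id) tokens.push_back(id);
        prev = id; // blanks also update prev, so "a, blank, a" yields two tokens
    }
    return tokens;
}
```

With the 110M model the blank ID would be 1024, as listed in the notes; the resulting IDs would then be passed to a tokenizer to obtain text.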
Built on a cache-aware streaming FastConformer encoder with causal convolutions and bounded-context attention:
| Model | Class | Decoder | Use case |
|---|---|---|---|
| EOU | ParakeetEOU | Streaming RNNT | End-of-utterance detection |
| Nemotron | ParakeetNemotron | Streaming TDT | Configurable latency streaming |

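The four Nemotron latency settings documented in this README (0=80ms, 1=160ms, 6=560ms, 13=1120ms) are all consistent with `(right-context frames + 1) × 80 ms`, where 80 ms is one encoder frame. The sketch below encodes that relationship. The formula is inferred from those four values rather than taken from the library, and `nemotron_latency_ms` is a hypothetical helper.

```cpp
#include <cassert>

// One encoder frame spans 80 ms (8x subsampling x 160 hop / 16 kHz).
constexpr int kEncoderFrameMs = 80;

// Hypothetical helper: approximate latency for a given number of
// right-context frames. Assumes latency = (frames + 1) * 80 ms, which
// matches the documented settings 0, 1, 6, and 13.
int nemotron_latency_ms(int right_context_frames) {
    assert(right_context_frames >= 0);
    return (right_context_frames + 1) * kEncoderFrameMs;
}
```

For example, `nemotron_latency_ms(6)` gives 560, matching the `--latency 6` setting.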
| Model | Class | Architecture | Use case |
|---|---|---|---|
| Sortformer | Sortformer | NEST encoder → Transformer → sigmoid | Speaker diarization (up to 4 speakers) |

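To show how per-frame sigmoid outputs could become the speaker segments printed earlier, here is a minimal sketch that thresholds each speaker's activity probabilities and merges consecutive active frames. The `Segment` struct, the `probs_to_segments` name, and the 0.5 threshold are assumptions for illustration. The 0.08 s frame duration comes from the notes in this README. This is not necessarily how `diarize` post-processes its output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type for illustration; the library's own segment type may differ.
struct Segment {
    int speaker_id;
    double start; // seconds
    double end;   // seconds
};

// probs[t][s] is the activity probability of speaker s at frame t.
// Segments are returned grouped by speaker, not sorted by time.
std::vector<Segment> probs_to_segments(const std::vector<std::vector<float>> &probs,
                                       float threshold = 0.5f,
                                       double frame_sec = 0.08) {
    std::vector<Segment> segments;
    if (probs.empty()) return segments;
    const std::size_t num_speakers = probs[0].size();
    for (const auto &row : probs) assert(row.size() == num_speakers);

    for (std::size_t s = 0; s < num_speakers; ++s) {
        bool active = false;
        std::size_t start_frame = 0;
        // Iterate one step past the end so an open run is always closed.
        for (std::size_t t = 0; t <= probs.size(); ++t) {
            const bool on = t < probs.size() && probs[t][s] > threshold;
            if (on && !active) {
                start_frame = t;
                active = true;
            } else if (!on && active) {
                segments.push_back({static_cast<int>(s),
                                    start_frame * frame_sec, t * frame_sec});
                active = false;
            }
        }
    }
    return segments;
}
```

Because each speaker is thresholded independently, overlapping speech naturally produces overlapping segments for different speaker IDs.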
Measured on Apple M3 16GB with simulated audio input (Tensor::randn). Times are per-encoder-forward-pass (Sortformer: full forward pass).
Encoder throughput — 10s audio:
| Model | Params | CPU (ms) | GPU (ms) | GPU Speedup |
|---|---|---|---|---|
| 110m (TDT-CTC) | 110M | 2,581 | 27 | 96x |
| tdt-600m | 600M | 10,779 | 520 | 21x |
| rnnt-600m | 600M | 10,648 | 1,468 | 7x |
| sortformer | 117M | 3,195 | 479 | 7x |
110m GPU scaling across audio lengths:
| Audio | CPU (ms) | GPU (ms) | RTF | Throughput |
|---|---|---|---|---|
| 1s | 262 | 24 | 0.024 | 41x |
| 5s | 1,222 | 26 | 0.005 | 190x |
| 10s | 2,581 | 27 | 0.003 | 370x |
| 30s | 10,061 | 32 | 0.001 | 935x |
| 60s | 26,559 | 72 | 0.001 | 833x |
GPU acceleration is powered by axiom's Metal graph compiler, which fuses the full encoder into optimized MPSGraph operations.
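The derived columns in the tables above can be reproduced from the raw timings. The sketch below assumes the conventional definitions: RTF is processing time divided by audio duration, throughput is its inverse, and speedup is CPU time divided by GPU time. These definitions are an assumption, but they agree with the published rows (for 10 s of audio, 27 ms gives an RTF of about 0.003 and a throughput of about 370x). The helper names are hypothetical.

```cpp
#include <cassert>

// Hypothetical helpers for interpreting the benchmark columns.
struct Timing {
    double audio_sec;  // duration of the input audio
    double elapsed_ms; // measured processing time
};

// Real-time factor: processing time / audio duration (lower is faster).
double real_time_factor(const Timing &t) {
    assert(t.audio_sec > 0.0);
    return (t.elapsed_ms / 1000.0) / t.audio_sec;
}

// Throughput: seconds of audio processed per second of compute (1 / RTF).
double throughput(const Timing &t) {
    assert(t.elapsed_ms > 0.0);
    return t.audio_sec / (t.elapsed_ms / 1000.0);
}

// Speedup of the GPU path over the CPU path for the same input.
double speedup(double cpu_ms, double gpu_ms) {
    assert(gpu_ms > 0.0);
    return cpu_ms / gpu_ms;
}
```

Recomputing from the rounded millisecond values can differ slightly from the table (for example, 5 s / 26 ms gives about 192x rather than 190x), presumably because the published figures were derived from unrounded timings.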
# Full suite
make bench ARGS="--110m=models/model.safetensors --tdt-600m=models/tdt.safetensors"
# Single model
make bench-single ARGS="--110m=models/model.safetensors --benchmark_filter=110m"
# Markdown table output
./build/parakeet_bench --110m=models/model.safetensors --markdown
# Skip GPU benchmarks
./build/parakeet_bench --110m=models/model.safetensors --no-gpu

Available model flags: --110m, --tdt-600m, --rnnt-600m, --sortformer. All Google Benchmark flags (--benchmark_filter, --benchmark_format=json, --benchmark_repetitions=N) are passed through.

- Audio: 16kHz mono WAV (16-bit PCM or 32-bit float)
- Offline models have ~4-5 minute audio length limits; split longer files or use streaming models
- Blank token ID is 1024 (110M) or 8192 (600M)
- GPU acceleration requires Apple Silicon with Metal support
- Timestamps use frame-level alignment: frame * 0.08s (8x subsampling × 160 hop / 16kHz)
- Sortformer diarization uses unnormalized features (normalize = false); this differs from ASR models
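
The frame timing in the notes above follows directly from the pipeline constants. The sketch below works through the arithmetic; the helper names are hypothetical, and the frame-count estimate ignores padding in the Mel and convolution stages, so the real encoder output length may differ by a frame or two.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kSampleRate = 16000; // Hz
constexpr int kHopLength = 160;    // samples between Mel frames (10 ms)
constexpr int kSubsampling = 8;    // encoder downsampling factor

// Seconds covered by one encoder frame: 8 * 160 / 16000 = 0.08.
constexpr double kFrameSec =
    static_cast<double>(kSubsampling * kHopLength) / kSampleRate;

// Hypothetical helper: start time in seconds of an encoder frame index.
double frame_to_seconds(std::int64_t frame) {
    assert(frame >= 0);
    return static_cast<double>(frame) * kFrameSec;
}

// Hypothetical helper: rough number of encoder frames for a sample count,
// ignoring any padding applied by the feature extractor or encoder.
std::int64_t approx_encoder_frames(std::int64_t num_samples) {
    assert(num_samples >= 0);
    return num_samples / (kSubsampling * kHopLength);
}
```

Under these assumptions, 10 s of audio (160,000 samples) maps to roughly 125 encoder frames, and frame 7 starts at about 0.56 s, which matches the first diarization segment shown earlier.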
MIT