I have written gemma3 inference in pure C

Original link: https://github.com/robitec97/gemma3.c

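## gemma3.c: a pure C inference engine for Gemma 3 4B IT

gemma3.c is a CPU inference engine built from scratch. It shows that a large language model such as Gemma 3 4B IT can run *without* Python, PyTorch, or a GPU. It is written entirely in C11 with no external dependencies and implements the full Gemma 3 architecture, including GQA, hybrid attention, and SwiGLU. Key features include memory-mapped weights (BF16 SafeTensors), a native SentencePiece tokenizer (262K vocabulary), and streaming output. It provides both a CLI and a library API, and runs natively on Linux/macOS or on Windows through WSL or MinGW. A Python script is included to download and verify the model. Throughput is roughly 2-5 tokens/s during prefill and 1-3 tokens/s during generation, with about 3 GB of RAM required. The engine is currently text-only and has no quantization. The code is released under the MIT license; the model weights are governed by Google's Gemma license.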

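A developer shared a pure C implementation of Gemma 3 inference (github.com/robitec97), which prompted a discussion on Hacker News. The project implements the core pieces of an LLM, from GELU to RoPE, in compact code (600 lines). Commenters pointed out the speedups available from SIMD (for example via the highway library) compared with plain C loops, and acknowledged the complexity that GPU dependencies bring. Python and PyTorch themselves rest on underlying C libraries, but the project shows that an LLM *can* run without them. The discussion covered the project's relevance given that llama.cpp exists, and the fact, surprising to some, that Gemma 3 *is* used in production, particularly for its strong multilingual support and "safe" output, which makes it suitable for chatbots in regions such as Europe. Users also noted its decent multimodal capabilities. Although newer Gemma versions are expected, some consider it a valuable, fine-tunable model. The project also prompted thoughts on energy efficiency: despite their higher power draw, GPUs are usually better suited than CPUs to LLM workloads.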

Original text

gemma3.c is a from‑scratch CPU inference engine for the Gemma 3 4B IT model. It proves that modern LLMs can run without Python, PyTorch, or GPUs.

⚙️ 100% Pure C (C11) – zero external dependencies
🧠 Full Gemma 3 architecture – GQA, hybrid attention, SwiGLU
🗺️ Memory‑mapped weights – BF16 SafeTensors via mmap
🔤 Native SentencePiece tokenizer – 262K vocab
🌊 Streaming output – token‑by‑token callbacks
💬 Interactive chat mode
📦 CLI + Library API
🐧 Linux/macOS native, 🪟 Windows via WSL (recommended) or MinGW

⚠️ POSIX‑first: native on Linux/macOS. On Windows use WSL or MinGW (no mmap).

1️⃣ Download model (recommended)

export HF_TOKEN=your_token_here
python download_model.py

# Single prompt
./gemma3 -m ./gemma-3-4b-it -p "Explain quantum computing simply."

# Interactive chat
./gemma3 -m ./gemma-3-4b-it -i

The included Python script:

Handles HuggingFace auth
Downloads all shards
Resumes broken downloads
Verifies integrity

python download_model.py --token YOUR_HF_TOKEN

Manual alternatives: huggingface-cli or git lfs.

make        # Optimized
make debug  # Debug symbols
make fast   # -march=native -ffast-math
make clean

-m <path>    Model directory
-p <text>    Prompt
-i           Interactive mode
-s <text>    System prompt
-n <n>       Max tokens
-t <f>       Temperature
-k <n>       Top‑k
--top-p <f>  Top‑p
-c <n>       Context size
--seed <n>   RNG seed
-v           Verbose

gemma3_ctx *ctx = gemma3_load_dir("./gemma-3-4b-it");

gemma3_gen_params params = gemma3_default_params();
char *out = gemma3_generate(ctx, "Hello!", &params, NULL, NULL);
printf("%s\n", out);
free(out);

gemma3_free(ctx);

| Param | Value |
|---|---|
| Vocab | 262,208 |
| Layers | 34 |
| Hidden | 2,560 |
| Heads | 8 (4 KV, GQA) |
| Context | 128K |
| Pattern | 5 local : 1 global |

Weights: ~8 GB on disk (BF16)
Runtime RAM: ~3 GB total

Reduce usage:

./gemma3 -m ./gemma-3-4b-it -c 512 -p "Hello"

Prefill: ~2–5 tok/s
Generation: ~1–3 tok/s

Limitations:

CPU only
Text only
No quantization (yet)

MIT License. Model weights under Google’s Gemma license.

If you ever wanted to see Gemma 3 breathe in pure C, this is it.
