Show HN: Andrej Karpathy's microgpt.py to C99 microgpt.c – 4,600x faster

Original link: https://github.com/enjector/microgpt-c

## MicroGPT-C: A Minimal GPT Implementation

MicroGPT-C is a zero-dependency, pure-C99 implementation of a GPT-style character-level language model, modelled on Andrej Karpathy's microgpt.py. It is designed for education, experimentation, and resource-constrained environments. Trained on a dataset of human names, it demonstrates the core principles of a GPT (attention, backpropagation, and the Adam optimiser) without depending on frameworks such as PyTorch or on a GPU.

Key features include a tiny memory footprint (< 50 KB RAM), fast training (1,000 steps in 20 ms) and inference (names generated in microseconds), and a dramatic speed-up over the Python reference implementation (training up to 4,600× faster). Optional compiler-driven SIMD auto-vectorisation raises performance further.

The project provides both floating-point and INT8-quantised builds, the latter cutting weight storage by roughly 8×. It is well suited to students, embedded-systems engineers, and researchers looking for an auditable baseline for model experiments. Source code and build instructions are provided for Linux, macOS, and Windows.

Developer Ajay__soni substantially improved the performance of Andrej Karpathy's recently released `microgpt.py` (a minimal GPT implementation) by rewriting it in pure C99. The resulting `microgpt-c` is a zero-dependency implementation that trains **4,600× faster** on a MacBook Pro M2 Max (2,300× faster on Windows). The speed-up comes from techniques such as **SIMD auto-vectorisation** for faster matrix operations and **INT8 quantisation**, which cuts weight storage by roughly 8×. The project aims to probe the hardware limits of the GPT algorithm and demonstrates how much optimising for the underlying hardware can yield. The code is available on GitHub with full comments, and the developer plans to build practical tools on this efficient foundation, starting with a static analyser for C code.

A zero-dependency, pure C99 implementation of a GPT-style character-level language model.

The algorithm faithfully matches Andrej Karpathy's microgpt.py — same architecture, same training loop, same sampling — but compiles to native code with optional compiler-driven SIMD auto-vectorisation for dramatically faster training and inference.

Train a GPT in 20 ms. Generate names in microseconds. No Python. No PyTorch. No GPU.


MicroGPT-C is a minimal, readable implementation of a GPT (Generative Pre-trained Transformer) — the same family of models behind ChatGPT, but stripped down to its essential algorithm. It trains a tiny character-level language model that learns to generate realistic human names from scratch.

The goal is education and experimentation: understand how attention, backpropagation, and the Adam optimiser actually work at the lowest level, without any framework abstractions.
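To make "character-level" concrete: the model's vocabulary is just individual characters, with each letter of a name mapped to an integer id and a terminator token ending each name. A minimal sketch of such a mapping (the ids and the newline-as-terminator choice are illustrative assumptions, not the repository's exact scheme):

```c
#include <stdio.h>
#include <string.h>

/* Illustrative character-level vocabulary: 'a'..'z' plus '\n' used as a
   name terminator. microgpt.c builds its vocabulary from data/names.txt;
   the exact ids shown here are an assumption. */
enum { VOCAB_SIZE = 27 }; /* 26 letters + terminator */

static int char_to_id(char c) {
    if (c == '\n') return 0;              /* terminator token */
    if (c >= 'a' && c <= 'z') return 1 + (c - 'a');
    return -1;                            /* not in vocabulary */
}

static char id_to_char(int id) {
    return id == 0 ? '\n' : (char)('a' + id - 1);
}

int main(void) {
    const char *name = "emma\n";
    for (size_t i = 0; i < strlen(name); i++)
        printf("'%c' -> %d\n", name[i] == '\n' ? '$' : name[i],
               char_to_id(name[i]));
    (void)id_to_char;
    return 0;
}
```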

| Audience | Value |
|---|---|
| Students & educators | Study attention, softmax, Adam, and backprop in readable C — no framework magic |
| Embedded / edge engineers | Entire model fits in < 50 KB RAM; runs on MCUs with no runtime dependencies |
| Researchers | Auditable baseline for quantisation, custom layers, or optimiser experiments |
| Rapid prototypers | Train → iterate in milliseconds; test tokenisers, vocabularies, data formats |

```
# Linux / macOS
chmod +x build.sh
./build.sh
./build/microgpt

:: Windows
build.bat
build\Release\microgpt.exe
```

The build automatically copies data/names.txt next to the executable.


Measured on the same workload (1,000 training steps, 20 inference samples) — C vs the reference Python:

| Metric | Python | C (fp64) | Speedup |
|---|---|---|---|
| Training time | ~93 s | 0.02 s | ~4,600× |
| Training throughput | ~0.1 k tok/s | ~289 k tok/s | ~2,800× |
| Steps/sec | ~11 | ~40,000 | ~3,600× |
| Inference time | ~0.74 s | < 1 ms | ~700×+ |
| Inference rate | ~27 samples/s | 20,000 samples/s | ~740× |
| Token throughput | — | 109,000 tok/s | — |

INT8 quantised build: ~25% slower training than fp64 on this tiny model, but ~8× smaller weight storage — ideal for constrained devices.


A single-layer, decoder-only Transformer following the GPT-2 design:

```
Input → Token Embed + Pos Embed → RMSNorm
  → Self-Attention (4 heads, causal) → Residual
  → RMSNorm → MLP (fc1 → ReLU → fc2, 4× width) → Residual
  → Linear (lm_head) → Softmax → next-token probabilities
```
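The two RMSNorm steps in the diagram are a good example of how little code each component needs. A minimal sketch, computing y[i] = g[i] * x[i] / sqrt(mean(x^2) + eps); the learned gain vector g and the epsilon value are assumptions, not the repository's exact function:

```c
#include <math.h>

/* RMSNorm over a vector of length n: scale each element by the reciprocal
   root-mean-square of the whole vector, then by a learned per-element
   gain g. eps guards against division by zero. Illustrative sketch. */
static void rmsnorm(double *y, const double *x, const double *g,
                    int n, double eps) {
    double ss = 0.0;
    for (int i = 0; i < n; i++)
        ss += x[i] * x[i];
    const double inv_rms = 1.0 / sqrt(ss / n + eps);
    for (int i = 0; i < n; i++)
        y[i] = g[i] * x[i] * inv_rms;
}
```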
| Parameter | Value |
|---|---|
| Embedding dim | 16 |
| Attention heads | 4 |
| Layers | 1 |
| Context length | 16 |
| Total parameters | ~4,600 |
| Weight memory (fp64) | ~37 KB |
| Weight memory (INT8) | ~4.6 KB |
| Training memory | ~144 KB |
| Inference memory | < 50 KB |
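These hyperparameters map naturally onto compile-time constants. A hypothetical sketch of such a configuration (the actual identifiers in microgpt.h may differ):

```c
/* Hypothetical configuration constants mirroring the table above;
   the real identifiers in microgpt.h may differ. */
enum {
    N_EMBD     = 16, /* embedding dimension     */
    N_HEAD     = 4,  /* attention heads         */
    N_LAYER    = 1,  /* transformer blocks      */
    BLOCK_SIZE = 16  /* context length (tokens) */
};
```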

Training uses the Adam optimiser with linear learning-rate decay (configurable in microgpt.h).
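A minimal sketch of that update rule, Adam moments plus a linearly decayed step size; the hyperparameter values below are common defaults and an assumption, not necessarily those in microgpt.h:

```c
#include <math.h>

/* One Adam step for a parameter vector of length n, with a learning rate
   decaying linearly from lr_max to zero over total_steps. m and v are the
   first/second moment buffers; t is the 1-based step count.
   Illustrative sketch with assumed hyperparameter names. */
static void adam_step(double *w, const double *grad, double *m, double *v,
                      int n, int t, int total_steps, double lr_max) {
    const double beta1 = 0.9, beta2 = 0.999, eps = 1e-8;
    const double lr = lr_max * (1.0 - (double)t / (double)total_steps);
    for (int i = 0; i < n; i++) {
        m[i] = beta1 * m[i] + (1.0 - beta1) * grad[i];
        v[i] = beta2 * v[i] + (1.0 - beta2) * grad[i] * grad[i];
        const double mhat = m[i] / (1.0 - pow(beta1, t));
        const double vhat = v[i] / (1.0 - pow(beta2, t));
        w[i] -= lr * mhat / (sqrt(vhat) + eps);
    }
}
```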


Build scripts (recommended)

| Platform | Standard | SIMD (faster) |
|---|---|---|
| Linux/macOS | ./build.sh | ./build.sh --simd |
| Windows | build.bat | build.bat simd |

The --simd flag enables compiler-driven auto-vectorisation of the core dot products, matrix multiplications, and normalisations. On x86-64 the compiler targets the best available instruction set (SSE4, AVX2, etc.) via -march=native; on MSVC it enables /arch:AVX2. This gives a measurable speed-up on larger models without any hand-written intrinsics — the compiler rewrites the scalar loops into SIMD instructions automatically.
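The loops in question look like the plain scalar dot product below; written this way (a simple counted loop over contiguous, non-aliasing arrays), the optimiser can emit packed SIMD instructions for it. Note that fully vectorising a floating-point reduction usually also needs relaxed FP semantics (e.g. -ffast-math) or an explicit pragma. A representative sketch, not the repository's exact kernel:

```c
/* A scalar dot product shaped for auto-vectorisation: counted loop,
   contiguous arrays, restrict to rule out aliasing. With -O3 and
   -march=native (and relaxed FP semantics for the reduction) this
   typically compiles to packed multiply-add instructions. */
static double dot(const double *restrict a, const double *restrict b,
                  int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```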

```
# Linux / macOS — auto-detect best ISA
./build.sh --simd

# CMake directly
cmake -DMICROGPT_SIMD=ON ..
cmake --build . --config Release
```

Weights are stored as 8-bit integers with per-matrix scales — the forward pass dequantises on the fly; Adam updates an fp64 master copy and requantises each step. This reduces weight storage by ~8× (37 KB → 4.6 KB) at a small accuracy/speed trade-off.
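A minimal sketch of that per-matrix scheme: symmetric INT8 quantisation with a single scale per matrix, and on-the-fly dequantisation for the forward pass. Function names and layout are illustrative assumptions:

```c
#include <math.h>
#include <stdint.h>

/* Symmetric per-matrix INT8 quantisation: scale = max|w| / 127, so
   q = round(w / scale) fits in [-127, 127] and w is approximately
   q * scale. Illustrative sketch of the scheme described above. */
static double quantise_int8(int8_t *q, const double *w, int n) {
    double amax = 0.0;
    for (int i = 0; i < n; i++)
        if (fabs(w[i]) > amax) amax = fabs(w[i]);
    const double scale = (amax > 0.0) ? amax / 127.0 : 1.0;
    for (int i = 0; i < n; i++)
        q[i] = (int8_t)lrint(w[i] / scale);
    return scale; /* stored alongside the matrix */
}

/* The forward pass dequantises on the fly: */
static inline double dequantise_int8(int8_t q, double scale) {
    return (double)q * scale;
}
```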

| Platform | Standard | SIMD |
|---|---|---|
| Linux/macOS | ./build_quantised.sh | ./build_quantised.sh --simd |
| Windows | build_quantised.bat | build_quantised.bat simd |

```
mkdir build && cd build
cmake ..
cmake --build . --config Release

# With INT8 quantisation
cmake -DQUANTIZATION_INT8=ON ..

# With SIMD auto-vectorisation
cmake -DMICROGPT_SIMD=ON ..

# Both
cmake -DQUANTIZATION_INT8=ON -DMICROGPT_SIMD=ON ..
```

| Path | Description |
|---|---|
| microgpt.h | Model config, public API declarations |
| microgpt.c | Core engine: model, forward/backward, Adam, data loading |
| main.c | Entry point: load data → train → generate samples |
| microgpt_amalgamated.c | Single-file build — same algorithm, no header needed |
| data/names.txt | Training data (one name per line, ~32k names) |
| CMakeLists.txt | CMake build (C99, Release, optional SIMD / INT8) |

microgpt_amalgamated.c is a self-contained single file containing the full GPT algorithm — data loading, training, and inference. No header file needed:

```
# Compile directly (no CMake required)
cc -O2 -o microgpt microgpt_amalgamated.c -lm
cp data/names.txt . && ./microgpt

# Or via CMake
cmake --build build --config Release --target microgpt_amalgamated
./build/microgpt_amalgamated
```

  • C99 compiler (GCC, Clang, MSVC)
  • CMake 3.10+
  • No other dependencies

MIT — see LICENSE and source file headers.

Author: Ajay Soni ([email protected]), Enjector Software Ltd.
