Show HN: Andrej Karpathy's microgpt.py to C99 microgpt.c – 4,600x faster

Original link: https://github.com/enjector/microgpt-c

## MicroGPT-C: A Minimal GPT Implementation

MicroGPT-C is a zero-dependency, pure-C99 implementation of a GPT-style character-level language model, modelled on Andrej Karpathy's microgpt.py. It is designed for education, experimentation, and resource-constrained environments. Trained on a dataset of human names, it demonstrates the core principles of a GPT (attention, backpropagation, and the Adam optimiser) without depending on frameworks such as PyTorch or on a GPU.

Key features include a tiny memory footprint (< 50 KB RAM), fast training (1,000 steps in 20 ms) and inference (names generated in microseconds), and a dramatic speed-up over the Python reference implementation (training up to 4,600× faster). Optional compiler-driven SIMD auto-vectorisation raises performance further.

The project provides both floating-point and INT8-quantised builds, the latter cutting weight storage by roughly 8×. It is well suited to students, embedded-systems engineers, and researchers looking for an auditable baseline for model experiments. Source code and build instructions are provided for Linux, macOS, and Windows.

Developer Ajay__soni substantially improved the performance of Andrej Karpathy's recently released `microgpt.py` (a minimal GPT implementation) by rewriting it in pure C99. The resulting `microgpt-c` is a zero-dependency implementation that trains **4,600× faster** on a MacBook Pro M2 Max (2,300× faster on Windows). The speed-up comes from techniques such as **SIMD auto-vectorisation** for faster matrix operations and **INT8 quantisation**, which cuts weight storage by roughly 8×. The project aims to probe the hardware limits of the GPT algorithm and demonstrates how much optimising for the underlying hardware can yield. The code is available on GitHub with full comments, and the developer plans to build practical tools on this efficient foundation, starting with a static analyser for C code.

A zero-dependency, pure C99 implementation of a GPT-style character-level language model.

The algorithm faithfully matches Andrej Karpathy's microgpt.py — same architecture, same training loop, same sampling — but compiles to native code with optional compiler-driven SIMD auto-vectorisation for dramatically faster training and inference.

Train a GPT in 20 ms. Generate names in microseconds. No Python. No PyTorch. No GPU.


MicroGPT-C is a minimal, readable implementation of a GPT (Generative Pre-trained Transformer) — the same family of models behind ChatGPT, but stripped down to its essential algorithm. It trains a tiny character-level language model that learns to generate realistic human names from scratch.

The goal is education and experimentation: understand how attention, backpropagation, and the Adam optimiser actually work at the lowest level, without any framework abstractions.
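To make "character-level" concrete: the model's vocabulary is just individual characters, with each letter of a name mapped to an integer id and a terminator token ending each name. A minimal sketch of such a mapping (the ids and the newline-as-terminator choice are illustrative assumptions, not the repository's exact scheme):

```c
#include <stdio.h>
#include <string.h>

/* Illustrative character-level vocabulary: 'a'..'z' plus '\n' used as a
   name terminator. microgpt.c builds its vocabulary from data/names.txt;
   the exact ids shown here are an assumption. */
enum { VOCAB_SIZE = 27 }; /* 26 letters + terminator */

static int char_to_id(char c) {
    if (c == '\n') return 0;              /* terminator token */
    if (c >= 'a' && c <= 'z') return 1 + (c - 'a');
    return -1;                            /* not in vocabulary */
}

static char id_to_char(int id) {
    return id == 0 ? '\n' : (char)('a' + id - 1);
}

int main(void) {
    const char *name = "emma\n";
    for (size_t i = 0; i < strlen(name); i++)
        printf("'%c' -> %d\n", name[i] == '\n' ? '$' : name[i],
               char_to_id(name[i]));
    (void)id_to_char;
    return 0;
}
```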

| Audience | Value |
|---|---|
| Students & educators | Study attention, softmax, Adam, and backprop in readable C — no framework magic |
| Embedded / edge engineers | Entire model fits in < 50 KB RAM; runs on MCUs with no runtime dependencies |
| Researchers | Auditable baseline for quantisation, custom layers, or optimiser experiments |
| Rapid prototypers | Train → iterate in milliseconds; test tokenisers, vocabularies, data formats |

```
# Linux / macOS
chmod +x build.sh
./build.sh
./build/microgpt

:: Windows
build.bat
build\Release\microgpt.exe
```

The build automatically copies data/names.txt next to the executable.


Measured on the same workload (1,000 training steps, 20 inference samples) — C vs the reference Python:

| Metric | Python | C (fp64) | Speedup |
|---|---|---|---|
| Training time | ~93 s | 0.02 s | ~4,600× |
| Training throughput | ~0.1 k tok/s | ~289 k tok/s | ~2,800× |
| Steps/sec | ~11 | ~40,000 | ~3,600× |
| Inference time | ~0.74 s | < 1 ms | ~700×+ |
| Inference rate | ~27 samples/s | 20,000 samples/s | ~740× |
| Token throughput | — | 109,000 tok/s | — |

INT8 quantised build: ~25% slower training than fp64 on this tiny model, but ~8× smaller weight storage — ideal for constrained devices.


A single-layer, decoder-only Transformer following the GPT-2 design:

```
Input → Token Embed + Pos Embed → RMSNorm
  → Self-Attention (4 heads, causal) → Residual
  → RMSNorm → MLP (fc1 → ReLU → fc2, 4× width) → Residual
  → Linear (lm_head) → Softmax → next-token probabilities
```
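The two RMSNorm steps in the diagram are a good example of how little code each component needs. A minimal sketch, computing y[i] = g[i] * x[i] / sqrt(mean(x^2) + eps); the learned gain vector g and the epsilon value are assumptions, not the repository's exact function:

```c
#include <math.h>

/* RMSNorm over a vector of length n: scale each element by the reciprocal
   root-mean-square of the whole vector, then by a learned per-element
   gain g. eps guards against division by zero. Illustrative sketch. */
static void rmsnorm(double *y, const double *x, const double *g,
                    int n, double eps) {
    double ss = 0.0;
    for (int i = 0; i < n; i++)
        ss += x[i] * x[i];
    const double inv_rms = 1.0 / sqrt(ss / n + eps);
    for (int i = 0; i < n; i++)
        y[i] = g[i] * x[i] * inv_rms;
}
```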
| Parameter | Value |
|---|---|
| Embedding dim | 16 |
| Attention heads | 4 |
| Layers | 1 |
| Context length | 16 |
| Total parameters | ~4,600 |
| Weight memory (fp64) | ~37 KB |
| Weight memory (INT8) | ~4.6 KB |
| Training memory | ~144 KB |
| Inference memory | < 50 KB |
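These hyperparameters map naturally onto compile-time constants. A hypothetical sketch of such a configuration (the actual identifiers in microgpt.h may differ):

```c
/* Hypothetical configuration constants mirroring the table above;
   the real identifiers in microgpt.h may differ. */
enum {
    N_EMBD     = 16, /* embedding dimension     */
    N_HEAD     = 4,  /* attention heads         */
    N_LAYER    = 1,  /* transformer blocks      */
    BLOCK_SIZE = 16  /* context length (tokens) */
};
```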

Training uses the Adam optimiser with linear learning-rate decay (configurable in microgpt.h).
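A minimal sketch of that update rule, Adam moments plus a linearly decayed step size; the hyperparameter values below are common defaults and an assumption, not necessarily those in microgpt.h:

```c
#include <math.h>

/* One Adam step for a parameter vector of length n, with a learning rate
   decaying linearly from lr_max to zero over total_steps. m and v are the
   first/second moment buffers; t is the 1-based step count.
   Illustrative sketch with assumed hyperparameter names. */
static void adam_step(double *w, const double *grad, double *m, double *v,
                      int n, int t, int total_steps, double lr_max) {
    const double beta1 = 0.9, beta2 = 0.999, eps = 1e-8;
    const double lr = lr_max * (1.0 - (double)t / (double)total_steps);
    for (int i = 0; i < n; i++) {
        m[i] = beta1 * m[i] + (1.0 - beta1) * grad[i];
        v[i] = beta2 * v[i] + (1.0 - beta2) * grad[i] * grad[i];
        const double mhat = m[i] / (1.0 - pow(beta1, t));
        const double vhat = v[i] / (1.0 - pow(beta2, t));
        w[i] -= lr * mhat / (sqrt(vhat) + eps);
    }
}
```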


Build scripts (recommended)

| Platform | Standard | SIMD (faster) |
|---|---|---|
| Linux/macOS | ./build.sh | ./build.sh --simd |
| Windows | build.bat | build.bat simd |

The --simd flag enables compiler-driven auto-vectorisation of the core dot products, matrix multiplications, and normalisations. On x86-64 the compiler targets the best available instruction set (SSE4, AVX2, etc.) via -march=native; on MSVC it enables /arch:AVX2. This gives a measurable speed-up on larger models without any hand-written intrinsics — the compiler rewrites the scalar loops into SIMD instructions automatically.
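The loops in question look like the plain scalar dot product below; written this way (a simple counted loop over contiguous, non-aliasing arrays), the optimiser can emit packed SIMD instructions for it. Note that fully vectorising a floating-point reduction usually also needs relaxed FP semantics (e.g. -ffast-math) or an explicit pragma. A representative sketch, not the repository's exact kernel:

```c
/* A scalar dot product shaped for auto-vectorisation: counted loop,
   contiguous arrays, restrict to rule out aliasing. With -O3 and
   -march=native (and relaxed FP semantics for the reduction) this
   typically compiles to packed multiply-add instructions. */
static double dot(const double *restrict a, const double *restrict b,
                  int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```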

```
# Linux / macOS — auto-detect best ISA
./build.sh --simd

# CMake directly
cmake -DMICROGPT_SIMD=ON ..
cmake --build . --config Release
```

Weights are stored as 8-bit integers with per-matrix scales — the forward pass dequantises on the fly; Adam updates an fp64 master copy and requantises each step. This reduces weight storage by ~8× (37 KB → 4.6 KB) at a small accuracy/speed trade-off.
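A minimal sketch of that per-matrix scheme: symmetric INT8 quantisation with a single scale per matrix, and on-the-fly dequantisation for the forward pass. Function names and layout are illustrative assumptions:

```c
#include <math.h>
#include <stdint.h>

/* Symmetric per-matrix INT8 quantisation: scale = max|w| / 127, so
   q = round(w / scale) fits in [-127, 127] and w is approximately
   q * scale. Illustrative sketch of the scheme described above. */
static double quantise_int8(int8_t *q, const double *w, int n) {
    double amax = 0.0;
    for (int i = 0; i < n; i++)
        if (fabs(w[i]) > amax) amax = fabs(w[i]);
    const double scale = (amax > 0.0) ? amax / 127.0 : 1.0;
    for (int i = 0; i < n; i++)
        q[i] = (int8_t)lrint(w[i] / scale);
    return scale; /* stored alongside the matrix */
}

/* The forward pass dequantises on the fly: */
static inline double dequantise_int8(int8_t q, double scale) {
    return (double)q * scale;
}
```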

| Platform | Standard | SIMD |
|---|---|---|
| Linux/macOS | ./build_quantised.sh | ./build_quantised.sh --simd |
| Windows | build_quantised.bat | build_quantised.bat simd |

```
mkdir build && cd build
cmake ..
cmake --build . --config Release

# With INT8 quantisation
cmake -DQUANTIZATION_INT8=ON ..

# With SIMD auto-vectorisation
cmake -DMICROGPT_SIMD=ON ..

# Both
cmake -DQUANTIZATION_INT8=ON -DMICROGPT_SIMD=ON ..
```

| Path | Description |
|---|---|
| microgpt.h | Model config, public API declarations |
| microgpt.c | Core engine: model, forward/backward, Adam, data loading |
| main.c | Entry point: load data → train → generate samples |
| microgpt_amalgamated.c | Single-file build — same algorithm, no header needed |
| data/names.txt | Training data (one name per line, ~32k names) |
| CMakeLists.txt | CMake build (C99, Release, optional SIMD / INT8) |

microgpt_amalgamated.c is a self-contained single file containing the full GPT algorithm — data loading, training, and inference. No header file needed:

```
# Compile directly (no CMake required)
cc -O2 -o microgpt microgpt_amalgamated.c -lm
cp data/names.txt . && ./microgpt

# Or via CMake
cmake --build build --config Release --target microgpt_amalgamated
./build/microgpt_amalgamated
```

  • C99 compiler (GCC, Clang, MSVC)
  • CMake 3.10+
  • No other dependencies

MIT — see LICENSE and source file headers.

Author: Ajay Soni ([email protected]), Enjector Software Ltd.
