Show HN: NanoEuler – GPT-2 scale model in pure C/CUDA from scratch

Original link: https://github.com/JustVugg/nanoeuler

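This project is an educational, from-scratch implementation of a GPT-2-class language model, written entirely in C and CUDA. It drops all machine-learning frameworks, libraries, and autograd systems in favor of hand-written forward and backward passes, a custom byte-level BPE tokenizer, and a complete training pipeline. The repository contains two main implementations: a CPU-based version for small-scale demonstration and a high-performance CUDA engine. The GPU implementation includes hand-written kernels for operations such as FlashAttention, RoPE, and SwiGLU, all validated against a CPU reference implementation with strict gradient checks. The model follows a standard decoder-only Transformer architecture (RMSNorm, GQA, MTP) and frames each residual block as a discrete Euler step of a continuous flow. The pipeline supports pretraining on a corpus of books and high-quality web text, plus supervised fine-tuning (SFT) to create a chat assistant. Although the resulting ~116M-parameter model can generate fluent English and follow the instruction format, it is essentially a research artifact rather than a functional chatbot; it demonstrates the end-to-end mechanics of modern LLM engineering without external "black box" dependencies.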

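A developer has released a custom large language model called "NanoEuler", built entirely from scratch in pure C and CUDA. Driven by a desire to understand AI architecture and GPU optimization in depth, the developer started the project to move beyond simple LLM API calls and explore how parameters, data, and model growth relate. The project began with a 23M-parameter model trained on the works of Shakespeare; the developer implemented training and inference without high-level frameworks to gain a firmer grasp of the details of model development and supervised fine-tuning (SFT) for chatbots. The project is now open source on GitHub, and the developer is seeking community feedback and technical advice. Its initial reception on Hacker News was mixed; one user questioned its unusual coding style and the reliability of the CUDA implementation, pointing out that a comment in the source code indicates the code has not been tested.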

Original text

A GPT-2-class language model built entirely from scratch in C/CUDA — no PyTorch, no autograd, no ML libraries. The forward and backward passes are written and verified by hand, and the whole training pipeline lives in this repo: a hand-written byte-level BPE tokenizer, pretraining on a books + web corpus, and supervised fine-tuning into a chat model (RLHF/DPO planned). It runs on CPU (libm + OpenMP) for a small showcase model, and a full from-scratch CUDA engine — cuBLAS matmuls, a hand-written FlashAttention, validated against a CPU reference by a full-model gradient check — trains a ~116M-parameter model on a single RTX 4070.

Status & honesty. This is a research/educational artifact, built in public. At ~116M parameters trained on a single consumer GPU, it is a text generator in the spirit of GPT-2-small: fluent-ish English, no real world knowledge. It is not a capable assistant — the chat model demonstrates that the pretrain→SFT pipeline works end to end, it is not a useful chatbot. The point of the project is the from-scratch engineering and the complete, understandable training pipeline.

make check              # verify the backward pass (gradient check, double precision)
make                    # build the training binary
./nanoeuler train       # train the small showcase model (~0.76M params)
./nanoeuler train big   # train the larger model (~10M params; meant for a GPU)
./nanoeuler chat        # REPL: type a prompt, the model continues it

A residual block computes

x = x + f(x)

Read it as a step of numerical integration. The forward-Euler method advances an ordinary differential equation dx/dt = f(x) by

x(t+Δt) = x(t) + Δt · f(x(t))

With step size Δt = 1 this is exactly the residual update. So a deep residual network is a discretized ODE: depth is integration time, and each layer integrates the hidden state forward by one Euler step. This is the view behind work like Neural ODEs (a ResNet is the Euler discretization of a continuous flow). The project is named after Leonhard Euler, who gave us that integration method.
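The correspondence above can be sketched in a few lines of C. This is a minimal illustration, not code from the repository: the names `vector_field`, `euler_step`, `residual_update` and the toy field `decay_field` are all hypothetical. It shows that a residual update is the forward-Euler step with the step size fixed to 1.

```c
#include <assert.h>
#include <stddef.h>

/* A vector field f: writes f(x) into out; both arrays have n entries.
   (Hypothetical type, introduced only for this illustration.) */
typedef void (*vector_field)(const double *x, double *out, size_t n);

/* One forward-Euler step, in place: x <- x + dt * f(x).
   scratch must have room for n doubles. */
void euler_step(double *x, vector_field f, double dt, double *scratch, size_t n)
{
    assert(x && f && scratch);
    f(x, scratch, n);                 /* evaluate f at the current state first */
    for (size_t i = 0; i < n; i++)
        x[i] += dt * scratch[i];
}

/* A residual "layer" is the same step with dt = 1: x <- x + f(x). */
void residual_update(double *x, vector_field f, double *scratch, size_t n)
{
    euler_step(x, f, 1.0, scratch, n);
}

/* Toy vector field for the example: f(x) = -0.5 * x. */
void decay_field(const double *x, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = -0.5 * x[i];
}
```

In a transformer block, f would be attn(rmsnorm(x)) or swiglu(rmsnorm(x)) rather than a toy field, and stacking layers corresponds to taking successive steps.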

A sample from the ~116M model after a partial pretraining run on the books + web corpus (prompt: "Alessandro eat a"):

Alessandro eat a icing textile: the satisfied by the servants in order to keep your weight
[Using to a heated, collaborated young people that attend the metric process where the rank
is authorized and to contain the sedentary. Some state lawyers were able to insert ...

The content is not meaningful, but notice what it learned on its own: real grammar, long clauses, and an encyclopedic register picked up from the web data. This is the expected behaviour of a small model trained on a single GPU — fluent shape, shallow substance. More training and (far) more data improve fluency; world knowledge needs scale this project does not pretend to have.

Decoder-only transformer with the building blocks common to current models:

RMSNorm (pre-norm, no bias)
Rotary position embeddings (RoPE) applied to queries and keys
SwiGLU feed-forward: down(silu(gate(x)) * up(x))
Grouped-query attention (GQA): query heads share a smaller set of key/value heads
Multi-token prediction (MTP): K output heads predict the next K tokens; the auxiliary heads improve the learned representation and enable speculative decoding. Generation uses head 0.
No biases anywhere.
Byte-level BPE tokenizer, hand-written, with GPT-2-style pretokenization (a single leading space attaches to the following word, so spaces are not wasted as standalone tokens). Merges are learned on a sample of the corpus; the GPU model uses a 4096-token vocabulary (~3.4 bytes/token on English).

Each block is x = x + attn(rmsnorm(x)) followed by x = x + swiglu(rmsnorm(x)). A residual connection x = x + f(x) is one step of the forward-Euler method for the ODE dx/dt = f(x) — hence the name, and a nod to Leonhard Euler.

Configurations:

| Where | dim | q/kv heads | layers | context | vocab | params |
|---|---|---|---|---|---|---|
| `small` (CPU, `nanoeuler.c`) | 128 | 4 / 2 | 4 | 128 | 512 | ~1.05M |
| GPU pipeline (`cuda/`, `run_train`) | 768 | 12 / 4 | 16 | 512 | 4096 | ~116M |

The CPU small model trains in a few hours on 12 cores and is a self-contained showcase. The ~116M GPU model is the real pipeline: it pretrains on the books + web mix and is then fine-tuned into a chat model (see below). The head size is 64 (768/12), which fits the FlashAttention kernel.
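The head arithmetic for these configurations can be written out as a small C sketch. The struct and function names are hypothetical, and the mapping from query heads to key/value heads assumes that consecutive query heads share one K/V head, which is a common GQA layout but is not confirmed by the text above.

```c
#include <assert.h>

/* Shape bookkeeping for grouped-query attention (hypothetical helper type). */
typedef struct {
    int dim;         /* model width */
    int n_q_heads;   /* number of query heads */
    int n_kv_heads;  /* number of key/value heads */
} gqa_config;

/* Size of one attention head. */
int gqa_head_dim(gqa_config c)
{
    assert(c.dim % c.n_q_heads == 0);
    return c.dim / c.n_q_heads;
}

/* Width of the key (or value) projection output: fewer heads than queries. */
int gqa_kv_dim(gqa_config c)
{
    return gqa_head_dim(c) * c.n_kv_heads;
}

/* How many query heads share each key/value head. */
int gqa_group_size(gqa_config c)
{
    assert(c.n_q_heads % c.n_kv_heads == 0);
    return c.n_q_heads / c.n_kv_heads;
}

/* Which K/V head query head q reads (assumed contiguous grouping). */
int gqa_kv_head_for(gqa_config c, int q)
{
    assert(q >= 0 && q < c.n_q_heads);
    return q / gqa_group_size(c);
}
```

For the GPU row (768, 12 / 4) this gives a head size of 64, a key/value width of 256, and three query heads per key/value head; for the `small` row (128, 4 / 2) the head size is 32.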

Hand-written back-propagation is easy to get subtly wrong, so every analytic gradient is compared against a central finite difference. The check runs in double precision so that floating-point cancellation in the finite difference does not make correct gradients look wrong:

$ make check
  tok      : max rel err 1.02e-04
  qkvw     : max rel err 7.20e-07
  gatew    : max rel err 6.86e-08
  ...
max relative error: 1.02e-04
>>> backward OK (error < 1e-2)

Every parameter tensor is checked, including the less obvious backward passes of RoPE, SwiGLU, GQA, and MTP.
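The idea behind such a check can be sketched as follows. This is a generic central-difference gradient check in double precision, not the repository's `make check` code: the names `loss_fn`, `grad_check` and `toy_loss`, the relative-error formula, and the step size passed by the caller are all assumptions made for the illustration.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Scalar loss as a function of a flat parameter vector (hypothetical type). */
typedef double (*loss_fn)(const double *params, size_t n);

/* Compares an analytic gradient against central finite differences,
   (L(p + h) - L(p - h)) / (2h), one parameter at a time.
   Returns the largest relative error seen. params is perturbed
   temporarily and restored before returning. */
double grad_check(loss_fn loss, double *params, const double *analytic,
                  size_t n, double h)
{
    assert(h > 0.0);
    double max_rel = 0.0;
    for (size_t i = 0; i < n; i++) {
        double saved = params[i];
        params[i] = saved + h;
        double loss_plus = loss(params, n);
        params[i] = saved - h;
        double loss_minus = loss(params, n);
        params[i] = saved;                      /* restore the parameter */

        double numeric = (loss_plus - loss_minus) / (2.0 * h);
        double denom = fabs(numeric) + fabs(analytic[i]) + 1e-12;
        double rel = fabs(numeric - analytic[i]) / denom;
        if (rel > max_rel)
            max_rel = rel;
    }
    return max_rel;
}

/* Toy loss for the example: L(p) = sum p_i^2, whose gradient is 2 * p_i. */
double toy_loss(const double *p, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += p[i] * p[i];
    return total;
}
```

A correct analytic gradient gives a tiny maximum relative error, while a wrong entry (for example a zero where the true gradient is 4) pushes the error toward 1, which is what makes the check useful for catching backward-pass bugs.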

make builds with -O3 -march=native -ffast-math -fopenmp. Matrix multiplies and attention are parallelized with OpenMP and vectorized; on a 12-core machine the training loop uses all cores. make check builds a separate double-precision binary used only for the gradient check.
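As a rough picture of what an OpenMP-parallel matrix multiply looks like in plain C, here is a minimal sketch. It is not taken from `nanoeuler.c`: the function name and the weight layout (one contiguous row of length k per output feature) are assumptions chosen so that the inner loop walks both operands contiguously, which is the access pattern compilers vectorize most readily.

```c
#include <assert.h>
#include <stddef.h>

/* out (m x n) = a (m x k) times the transpose of w (n x k), all row-major.
   Each row of w holds the weights of one output feature (assumed layout).
   Rows of the output are independent, so the outer loop is split across
   threads when compiled with -fopenmp; without that flag the pragma is
   ignored and the loop runs serially. */
void matmul(float *out, const float *a, const float *w, int m, int k, int n)
{
    assert(out && a && w);
    #pragma omp parallel for
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            /* contiguous dot product over k */
            for (int p = 0; p < k; p++)
                acc += a[(size_t)i * k + p] * w[(size_t)j * k + p];
            out[(size_t)i * n + j] = acc;
        }
    }
}
```

Vectorizing a floating-point reduction like the inner loop requires the compiler to reorder additions, which is one effect of the -ffast-math flag mentioned above.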

No external dependencies. Tested with gcc 13 on Linux.

This is a from-scratch text generator and a complete, understandable training pipeline — not a product. A model of this size trained on one GPU produces fluent-looking English with little real knowledge; the fine-tuned chat model answers in assistant form but its content is shallow. A usable conversational model needs orders of magnitude more parameters, data and compute (a ~135M model only becomes a basic assistant after ~600B training tokens; this repo trains on a far smaller corpus on a single GPU). The goal is to own every piece — every parameter, every gradient, the tokenizer, the kernels, the pretraining and the fine-tuning.

cuda/nanoeuler_cuda.cu is a full from-scratch CUDA port — forward, backward, training and inference on the GPU. Every kernel is validated on the device against a CPU reference, and the whole model has a GPU gradient check (GPU grads vs CPU grads to ~1e-6).

Kernels: matmul (delegated to cuBLAS with TF32 tensor cores), RMSNorm, RoPE, grouped-query attention with a hand-written FlashAttention (tiled, online softmax, no T×T matrix in memory), SwiGLU, softmax/cross-entropy and AdamW. FlashAttention made the training step about 3× faster.

Build (RTX 40-series = Ada = sm_89; the -Xcompiler options avoid a gcc internal compiler error (ICE) on the large file):

cd cuda
nvcc -O3 -arch=sm_89 -Xcompiler -fno-tree-reassoc,-fno-tree-copy-prop nanoeuler_cuda.cu -o nanoeuler_cuda -lcublas

Modes:

./nanoeuler_cuda              # run all kernel self-tests (GPU vs CPU)
./nanoeuler_cuda g            # full-model gradient check (GPU grads vs CPU)
./nanoeuler_cuda t            # pretrain from scratch, checkpoint to ../nanoeuler.bin every 5k steps
./nanoeuler_cuda tr           # resume pretraining from the latest ../nanoeuler.bin checkpoint
./nanoeuler_cuda i "It was"   # autoregressive generation on GPU
./nanoeuler_cuda s            # supervised fine-tune on Alpaca, save ../nanoeuler_chat.bin
./nanoeuler_cuda c            # interactive chat with the fine-tuned model

Training checkpoints every 5000 steps, so a long run can be stopped (Ctrl-C) and resumed with tr. A model trained on the GPU is saved in the CPU program's format, so ./nanoeuler chat can also load and run it.

Chatbot: pretrain then fine-tune (SFT)

The chat pipeline is two stages. First pretrain the ~116M base on the books + web mix (./nanoeuler_cuda t, resumable with tr). Then supervised fine-tuning turns it into an assistant: ./nanoeuler_cuda s loads the pretrained base, renders each Alpaca example with the standard instruction template, and trains with the loss masked to the response tokens only (prompt and padding positions get a target of -1, which the cross-entropy kernel turns into zero gradient). The result is saved to nanoeuler_chat.bin; ./nanoeuler_cuda c then wraps each line you type in the same template and samples a reply, stopping at the </s> end marker.

After fine-tuning the model answers in the right shape — it follows the instruction→response format, writes complete sentences and stops on its own. The content, though, is shallow and often wrong: this is a small model trained on a single GPU, so it has little world knowledge to express. SFT teaches the model how to respond, not what it knows — that comes from pretraining and scale. This is a faithful, fully-from-scratch demonstration that the pretrain→SFT pipeline works end to end, not a capable assistant.

Pretraining uses a real books + web mix:

Books — data/get_gutenberg.sh downloads ~95 public-domain Project Gutenberg classics (Austen, Dickens, Dostoevsky, Tolstoy, Melville, the complete Shakespeare, ...). Each book's Project Gutenberg license header/footer is stripped (only the text between the *** START ... *** / *** END ... *** markers is kept) so the model trains on prose.
Web — data/get_web.sh pulls a slice of FineWeb-Edu (high-quality educational web text) straight from the Hugging Face parquet files using the DuckDB CLI (a single static binary — no Python, no libraries).

Then concatenate them into the pretraining corpus the trainer reads:

sh data/get_gutenberg.sh                       # books  -> data/gutenberg.txt
sh data/get_web.sh                             # web    -> data/web.txt (~1 GB by default)
cat data/gutenberg.txt data/web.txt > data/pretrain.txt
sh data/get_alpaca.sh                          # instruction data for SFT -> data/alpaca.json

Corpora and model checkpoints are git-ignored (regenerable).
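The Gutenberg cleaning step described above (keep only the text between the *** START ... *** and *** END ... *** markers) is done by a shell script in the repository. Purely as an illustration of the same idea, here is a C sketch; the function name `gutenberg_body` is hypothetical and the marker matching is simplified to the literal prefixes "*** START" and "*** END".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Locates the body of a Project Gutenberg text held in memory.
   Returns a pointer to the first character after the "*** START" line
   and stores the body length (up to the "*** END" marker) in *len.
   Returns NULL if either marker is missing. The input is not modified. */
const char *gutenberg_body(const char *text, size_t *len)
{
    assert(text && len);
    const char *start = strstr(text, "*** START");
    if (!start)
        return NULL;
    const char *body = strchr(start, '\n');   /* skip the rest of the marker line */
    if (!body)
        return NULL;
    body++;
    const char *end = strstr(body, "*** END");
    if (!end)
        return NULL;
    *len = (size_t)(end - body);
    return body;
}
```

A caller would read a downloaded book into a buffer, call this function, and append the returned span to the corpus file, skipping any book for which it returns NULL.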

✅ Hand-written byte-level BPE with GPT-2-style pretokenization.
✅ From-scratch CUDA engine (cuBLAS + FlashAttention), validated by a full-model gradient check.
✅ Pretraining on a books + web mix, with checkpoint/resume.
✅ Supervised fine-tuning (Alpaca) with response-masked loss → a chat model.
⏳ DPO (preference optimization) — the alignment stage, next to build.
⏳ Scale the model and data (toward ~270M) and publish a trained checkpoint people can try.

nanoeuler.c             CPU model: forward, backward, training, sampling, chat REPL
cuda/nanoeuler_cuda.cu  GPU engine: BPE, kernels, FlashAttention, pretrain/SFT/infer/chat, gradient check
data/get_gutenberg.sh   downloads + cleans the Gutenberg books corpus
data/get_web.sh         downloads a FineWeb-Edu web slice via the DuckDB CLI (no Python)
data/get_alpaca.sh      downloads the Alpaca instruction data for fine-tuning
Makefile  LICENSE  shakespeare.txt  .gitignore

MIT. See LICENSE.
