AutoKernel: Autoresearch for GPU Kernels

Original link: https://github.com/RightNow-AI/autokernel

## AutoKernel: Autonomous GPU Kernel Optimization

AutoKernel is an autoresearch system, inspired by Karpathy's work on LLM training, that automatically optimizes the GPU kernels of PyTorch models. Provide a PyTorch model and let the system run: it autonomously identifies performance bottlenecks, extracts them as Triton kernels, and iteratively optimizes each one.

The system works through an agent that modifies a single `kernel.py` file, benchmarks it against a fixed, strict 5-stage correctness check, and keeps or reverts each change. The process repeats indefinitely, guided by Amdahl's law to prioritize high-impact optimizations. It supports 9 core deep-learning kernel types (matmul, softmax, layernorm, etc.).

AutoKernel requires an NVIDIA GPU and Python 3.10+. It provides tools for profiling, kernel extraction, benchmarking, and verification, and records results in a human-readable TSV file. The agent follows detailed instructions in `program.md`, enabling long-running autonomous experimentation. The goal is to exploit Triton's readable syntax and fast compile times to achieve significant speedups while ensuring correctness at every step.

Hacker News: 7 points, submitted by frozenseven.

Original article:

Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels.

AutoKernel Progress

Inspired by @karpathy/autoresearch -- which demonstrated autonomous AI agents for LLM training research. AutoKernel applies the same philosophy to GPU kernel optimization: the agent modifies one file, runs a fixed evaluation, keeps or reverts the change, and repeats forever.

Give AutoKernel any PyTorch model. It will:

  1. Profile the model to find which GPU kernels are bottlenecks
  2. Extract each bottleneck as a standalone Triton kernel
  3. Optimize each kernel autonomously (edit, benchmark, keep/revert -- forever)
  4. Verify end-to-end correctness and report the total speedup

The agent reads program.md -- the "research org code" -- which contains comprehensive instructions for autonomous operation. It edits kernel.py one kernel at a time, runs bench.py (fixed benchmark with 5-stage correctness checks + roofline analysis), and either keeps or reverts the change. The orchestrator decides when to move to the next kernel using Amdahl's law.
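The keep-or-revert control loop is simple enough to sketch in a few lines of Python (hypothetical names; the real agent edits `kernel.py` and shells out to `bench.py` rather than calling an in-process function):

```python
# Minimal sketch of the keep-or-revert loop, assuming `evaluate` stands in
# for bench.py and returns (passed_correctness, latency_us) for a candidate.
def optimize(baseline_latency_us, candidates, evaluate):
    """Keep a candidate edit only if it passes correctness AND beats the
    best latency seen so far; otherwise it is reverted (discarded)."""
    best = baseline_latency_us
    kept = []
    for cand in candidates:
        passed, latency_us = evaluate(cand)  # correctness checks + timing
        if passed and latency_us < best:
            best = latency_us                # keep the change
            kept.append(cand)
        # else: revert -- the candidate never touches the working file
    return best, kept
```

A change survives only when it both passes correctness and improves on the best latency so far; everything else is discarded, which is what makes unattended overnight runs safe.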

Each experiment takes ~90 seconds. That's ~40 experiments/hour, ~320 overnight, across all kernels.

Requirements: NVIDIA GPU (tested on H100/A100/RTX 4090), Python 3.10+, uv.

```shell
# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/RightNow-AI/autokernel.git
cd autokernel
uv sync

# One-time setup: test data + baselines
uv run prepare.py

# Profile a model (ships with GPT-2, LLaMA, BERT -- no transformers needed)
uv run profile.py --model models/llama_7b.py --class-name LlamaModel \
  --input-shape 1,512 --dtype float16

# Extract top bottleneck kernels
uv run extract.py --top 5

# Verify benchmark works
uv run bench.py
```

Spin up Claude, Codex, or any coding agent in this directory:

```
Read program.md and let's kick off a new experiment. Start with setup.
```

The agent will:

  1. Profile your model and present the optimization plan
  2. Create a branch (e.g., autokernel/mar10-llama7b)
  3. Optimize each bottleneck kernel in priority order
  4. Verify end-to-end correctness and report total speedup

program.md is intentionally comprehensive so the agent can run 10+ hours without getting stuck. It includes a 6-tier optimization playbook, decision framework, crash handling, and Amdahl's law reasoning.

```
                 profile.py              extract.py           bench.py (loop)         verify.py
Any PyTorch  ──>  Rank kernels  ──>  Generate baseline  ──>  Optimize each  ──>  End-to-end
   model          by GPU time       Triton kernels          kernel (agent)       verification
```

| Tool | What it does |
| --- | --- |
| `profile.py` | Profiles any PyTorch model with `torch.profiler`, ranks kernels by GPU time, classifies each as compute- or memory-bound |
| `extract.py` | Extracts the top-N bottleneck kernels from profiling results into standalone Triton kernel files |
| `orchestrate.py` | Multi-kernel scheduler: decides which kernel to optimize next using Amdahl's law, tracks aggregate progress |
| `bench.py` | Fixed benchmark: 5-stage correctness (smoke, shape sweep, numerical stability, determinism, edge cases) + performance + roofline |
| `verify.py` | Plugs optimized kernels back into the model, checks end-to-end correctness, reports total speedup |

9 kernel types covering the core operations of modern deep learning:

| Kernel | Description | Key metric |
| --- | --- | --- |
| `matmul` | Dense matrix multiplication, (M x K) @ (K x N) | TFLOPS |
| `softmax` | Row-parallel, numerically stable softmax | GB/s |
| `layernorm` | Layer normalization with affine transform | GB/s |
| `rmsnorm` | RMS normalization (LLaMA-style) | GB/s |
| `flash_attention` | Scaled dot-product attention with causal masking | TFLOPS |
| `fused_mlp` | SwiGLU-style fused MLP (gate + up + down) | TFLOPS |
| `cross_entropy` | Fused cross-entropy loss | GB/s |
| `rotary_embedding` | Rotary position embeddings (RoPE) | GB/s |
| `reduce` | Parallel reduction (sum) | GB/s |

Each has a PyTorch reference in reference.py and a starter Triton kernel in kernels/.
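To illustrate what such a reference looks like, here is a pure-Python sketch of the numerically stable row softmax that `reference.py` implements with PyTorch tensors (the actual file operates on `torch` tensors, not lists; this is only a didactic equivalent):

```python
import math

def softmax_row(row):
    """Numerically stable softmax over one row: subtract the row max before
    exponentiating so exp() never overflows, even for very large logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]
```

The max-subtraction trick is exactly what the "numerical stability" stage of the 5-stage correctness check is designed to catch when a kernel skips it.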

Self-contained model definitions ship with AutoKernel (no transformers library needed):

| Model | File | Params | Usage |
| --- | --- | --- | --- |
| GPT-2 Small | `models/gpt2.py` | 124M | `--class-name GPT2 --input-shape 1,1024` |
| LLaMA (compact) | `models/llama_7b.py` | 160M | `--class-name LlamaModel --input-shape 1,512` |
| LLaMA 7B | `models/llama_7b.py` | 7B | `--class-name LlamaModel7B --input-shape 1,2048` |
| BERT-base | `models/bert_base.py` | 110M | `--class-name BertModel --input-shape 8,512` |
| Custom | `models/custom.py` | -- | Template for your own model |

For HuggingFace models (install the extras with `uv sync --extra models`):

```shell
uv run profile.py --module transformers --class-name AutoModelForCausalLM \
  --pretrained meta-llama/Llama-2-7b-hf --input-shape 1,2048 --dtype float16
```

```
autokernel/
  kernel.py             the file the agent modifies (one kernel at a time)
  program.md            agent instructions -- the "research org code"

  bench.py              fixed benchmark + 5-stage correctness harness
  reference.py          PyTorch reference implementations (ground truth)
  prepare.py            one-time setup: test data, baselines

  profile.py            profile any PyTorch model, rank kernels by GPU time
  extract.py            extract bottleneck kernels into workspace/
  orchestrate.py        multi-kernel scheduler (Amdahl's law)
  verify.py             end-to-end model verification + speedup report
  analysis.py           experiment visualization (generates progress.png)

  kernels/              starter Triton kernels (9 types)
  models/               self-contained model definitions (GPT-2, LLaMA, BERT)
  workspace/            runtime artifacts (gitignored)
```

Why Triton. Readable Python-like syntax the agent can understand and modify without mastering inline PTX or SASS. Well-tuned Triton regularly reaches 80-95% of cuBLAS. The agent needs to iterate fast -- Triton compiles in seconds, not minutes.

Correctness first. The benchmark checks kernel output against PyTorch before measuring performance. A fast but wrong kernel is immediately reverted. This prevents the agent from "optimizing" by producing garbage.

Amdahl's law orchestration. The orchestrator prioritizes by impact. A 1.5x speedup on a 60% kernel (1.25x end-to-end) beats a 3x speedup on a 5% kernel (1.03x end-to-end). It moves on when diminishing returns set in.
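The arithmetic behind that prioritization is one line of Amdahl's law:

```python
def end_to_end_speedup(time_fraction, kernel_speedup):
    """Overall model speedup when a kernel taking `time_fraction` of total
    GPU time is made `kernel_speedup`x faster (Amdahl's law)."""
    return 1.0 / ((1.0 - time_fraction) + time_fraction / kernel_speedup)

# Reproducing the numbers above:
# end_to_end_speedup(0.60, 1.5)  -> 1.25   (1.5x on a 60% kernel)
# end_to_end_speedup(0.05, 3.0)  -> ~1.03  (3x on a 5% kernel)
```

The orchestrator uses this to pick the next kernel and to detect when further effort on the current one can no longer move the end-to-end number.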

Single file to modify. The agent only touches kernel.py. Scope stays manageable, diffs reviewable, reverts clean.

TSV logging. Results go to a plain results.tsv file. Human-readable, git-friendly, trivially parseable, no infrastructure.

Every experiment is logged to results.tsv (tab-separated):

| Column | Description |
| --- | --- |
| `experiment` | Sequential experiment number (0 = baseline) |
| `tag` | Short identifier |
| `kernel_type` | Which kernel (e.g., `matmul`) |
| `throughput_tflops` | Measured throughput (higher is better) |
| `latency_us` | Execution time in microseconds |
| `pct_peak` | Percentage of the GPU's theoretical peak |
| `speedup_vs_pytorch` | Speedup vs. PyTorch/cuBLAS |
| `correctness` | PASS, FAIL, TIMEOUT, or CRASH |
| `peak_vram_mb` | Peak GPU memory usage (MB) |
| `description` | What was tried |
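Because the log is plain TSV, the stdlib is enough to analyze a run. A hypothetical sketch (the column names follow the schema above; the sample rows are invented for illustration):

```python
import csv
import io

# Invented sample of a results.tsv fragment (tab-separated, as logged).
SAMPLE_TSV = (
    "experiment\ttag\tkernel_type\tlatency_us\tcorrectness\tdescription\n"
    "0\tbaseline\tmatmul\t85.0\tPASS\tstarter kernel\n"
    "1\ttile128\tmatmul\t68.0\tPASS\tlarger tiles\n"
    "2\tbad_swizzle\tmatmul\t34.0\tFAIL\twrong output\n"
)

def best_passing(tsv_text):
    """Return the lowest-latency row that passed correctness;
    FAIL/TIMEOUT/CRASH rows are ignored no matter how fast they look."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    passing = [r for r in rows if r["correctness"] == "PASS"]
    return min(passing, key=lambda r: float(r["latency_us"]))
```

Note that the fastest row overall (`bad_swizzle`) is excluded: a fast but wrong kernel never counts, mirroring the correctness-first rule above.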

This project is autoresearch for GPU kernels -- directly inspired by Andrej Karpathy's autoresearch, the original experiment in autonomous AI research agents for LLM training. Karpathy showed that an AI agent can run hundreds of experiments overnight, methodically exploring a search space and logging every result. AutoKernel applies that same loop -- agent edits one file, runs a fixed evaluation, keeps or reverts -- to the domain of GPU kernel optimization with Triton.

Built by the team behind Forge.

MIT
