Autoresearch: Agents researching on single-GPU nanochat training automatically

Original link: https://github.com/karpathy/autoresearch

## Autonomous AI Research: A New Era

Frontier AI research has shifted from human-led experimentation to fully autonomous AI agents running on vast compute infrastructure. Launched in March 2026, this project demonstrates the core principle: enabling AI to improve language models on its own through continuous, automated experimentation.

The system works by giving an agent a minimal LLM training setup (nanochat) and a set of instructions (`program.md`). The agent modifies the training code (`train.py`), runs 5-minute training cycles, evaluates the results (using validation bits per byte), and repeats, effectively "programming the program" rather than the code directly.

The repository is deliberately kept simple, with three key files: a fixed data-preparation script (`prepare.py`), an editable training loop (`train.py`), and the agent's instructions (`program.md`). This keeps experiments focused and code changes manageable. It currently requires a single NVIDIA GPU. The project aims to showcase a new paradigm in which AI drives its own evolution, reportedly now at its 10,205th code generation (though this cannot be verified).

## LLM Autoresearch: A Hacker News Summary

A recent Hacker News discussion centered on Andrej Karpathy's "autoresearch" project (github.com/karpathy), which uses LLMs to automatically research how to improve a nanochat model. The system trains on a single GPU, modifying code and hyperparameters to optimize performance.

The conversation explored both the potential and the risks of this approach. Concerns ranged from automated malware creation and self-improving AI with unforeseen consequences (even a "singularity") down to the more practical question of whether it amounts to little more than elaborate hyperparameter tuning. Many commenters highlighted the growing automation of AI research itself, and some are already experimenting with similar self-improving agents.

One key observation was that current LLMs show a "cautious" reluctance to explore genuinely novel research directions, and that the project's initial "improvements" sometimes involved seemingly random changes, such as swapping the random seed. Even so, the project was seen as an important step toward fully open, self-improving AI models, with the potential for significant impact even on limited resources. The discussion also touched on the need to prepare society for the inevitable changes brought by increasingly capable AI.
## Original

One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of "group meeting". That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that's right or wrong as the "code" is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began. -@karpathy, March 2026.

The idea: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards the change, and repeats. You wake up in the morning to a log of experiments and (hopefully) a better model. The training code here is a simplified single-GPU implementation of nanochat. The core idea is that you're not touching any of the Python files like you normally would as a researcher. Instead, you are programming the program.md Markdown file that provides context to the AI agents and sets up your autonomous research org. The default program.md in this repo is intentionally kept as a bare-bones baseline, though it's obvious how one would iterate on it over time to find the "research org code" that achieves the fastest research progress, how you'd add more agents to the mix, etc. A bit more context on this project is here in this tweet.

The repo is deliberately kept small and only really has three files that matter:

  • prepare.py — fixed constants, one-time data prep (downloads training data, trains a BPE tokenizer), and runtime utilities (dataloader, evaluation). Not modified.
  • train.py — the single file the agent edits. Contains the full GPT model, optimizer (Muon + AdamW), and training loop. Everything is fair game: architecture, hyperparameters, optimizer, batch size, etc. This file is edited and iterated on by the agent.
  • program.md — baseline instructions for one agent. Point your agent here and let it go. This file is edited and iterated on by the human.

By design, training runs for a fixed 5-minute time budget (wall clock, excluding startup/compilation), regardless of the details of your compute. The metric is val_bpb (validation bits per byte) — lower is better, and vocab-size-independent so architectural changes are fairly compared.
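Bits per byte is the model's summed cross-entropy converted from nats to bits and normalized by the byte length of the validation text, which is what makes it independent of vocabulary size. A sketch of the standard conversion (the function name and signature are illustrative, not the repo's actual API):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Summed next-token cross-entropy (in nats) over a validation set,
    converted to bits (divide by ln 2) and normalized per UTF-8 byte."""
    return total_nll_nats / math.log(2) / total_bytes

# A tokenizer with a bigger vocab sees fewer tokens but pays more nats per
# token; dividing by *bytes* rather than tokens keeps the comparison fair
# across architectural and tokenizer changes.
```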

Requirements: A single NVIDIA GPU (tested on H100), Python 3.10+, uv.

```shell
# 1. Install the uv project manager (if you don't already have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Install dependencies
uv sync

# 3. Download data and train the tokenizer (one-time, ~2 min)
uv run prepare.py

# 4. Manually run a single training experiment (~5 min)
uv run train.py
```

If the above commands all work, your setup is good and you can go into autonomous research mode.

Platform support. This code currently requires a single NVIDIA GPU. In principle it is quite possible to support CPU, MPS, and other platforms, but that would also bloat the code, and I'm not sure I want to take it on personally right now. The code is just a demonstration, and I don't know how much I'll support it going forward. People can reference (or have their agents reference) the full/parent nanochat repository, which has wider platform support and shows the various solutions (e.g. a Flash Attention 3 kernel fallback implementation, generic device support, autodetection, etc.). Feel free to create forks or discussions for other platforms; I'm happy to link to them here in the README in a new notable-forks section.

Simply spin up Claude/Codex or whatever agent you want in this repo (with all permission prompts disabled so it can act autonomously), then you can prompt something like:

Hi have a look at program.md and let's kick off a new experiment! let's do the setup first.

The program.md file is essentially a super lightweight "skill".
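Concretely, a bare-bones program.md could look something like this. This is an illustrative sketch based on the loop described above, not the repo's actual file:

```markdown
# Research program

You are an autonomous ML researcher. Your goal: minimize val_bpb.

1. Read train.py. Only ever edit train.py (never prepare.py).
2. Make one focused change (architecture, optimizer, hyperparameters, ...).
3. Run `uv run train.py` and record the final val_bpb.
4. If val_bpb improved on the best so far, keep the change; otherwise revert it.
5. Append a one-line summary of the experiment to a log, then go to step 2.
```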

```
prepare.py      — constants, data prep + runtime utilities (do not modify)
train.py        — model, optimizer, training loop (agent modifies this)
program.md      — agent instructions
pyproject.toml  — dependencies
```
  • Single file to modify. The agent only touches train.py. This keeps the scope manageable and diffs reviewable.
  • Fixed time budget. Training always runs for exactly 5 minutes, regardless of your specific platform, so you can expect roughly 12 experiments per hour and roughly 100 experiments while you sleep. This design has two upsides. First, experiments are directly comparable no matter what the agent changes (model size, batch size, architecture, etc.). Second, autoresearch finds the model best suited to your platform within that time budget. The downside is that your runs (and results) are not directly comparable to those of people running on other compute platforms.
  • Self-contained. No external dependencies beyond PyTorch and a few small packages. No distributed training, no complex configs. One GPU, one file, one metric.
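The fixed wall-clock budget can be enforced with a loop that checks elapsed time after each step, with the clock started only once startup/compilation is done. A minimal sketch; `step_fn` and this budget-handling code are assumptions for illustration, not the repo's actual implementation:

```python
import time

TRAIN_SECONDS = 5 * 60  # fixed budget; the clock starts after startup/compilation

def train_for_budget(step_fn, budget_s: float = TRAIN_SECONDS) -> int:
    """Run optimizer steps until the wall-clock budget is exhausted.

    Because the budget is time rather than a step count, a change that makes
    each step slower (e.g. a bigger model) automatically gets fewer steps,
    which is what keeps experiments comparable."""
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_s:
        step_fn()  # one forward/backward/update
        steps += 1
    return steps
```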

MIT
