autoresearch


Set up an autonomous AI research loop that modifies a GPT training script, evaluates results against a single metric, and runs ~100 experiments overnight on a single GPU with no human intervention.

Core Idea

  • Give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats.
  • You are not touching any of the Python files like you normally would as a researcher. Instead, you are programming the program.md Markdown file that provides context to the AI agent and sets up your autonomous research org.
  • The default program.md is intentionally kept as a bare-bones baseline, though it is clear how one would iterate on it over time to find the 'research org code' that achieves the fastest research progress.

Autonomous Research Loop

```mermaid
flowchart TD
    A[Agent reads program.md] --> B[Modifies train.py]
    B --> C[Trains for 5 min wall clock]
    C --> D[Checks val_bpb\nvalidation bits per byte\nlower is better]
    D -->|Improved| E[Keep changes]
    D -->|No improvement| F[Discard changes]
    E --> B
    F --> B
```
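The loop above can be sketched in Python. This is a hedged illustration, not the project's actual harness: `propose_edit` is a hypothetical stand-in for the agent's code-modification step, and the `val_bpb:` log line is an assumed output format for train.py.

```python
import shutil
import subprocess


def read_val_bpb(output: str) -> float:
    """Parse the final validation metric from training output
    (assumes train.py logs a 'val_bpb: <float>' line)."""
    for line in reversed(output.splitlines()):
        if line.startswith("val_bpb:"):
            return float(line.split(":")[1])
    raise RuntimeError("no val_bpb found in training output")


def research_loop(n_experiments: int = 100) -> float:
    """Keep/discard loop: snapshot train.py, let the agent edit it,
    train for the fixed budget, and revert unless val_bpb improved
    (lower is better)."""
    best = float("inf")
    for _ in range(n_experiments):
        shutil.copy("train.py", "train.py.bak")
        propose_edit("train.py")  # hypothetical: agent modifies the script
        result = subprocess.run(["uv", "run", "train.py"],
                                capture_output=True, text=True)
        bpb = read_val_bpb(result.stdout)
        if bpb < best:
            best = bpb                               # keep changes
        else:
            shutil.copy("train.py.bak", "train.py")  # discard changes
    return best
```

At ~5 minutes per iteration this loop runs roughly 12 experiments per hour, i.e. ~100 overnight.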

Project Structure

| File | Role | Who edits |
|---|---|---|
| prepare.py | Fixed constants; one-time data prep (downloads training data, trains BPE tokenizer); runtime utilities (dataloader, evaluation) | Not modified |
| train.py | Full GPT model, optimizer (Muon + AdamW), training loop. Everything is fair game: architecture, hyperparameters, optimizer, batch size | Agent |
| program.md | Baseline instructions for one agent; a super-lightweight 'skill' | Human |

Quick Start

Setup (requires single NVIDIA GPU, Python 3.10+, uv)
☐ Install uv: curl -LsSf https://astral.sh/uv/install.sh | sh
☐ Install dependencies: uv sync
☐ Download data and train tokenizer (one-time, ~2 min): uv run prepare.py
☐ Manual single training run (~5 min): uv run train.py
Autonomous mode
☐ Spin up Claude/Codex in the repo (disable all permissions)
☐ Prompt: 'have a look at program.md and let's kick off a new experiment'

Design Choices

| Choice | Rationale |
|---|---|
| Single file to modify (train.py only) | Keeps scope manageable and diffs reviewable |
| Fixed 5-min time budget (wall clock, excluding startup/compilation) | Makes experiments directly comparable regardless of what the agent changes; finds the best model for your platform within that budget. ~12 experiments/hour, ~100 overnight. Downside: results are not comparable across different compute platforms |
| Self-contained (PyTorch + a few small packages) | No distributed training, no complex configs. One GPU, one file, one metric |
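The single metric, val_bpb, is validation bits per byte: total cross-entropy over the validation set converted to bits and normalized by the raw byte count. A minimal sketch of that conversion (the exact bookkeeping in prepare.py may differ):

```python
import math


def val_bpb(total_nats: float, total_bytes: int) -> float:
    """Bits per byte: cross-entropy summed over the validation set
    (in nats), converted to bits, divided by the raw byte count.
    Normalizing by bytes rather than tokens keeps the metric
    comparable when the agent changes vocab_size. Lower is better."""
    return total_nats / math.log(2) / total_bytes
```

For example, a validation set of 8 bytes with total loss 8·ln 2 nats scores exactly 1.0 bpb.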

Tuning for Smaller Compute

| Parameter | Recommendation |
|---|---|
| Dataset | Use lower-entropy data like TinyStories (GPT-4-generated short stories) for reasonable results with smaller models |
| vocab_size | Decrease from 8192 down to 4096, 2048, 1024, or byte-level (256) |
| MAX_SEQ_LEN (prepare.py) | Lower significantly, even down to 256. You may want to increase DEVICE_BATCH_SIZE in train.py to compensate (tokens per fwd/bwd = product of both) |
| EVAL_TOKENS (prepare.py) | Decrease so validation loss is evaluated on less data |
| DEPTH (train.py) | Primary knob for model complexity (default 8). Lower to e.g. 4 |
| WINDOW_PATTERN | Use just 'L' instead of 'SSSL' (alternating banded attention may be very inefficient) |
| TOTAL_BATCH_SIZE | Lower significantly; keep powers of 2, e.g. down to 2^14 (~16K) |
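The batch-related knobs above interact as follows; a small sketch with assumed example values (the variable names come from prepare.py and train.py as described, but these particular settings are illustrative, not the defaults):

```python
# Example "smaller compute" settings (illustrative values).
MAX_SEQ_LEN = 256          # lowered significantly, per the table above
DEVICE_BATCH_SIZE = 64     # raised in train.py to compensate
TOTAL_BATCH_SIZE = 2**14   # tokens per optimizer step (~16K)

# Tokens processed per forward/backward pass = product of the two knobs.
tokens_per_fwd_bwd = MAX_SEQ_LEN * DEVICE_BATCH_SIZE

# Keeping TOTAL_BATCH_SIZE a power of 2 lets it divide evenly into
# gradient-accumulation steps per optimizer update.
grad_accum_steps = TOTAL_BATCH_SIZE // tokens_per_fwd_bwd
```

With these example values, one forward/backward pass already covers 16,384 tokens, so a single pass per optimizer step suffices.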