autoresearch
Set up an autonomous AI research loop that modifies a GPT training script, evaluates results against a single metric, and runs ~100 experiments overnight on a single GPU with no human intervention.
Core Idea
- Give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats.
- You are not touching any of the Python files like you normally would as a researcher. Instead, you are programming the program.md Markdown file that provides context to the AI agent and sets up your autonomous research org.
- The default program.md is intentionally a bare-bones baseline, though it's obvious how one could iterate on it over time to find the 'research org code' that achieves the fastest research progress.
Autonomous Research Loop
```mermaid
flowchart TD
    A[Agent reads program.md] --> B[Modifies train.py]
    B --> C[Trains for 5 min wall clock]
    C --> D["Checks val_bpb\n(validation bits per byte, lower is better)"]
    D -->|Improved| E[Keep changes]
    D -->|No improvement| F[Discard changes]
    E --> B
    F --> B
```
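The keep-or-discard rule amounts to a greedy hill-climb on val_bpb. A minimal Python sketch of the loop's control flow, with the actual training run stubbed out by a random score (the real loop shells out to `uv run train.py` and the agent edits train.py between runs):

```python
import random

def run_training() -> float:
    """Stand-in for `uv run train.py`: the real script trains a small GPT
    for ~5 min of wall clock and reports val_bpb (lower is better)."""
    return random.uniform(0.9, 1.2)

def research_loop(n_experiments: int) -> float:
    """Greedy hill-climb: an edit to train.py survives only if it lowers val_bpb."""
    best_bpb = float("inf")
    for _ in range(n_experiments):
        # ... agent modifies train.py here, guided by program.md ...
        new_bpb = run_training()
        if new_bpb < best_bpb:
            best_bpb = new_bpb  # improved: keep changes
        # else: discard changes (revert train.py to the previous version)
    return best_bpb
```

At ~12 experiments/hour this loop runs roughly 100 iterations overnight, which is why a single cheap accept/reject metric matters more than a sophisticated search strategy.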
Project Structure
| File | Role | Who edits |
| --- | --- | --- |
| prepare.py | Fixed constants, one-time data prep (downloads training data, trains BPE tokenizer), runtime utilities (dataloader, evaluation) | Not modified |
| train.py | Full GPT model, optimizer (Muon + AdamW), training loop. Everything fair game: architecture, hyperparameters, optimizer, batch size | Agent |
| program.md | Baseline instructions for one agent. A super-lightweight 'skill' | Human |
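Since program.md is the only file the human programs, it helps to see the shape of one. This is a hypothetical minimal example in the spirit of the bare-bones baseline, not the repo's actual file:

```markdown
# program.md (hypothetical minimal example)

You are an autonomous ML researcher. Your only metric is val_bpb
(validation bits per byte, lower is better).

1. Read the current train.py.
2. Propose ONE change (architecture, optimizer, hyperparameter, batch size).
3. Run `uv run train.py` and note the final val_bpb.
4. If val_bpb beat the best so far, keep the change; otherwise revert it.
5. Log what you tried and why, then return to step 2.
```

Iterating on this file (e.g. adding experiment logs, search heuristics, or multiple agent roles) is the intended way to improve research throughput.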
Quick Start
Setup (requires a single NVIDIA GPU, Python 3.10+, uv)
☐ Install uv: curl -LsSf https://astral.sh/uv/install.sh | sh
☐ Install dependencies: uv sync
☐ Download data and train tokenizer (one-time, ~2 min): uv run prepare.py
☐ Manual single training run (~5 min): uv run train.py
Autonomous mode
☐ Spin up Claude/Codex in the repo (disable all permission prompts so it can run unattended)
☐ Prompt: 'have a look at program.md and let's kick off a new experiment'
Design Choices
| Choice | Rationale |
| --- | --- |
| Single file to modify (train.py only) | Keeps scope manageable and diffs reviewable |
| Fixed 5-min time budget (wall clock, excluding startup/compilation) | Makes experiments directly comparable regardless of what the agent changes, and finds the best model for your platform within that budget. ~12 experiments/hour, ~100 overnight. Downside: results are not comparable across different compute platforms |
| Self-contained (PyTorch + a few small packages) | No distributed training, no complex configs. One GPU, one file, one metric |
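The fixed wall-clock budget can be enforced with a simple timer around the training step loop. A sketch under assumed names (`step_fn` is a hypothetical per-step callable, not the repo's API; the timer starts after startup/compilation, per the design choice above):

```python
import time

TRAIN_SECONDS = 5 * 60  # fixed wall-clock budget, excluding startup/compilation

def train_with_budget(step_fn, budget_s: float = TRAIN_SECONDS) -> int:
    """Run training steps until the wall-clock budget is spent; return step count."""
    start = time.monotonic()  # start the clock only after warmup/compile is done
    steps = 0
    while time.monotonic() - start < budget_s:
        step_fn()  # one fwd/bwd/update
        steps += 1
    return steps
```

A time budget (rather than a step budget) is what makes an agent's speed optimizations count: a faster step means more steps, and usually a lower val_bpb, within the same 5 minutes.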
Tuning for Smaller Compute
| Parameter | Recommendation |
| --- | --- |
| Dataset | Use lower-entropy data like TinyStories (GPT-4-generated short stories) for reasonable results with smaller models |
| vocab_size | Decrease from 8192 down to 4096, 2048, 1024, or byte-level (256) |
| MAX_SEQ_LEN (prepare.py) | Lower significantly, even down to 256. May want to increase DEVICE_BATCH_SIZE in train.py to compensate (tokens per fwd/bwd = product of both) |
| EVAL_TOKENS (prepare.py) | Decrease so validation loss is evaluated on less data |
| DEPTH (train.py) | Primary knob for model complexity (default 8). Lower to e.g. 4 |
| WINDOW_PATTERN | Use just 'L' instead of 'SSSL' (alternating banded attention may be very inefficient) |
| TOTAL_BATCH_SIZE | Lower significantly, keeping powers of 2, e.g. down to 2^14 (~16K) |
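A quick arithmetic check of how these batch knobs interact (parameter names follow the table above; the concrete values are illustrative choices, not defaults):

```python
# Example settings after tuning down for smaller compute.
MAX_SEQ_LEN = 256         # lowered significantly, per the table
DEVICE_BATCH_SIZE = 32    # raised to compensate for shorter sequences
TOTAL_BATCH_SIZE = 2**14  # ~16K tokens, kept a power of 2

# Tokens processed in one fwd/bwd pass is the product of the two knobs:
tokens_per_fwd_bwd = DEVICE_BATCH_SIZE * MAX_SEQ_LEN  # 32 * 256 = 8192

# Gradient-accumulation steps needed to reach the total batch size:
assert TOTAL_BATCH_SIZE % tokens_per_fwd_bwd == 0
grad_accum_steps = TOTAL_BATCH_SIZE // tokens_per_fwd_bwd  # 16384 // 8192 = 2
```

Keeping everything a power of 2 ensures the total batch divides evenly into per-device passes, so no tokens are dropped or duplicated by accumulation.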