Scaling Autoresearch with 16 GPUs via SkyPilot


Scale Karpathy's autoresearch from one GPU to 16 using SkyPilot, running ~910 experiments in 8 hours with factorial grid search that catches interaction effects greedy sequential search would miss.

Core Result

  • Claude Code + SkyPilot + 16 GPUs on Kubernetes ran ~910 experiments in ~8 hours, driving val_bpb (validation bits per byte) from 1.003 to 0.974 (2.87% improvement over baseline)
  • Parallelism changed how the agent searched: with 1 GPU it did greedy hill-climbing; with 16 GPUs it ran factorial grids of 10-13 experiments per wave, catching interaction effects sequential search would miss
  • The agent discovered it had access to both H100s and H200s and self-invented a two-tier strategy: screen ideas on H100s, promote winners to H200 for validation
  • Parallel agent reached the same best validation loss 9x faster than simulated sequential baseline (~8 hours vs ~72 hours)

Autoresearch Project Structure

| File | Role | Agent Access |
| --- | --- | --- |
| prepare.py | Downloads data, trains the tokenizer, provides the dataloader and evaluation function | Read-only |
| train.py | GPT model, optimizer, and training loop | Only file the agent modifies |
| program.md | Instructions: what to change, how to evaluate, when to keep vs. discard | Read-only |

The Autoresearch Loop

graph TD
    A[Agent edits train.py ~30s] --> B[Training runs ~5 min]
    B --> C[Agent reads result, plans next ~30s]
    C --> D{Keep or discard?}
    D -->|val_bpb improved| E[Commit change]
    D -->|No improvement| F[Discard change]
    E --> A
    F --> A
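The keep/discard loop above can be sketched in a few lines of Python. Here `propose_edit` and `run_training` are hypothetical stand-ins for the agent editing train.py and launching a 5-minute training run; the real harness is driven by the agent itself:

```python
# Minimal sketch of the autoresearch keep/discard loop.
# propose_edit and run_training are hypothetical stand-ins for the
# agent editing train.py and running a 5-minute training job.

def autoresearch_loop(propose_edit, run_training, baseline_bpb, budget):
    best_bpb = baseline_bpb
    kept = []
    for _ in range(budget):
        change = propose_edit(best_bpb)    # ~30s: agent edits train.py
        val_bpb = run_training(change)     # ~5 min: training run
        if val_bpb < best_bpb:             # lower bits/byte is better
            best_bpb = val_bpb
            kept.append(change)            # commit the change
        # else: discard the change and try something new
    return best_bpb, kept
```

With 1 GPU this loop is strictly sequential: each ~6-minute iteration produces exactly one data point before the next decision.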

Sequential vs Parallel Search

| Sequential (1 GPU) | Parallel (16 GPUs) |
| --- | --- |
| ~10 experiments/hour | ~90 experiments/hour |
| Greedy hill-climbing: try one thing, check, repeat | Factorial grids: test 3 values × 4 values = 12 experiments in one 5-min wave |
| 1 experiment per decision | 10-13 simultaneous experiments per decision |
| Testing 6 aspect ratios = 30 min of waiting | Testing 6 aspect ratios = one 5-min wave |
| Easier to get stuck in local optima | Catches interaction effects between parameters |
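The factorial-grid pattern is straightforward to sketch: enumerate the cross product of candidate values and chunk it into waves that fit the available GPUs. The parameter names below are illustrative, not the agent's actual sweep:

```python
import itertools

def factorial_waves(grid, n_gpus=16):
    """Cross all candidate values, then chunk into waves of <= n_gpus."""
    keys = list(grid)
    configs = [dict(zip(keys, vals))
               for vals in itertools.product(*grid.values())]
    return [configs[i:i + n_gpus] for i in range(0, len(configs), n_gpus)]

# 3 x 4 = 12 experiments: one 5-minute wave on 16 GPUs
waves = factorial_waves({
    "matrix_lr":    [0.04, 0.05, 0.06],
    "weight_decay": [0.05, 0.08, 0.1, 0.2],
})
```

Because every combination runs, the grid exposes interactions (e.g. a learning rate that only wins at one weight-decay setting) that a one-variable-at-a-time search would miss.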

Five Phases of Optimization

Phase 1 (~exp 1-200)
  • Hyperparameter sweeps: batch size, Adam betas, weight decay, window patterns, depth, LR schedules
  • Findings: halving batch size to 2^18 helped (more optimizer steps); Adam betas (0.9, 0.95) beat default; weight decay 0.08 better than 0.2; softcap 10 gave small improvement; depth 10+ crashed or hurt
  • val_bpb: 1.003 → 0.981 (Δ = 0.022)
Phase 2 (~exp 200-420)
  • Architecture discovery: tested 6 aspect ratios (AR=48, 64, 72, 80, 90, 96) simultaneously
  • Scaling model width from AR~48 (model_dim=384) to AR=96 (model_dim=768) outperformed every hyperparameter tweak from Phase 1
  • AR=112 too big (not enough training steps in 5 min); AR=96 sweet spot: fit in 64GB VRAM, ~1,060 steps on H100
  • val_bpb: 0.981 → 0.977 (Δ = 0.004)
Phase 3 (~exp 420-560)
  • Fine-tuning wider model: warmdown schedule, matrix LR, weight decay, Newton-Schulz steps for Muon optimizer
  • val_bpb: 0.977 → 0.975 (Δ = 0.002)
Phase 4 (~exp 560-700)
  • Biggest late-stage find: muon_beta2=0.98 (up from 0.95). Higher beta2 smoothed gradient normalization, allowed larger effective steps
  • Tested beta2 in {0.95, 0.96, 0.97, 0.98, 0.99} across 10 clusters in one wave (5 min vs 25 min sequential)
  • val_bpb: 0.975 → 0.974 (Δ = 0.001)
Phase 5 (~exp 700-910)
  • Diminishing returns: combinatorial sweeps over final LR fraction, warmdown ratio, scalar LR, embedding LR
  • Returns dropped below 0.0001 per experiment. Further gains would require new architectural ideas or longer training budgets

Best Configuration Found

ASPECT_RATIO = 96      # model_dim = 8 * 96 = 768
DEPTH = 8              # 8 transformer layers
WINDOW_PATTERN = "SL"  # alternating Sliding + Local attention
TOTAL_BATCH_SIZE = 2**18  # ~524K tokens/step
MATRIX_LR = 0.05       # Muon LR for weight matrices
EMBEDDING_LR = 0.6     # AdamW LR for token embeddings
SCALAR_LR = 0.5        # AdamW LR for residual mixing scalars
ADAM_BETAS = (0.70, 0.95)
WEIGHT_DECAY = 0.08
WARMDOWN_RATIO = 0.6
FINAL_LR_FRAC = 0.05
# Muon: momentum=0.95, ns_steps=5, beta2=0.98

Final configuration after ~910 experiments, achieving val_bpb = 0.974
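One plausible reading of WARMDOWN_RATIO and FINAL_LR_FRAC is the common constant-then-linear-warmdown shape: hold the learning rate flat, then decay it linearly over the last 60% of training down to 5% of base. This is a sketch under that assumption; the actual schedule in train.py may differ:

```python
def lr_at(step, total_steps, base_lr,
          warmdown_ratio=0.6, final_lr_frac=0.05):
    """Constant LR, then a linear warmdown over the final warmdown_ratio
    of training, ending at final_lr_frac * base_lr."""
    warmdown_start = int(total_steps * (1 - warmdown_ratio))
    if step < warmdown_start:
        return base_lr
    progress = (step - warmdown_start) / (total_steps - warmdown_start)
    return base_lr * (1 - progress * (1 - final_lr_frac))
```

At ~1,060 steps on an H100, this would keep the matrix LR at 0.05 for the first ~424 steps and land at 0.0025 on the final step.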

Emergent Hardware Exploitation Strategy

  • Of the 16 clusters, the agent got 13 H100s (80GB VRAM, ~283ms/step) and 3 H200s (141GB VRAM, ~263ms/step). The agent was not told about the performance differences.
  • After a few waves, the agent noticed identical configs scored ~0.005 val_bpb lower on H200 because faster step time meant more training steps in the 5-min budget (H200 runs ~9% more steps than H100)
  • Without prompting, the agent developed a two-tier strategy: screen 10+ hypotheses cheaply on H100s in parallel, promote top 2-3 to H200 for confirmation
  • Rankings didn't always transfer across hardware. FINAL_LR_FRAC=0.03 sometimes beat 0.05 on H100 but consistently lost on H200. Likely explanation: with more steps, the model benefits from keeping LR higher toward end of schedule. The agent's self-invented validation tier caught these discrepancies.
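The self-invented two-tier strategy amounts to a cheap screening pass followed by validation on the faster hardware. A minimal sketch, where `run_on_h100` and `run_on_h200` are hypothetical callables returning val_bpb for a candidate config:

```python
def two_tier(candidates, run_on_h100, run_on_h200, promote_k=3):
    """Screen all candidates cheaply in parallel on H100s, then
    re-validate only the top promote_k on H200 before committing.
    Illustrative sketch of the agent's emergent strategy."""
    screened = sorted(candidates, key=run_on_h100)   # lower bpb = better
    finalists = screened[:promote_k]
    return min(finalists, key=run_on_h200)           # H200 makes the call
```

Re-ranking the finalists on H200 is what caught cases like FINAL_LR_FRAC=0.03, which screened well on H100 but consistently lost under the H200's longer effective training run.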

Infrastructure and Cost

| Component | Detail |
| --- | --- |
| Orchestration | SkyPilot (open-source): launches jobs across clouds and Kubernetes from YAML; includes a skill that teaches coding agents to use it |
| Parallelism mechanism | sky launch provisions clusters; sky exec pipelines experiments on the same cluster with zero idle time; -d flag for detached mode |
| Supported backends | Kubernetes, AWS, GCP, Azure, SLURM (20+ infra backends) |
| Claude Code API cost | ~$9 for the 8-hour session |
| GPU compute cost | 13 H100s × 8 h × ~$2/h + 3 H200s × 8 h × ~$2.3/h = under $300 total |
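The launch-then-exec pipelining pattern can be sketched by building the CLI invocations one wave at a time. The cluster names and YAML path below are illustrative; treat the exact flags as an assumption about the SkyPilot CLI rather than a verified recipe:

```python
def launch_cmd(cluster, yaml_path="experiment.yaml"):
    # First wave: provision a cluster and run the task, detached (-d).
    return ["sky", "launch", "-d", "-c", cluster, yaml_path]

def exec_cmd(cluster, yaml_path="experiment.yaml"):
    # Later waves: queue another experiment on the already-provisioned
    # cluster, skipping setup so waves pipeline with zero idle time.
    return ["sky", "exec", "-d", cluster, yaml_path]

clusters = [f"exp-{i}" for i in range(16)]
wave1 = [launch_cmd(c) for c in clusters]   # provisions 16 clusters
wave2 = [exec_cmd(c) for c in clusters]     # reuses them for wave 2
```

Keeping clusters alive between waves is what makes the ~90 experiments/hour rate possible: each 5-minute wave starts the moment the previous one finishes, with no re-provisioning overhead.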

Getting Started

Quick Setup
☐ Run the one-line setup script: curl -sL https://raw.githubusercontent.com/.../setup.sh | bash
☐ cd autoresearch
☐ Launch: claude "Read instructions.md and start running parallel experiments."
☐ Agent handles everything: reads instructions, fetches SkyPilot skill, provisions clusters, submits experiments, commits winners, loops
Manual Setup
☐ Clone autoresearch and skypilot repos
☐ Copy experiment.yaml and instructions.md into autoresearch/
☐ Run pip install uv && uv sync && uv run prepare.py
☐ Install the SkyPilot skill for your agent
☐ Set infra: in YAML to target backend (k8s, aws, etc.) or let SkyPilot pick cheapest
☐ Works with Claude Code, Codex, or any coding agent that can run shell commands