Scaling Autoresearch with 16 GPUs via SkyPilot


Scale Karpathy's autoresearch from one GPU to 16 using SkyPilot, running ~910 experiments in 8 hours with factorial grid search that catches interaction effects greedy sequential search would miss.

Core Result

  • Claude Code + SkyPilot + 16 GPUs on Kubernetes ran ~910 experiments in ~8 hours, driving val_bpb (validation bits per byte) from 1.003 to 0.974 (2.87% improvement over baseline)
  • Parallelism changed how the agent searched: with 1 GPU it did greedy hill-climbing; with 16 GPUs it ran factorial grids of 10-13 experiments per wave, catching interaction effects sequential search would miss
  • The agent discovered it had access to both H100s and H200s and self-invented a two-tier strategy: screen ideas on H100s, promote winners to H200 for validation
  • Parallel agent reached the same best validation loss 9x faster than simulated sequential baseline (~8 hours vs ~72 hours)

Autoresearch Project Structure

| File | Role | Agent Access |
| --- | --- | --- |
| prepare.py | Downloads data, trains the tokenizer, provides the dataloader and evaluation function | Read-only |
| train.py | GPT model, optimizer, and training loop | Only file the agent modifies |
| program.md | Instructions: what to change, how to evaluate, when to keep vs. discard | Read-only |

The Autoresearch Loop

graph TD
    A[Agent edits train.py ~30s] --> B[Training runs ~5 min]
    B --> C[Agent reads result, plans next ~30s]
    C --> D{Keep or discard?}
    D -->|val_bpb improved| E[Commit change]
    D -->|No improvement| F[Discard change]
    E --> A
    F --> A
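The keep/discard loop above can be sketched in a few lines of Python. Here `propose_edit` and `run_training` are hypothetical stand-ins for the agent editing train.py and launching a 5-minute training run; the real harness is driven by the agent itself:

```python
# Minimal sketch of the autoresearch keep/discard loop.
# propose_edit and run_training are hypothetical stand-ins for the
# agent editing train.py and running a 5-minute training job.

def autoresearch_loop(propose_edit, run_training, baseline_bpb, budget):
    best_bpb = baseline_bpb
    kept = []
    for _ in range(budget):
        change = propose_edit(best_bpb)    # ~30s: agent edits train.py
        val_bpb = run_training(change)     # ~5 min: training run
        if val_bpb < best_bpb:             # lower bits/byte is better
            best_bpb = val_bpb
            kept.append(change)            # commit the change
        # else: discard the change and try something new
    return best_bpb, kept
```

With 1 GPU this loop is strictly sequential: each ~6-minute iteration produces exactly one data point before the next decision.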

Sequential vs Parallel Search

| Sequential (1 GPU) | Parallel (16 GPUs) |
| --- | --- |
| ~10 experiments/hour | ~90 experiments/hour |
| Greedy hill-climbing: try one thing, check, repeat | Factorial grids: test 3 values × 4 values = 12 experiments in one 5-min wave |
| 1 experiment per decision | 10-13 simultaneous experiments per decision |
| Testing 6 aspect ratios = 30 min of waiting | Testing 6 aspect ratios = one 5-min wave |
| Easier to get stuck in local optima | Catches interaction effects between parameters |
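The factorial-grid pattern is straightforward to sketch: enumerate the cross product of candidate values and chunk it into waves that fit the available GPUs. The parameter names below are illustrative, not the agent's actual sweep:

```python
import itertools

def factorial_waves(grid, n_gpus=16):
    """Cross all candidate values, then chunk into waves of <= n_gpus."""
    keys = list(grid)
    configs = [dict(zip(keys, vals))
               for vals in itertools.product(*grid.values())]
    return [configs[i:i + n_gpus] for i in range(0, len(configs), n_gpus)]

# 3 x 4 = 12 experiments: one 5-minute wave on 16 GPUs
waves = factorial_waves({
    "matrix_lr":    [0.04, 0.05, 0.06],
    "weight_decay": [0.05, 0.08, 0.1, 0.2],
})
```

Because every combination runs, the grid exposes interactions (e.g. a learning rate that only wins at one weight-decay setting) that a one-variable-at-a-time search would miss.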

Five Phases of Optimization

Phase 1 (~exp 1-200)
  • Hyperparameter sweeps: batch size, Adam betas, weight decay, window patterns, depth, LR schedules
  • Findings: halving batch size to 2^18 helped (more optimizer steps); Adam betas (0.9, 0.95) beat default; weight decay 0.08 better than 0.2; softcap 10 gave small improvement; depth 10+ crashed or hurt
  • val_bpb: 1.003 → 0.981 (Δ = 0.022)
Phase 2 (~exp 200-420)
  • Architecture discovery: tested 6 aspect ratios (AR=48, 64, 72, 80, 90, 96) simultaneously
  • Scaling model width from AR~48 (model_dim=384) to AR=96 (model_dim=768) outperformed every hyperparameter tweak from Phase 1
  • AR=112 too big (not enough training steps in 5 min); AR=96 sweet spot: fit in 64GB VRAM, ~1,060 steps on H100
  • val_bpb: 0.981 → 0.977 (Δ = 0.004)
Phase 3 (~exp 420-560)
  • Fine-tuning wider model: warmdown schedule, matrix LR, weight decay, Newton-Schulz steps for Muon optimizer
  • val_bpb: 0.977 → 0.975 (Δ = 0.002)
Phase 4 (~exp 560-700)
  • Biggest late-stage find: muon_beta2=0.98 (up from 0.95). Higher beta2 smoothed gradient normalization, allowed larger effective steps
  • Tested beta2 in {0.95, 0.96, 0.97, 0.98, 0.99} across 10 clusters in one wave (5 min vs 25 min sequential)
  • val_bpb: 0.975 → 0.974 (Δ = 0.001)
Phase 5 (~exp 700-910)
  • Diminishing returns: combinatorial sweeps over final LR fraction, warmdown ratio, scalar LR, embedding LR
  • Returns dropped below 0.0001 per experiment. Further gains would require new architectural ideas or longer training budgets

Best Configuration Found

ASPECT_RATIO = 96      # model_dim = 8 * 96 = 768
DEPTH = 8              # 8 transformer layers
WINDOW_PATTERN = "SL"  # alternating Sliding + Local attention
TOTAL_BATCH_SIZE = 2**18  # ~524K tokens/step
MATRIX_LR = 0.05       # Muon LR for weight matrices
EMBEDDING_LR = 0.6     # AdamW LR for token embeddings
SCALAR_LR = 0.5        # AdamW LR for residual mixing scalars
ADAM_BETAS = (0.70, 0.95)
WEIGHT_DECAY = 0.08
WARMDOWN_RATIO = 0.6
FINAL_LR_FRAC = 0.05
# Muon: momentum=0.95, ns_steps=5, beta2=0.98

Final configuration after ~910 experiments, achieving val_bpb = 0.974
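One plausible reading of WARMDOWN_RATIO and FINAL_LR_FRAC is the common constant-then-linear-warmdown shape: hold the learning rate flat, then decay it linearly over the last 60% of training down to 5% of base. This is a sketch under that assumption; the actual schedule in train.py may differ:

```python
def lr_at(step, total_steps, base_lr,
          warmdown_ratio=0.6, final_lr_frac=0.05):
    """Constant LR, then a linear warmdown over the final warmdown_ratio
    of training, ending at final_lr_frac * base_lr."""
    warmdown_start = int(total_steps * (1 - warmdown_ratio))
    if step < warmdown_start:
        return base_lr
    progress = (step - warmdown_start) / (total_steps - warmdown_start)
    return base_lr * (1 - progress * (1 - final_lr_frac))
```

At ~1,060 steps on an H100, this would keep the matrix LR at 0.05 for the first ~424 steps and land at 0.0025 on the final step.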

Emergent Hardware Exploitation Strategy

  • Of the 16 clusters, the agent got 13 H100s (80GB VRAM, ~283ms/step) and 3 H200s (141GB VRAM, ~263ms/step). The agent was not told about the performance differences.
  • After a few waves, the agent noticed identical configs scored ~0.005 val_bpb lower on H200 because faster step time meant more training steps in the 5-min budget (H200 runs ~9% more steps than H100)
  • Without prompting, the agent developed a two-tier strategy: screen 10+ hypotheses cheaply on H100s in parallel, promote top 2-3 to H200 for confirmation
  • Rankings didn't always transfer across hardware. FINAL_LR_FRAC=0.03 sometimes beat 0.05 on H100 but consistently lost on H200. Likely explanation: with more steps, the model benefits from keeping LR higher toward end of schedule. The agent's self-invented validation tier caught these discrepancies.
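The self-invented two-tier strategy amounts to a cheap screening pass followed by validation on the faster hardware. A minimal sketch, where `run_on_h100` and `run_on_h200` are hypothetical callables returning val_bpb for a candidate config:

```python
def two_tier(candidates, run_on_h100, run_on_h200, promote_k=3):
    """Screen all candidates cheaply in parallel on H100s, then
    re-validate only the top promote_k on H200 before committing.
    Illustrative sketch of the agent's emergent strategy."""
    screened = sorted(candidates, key=run_on_h100)   # lower bpb = better
    finalists = screened[:promote_k]
    return min(finalists, key=run_on_h200)           # H200 makes the call
```

Re-ranking the finalists on H200 is what caught cases like FINAL_LR_FRAC=0.03, which screened well on H100 but consistently lost under the H200's longer effective training run.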

Infrastructure and Cost

| Component | Detail |
| --- | --- |
| Orchestration | SkyPilot (open-source): launches jobs across clouds and Kubernetes from YAML; includes a skill that teaches coding agents to use it |
| Parallelism mechanism | sky launch provisions clusters; sky exec pipelines experiments on the same cluster with zero idle time; -d flag for detached mode |
| Supported backends | Kubernetes, AWS, GCP, Azure, SLURM (20+ infra backends) |
| Claude Code API cost | ~$9 for the 8-hour session |
| GPU compute cost | 13 H100s × 8 h × ~$2/h + 3 H200s × 8 h × ~$2.3/h = under $300 total |
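The launch-then-exec pipelining pattern can be sketched by building the CLI invocations one wave at a time. The cluster names and YAML path below are illustrative; treat the exact flags as an assumption about the SkyPilot CLI rather than a verified recipe:

```python
def launch_cmd(cluster, yaml_path="experiment.yaml"):
    # First wave: provision a cluster and run the task, detached (-d).
    return ["sky", "launch", "-d", "-c", cluster, yaml_path]

def exec_cmd(cluster, yaml_path="experiment.yaml"):
    # Later waves: queue another experiment on the already-provisioned
    # cluster, skipping setup so waves pipeline with zero idle time.
    return ["sky", "exec", "-d", cluster, yaml_path]

clusters = [f"exp-{i}" for i in range(16)]
wave1 = [launch_cmd(c) for c in clusters]   # provisions 16 clusters
wave2 = [exec_cmd(c) for c in clusters]     # reuses them for wave 2
```

Keeping clusters alive between waves is what makes the ~90 experiments/hour rate possible: each 5-minute wave starts the moment the previous one finishes, with no re-provisioning overhead.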

Getting Started

Quick Setup
☐ Run the one-line setup script: curl -sL https://raw.githubusercontent.com/.../setup.sh | bash
☐ cd autoresearch
☐ Launch: claude "Read instructions.md and start running parallel experiments."
☐ Agent handles everything: reads instructions, fetches SkyPilot skill, provisions clusters, submits experiments, commits winners, loops
Manual Setup
☐ Clone autoresearch and skypilot repos
☐ Copy experiment.yaml and instructions.md into autoresearch/
☐ Run pip install uv && uv sync && uv run prepare.py
☐ Install the SkyPilot skill for your agent
☐ Set infra: in YAML to target backend (k8s, aws, etc.) or let SkyPilot pick cheapest
☐ Works with Claude Code, Codex, or any coding agent that can run shell commands