Build a complete GPT from scratch in 200 lines of pure Python, understanding every component from autograd to attention to Adam, and see exactly how each piece scales up to production LLMs.
| Component | Implementation |
|---|---|
| Dataset | 32,000 names from names.txt; each name is treated as one document |
| Tokenizer | Character-level: 26 lowercase letters + 1 BOS token = vocab size 27 |
| Autograd | Value class wrapping scalars, tracks computation graph, implements backpropagation via chain rule |
| Architecture | GPT-2-like: n_embd=16, n_head=4, n_layer=1, block_size=16, totaling 4,192 parameters |
| Optimizer | Adam with linear learning rate decay |
| Training | 1,000 steps, one document per step, loss drops from ~3.3 (random) to ~2.37 |
| Inference | Temperature-controlled sampling (default 0.5), generates plausible new names |
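The temperature-controlled sampling in the last row can be sketched in pure Python. This is an illustrative helper (the function name and plain-list logits are assumptions, not microgpt's exact code):

```python
import math
import random

def sample_next_token(logits, temperature=0.5):
    # Scale logits by 1/temperature: lower temperature sharpens the
    # distribution (safer, more repetitive), higher flattens it (more diverse).
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw a token index from the resulting distribution.
    r = random.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point rounding
```

At temperature near 0 this approaches greedy argmax decoding; at temperature 1 it samples from the model's raw distribution.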
```mermaid
graph TD
    A["Each Value wraps a scalar (.data)\nand tracks children + local gradients"] --> B["Math ops (add, mul, pow, log, exp, relu)\ncreate new Values recording inputs\nand local derivatives"]
    B --> C["Forward pass builds\ncomputation graph"]
    C --> D["backward() walks graph\nin reverse topological order"]
    D --> E["Chain rule: child.grad += local_grad * parent.grad\n(gradients accumulate via += when graph branches)"]
    E --> F["Every parameter gets .grad:\nhow the loss changes if that parameter is nudged"]
```

| Operation | Forward | Local gradients |
|---|---|---|
| a + b | a + b | ∂/∂a = 1, ∂/∂b = 1 |
| a * b | a · b | ∂/∂a = b, ∂/∂b = a |
| a ** n | aⁿ | ∂/∂a = n · aⁿ⁻¹ |
| log(a) | ln(a) | ∂/∂a = 1/a |
| exp(a) | eᵃ | ∂/∂a = eᵃ |
| relu(a) | max(0, a) | ∂/∂a = 1 if a > 0, else 0 |
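The table above can be sketched as a pared-down Value class. This is an illustration with only add and mul (the other ops follow the same pattern), not the full implementation:

```python
class Value:
    """Scalar wrapper that records the computation graph for backprop."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None  # closure that applies local gradients
        self._children = children

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += 1.0 * out.grad   # d(a+b)/da = 1
            other.grad += 1.0 * out.grad  # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0  # d(loss)/d(loss) = 1
        for v in reversed(topo):
            v._backward()
```

For example, with `c = a * b + a`, calling `c.backward()` gives `a.grad = b + 1` and `b.grad = a`; the `+=` accumulation handles `a` appearing twice in the graph.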
Pick a document, tokenize it, wrap with BOS on both sides (e.g. "emma" → [BOS, e, m, m, a, BOS])
↓
Feed tokens through model one at a time, building KV cache; at each position compute cross-entropy loss: -log p(correct next token)
↓
Average per-position losses into single scalar loss
↓
loss.backward() backpropagates through entire computation graph, giving every parameter a .grad
↓
Adam optimizer updates each parameter using momentum (m) and adaptive learning rate (v), with bias correction and linear LR decay
↓
Reset all gradients to 0 for next step
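The last two steps (Adam update plus gradient reset) can be sketched as below. The Param class and hyperparameter values (base_lr, beta1, beta2) are illustrative assumptions, not necessarily the settings in train.py:

```python
class Param:
    def __init__(self, data):
        self.data = data
        self.grad = 0.0
        self.m = 0.0  # first moment: running average of the gradient (momentum)
        self.v = 0.0  # second moment: running average of squared gradient

def adam_step(params, step, num_steps, base_lr=1e-2,
              beta1=0.9, beta2=0.95, eps=1e-8):
    # Linear learning rate decay from base_lr toward 0 over training.
    lr = base_lr * (1 - step / num_steps)
    for p in params:
        # Exponential moving averages of gradient and squared gradient.
        p.m = beta1 * p.m + (1 - beta1) * p.grad
        p.v = beta2 * p.v + (1 - beta2) * p.grad ** 2
        # Bias correction compensates for m and v being initialized at zero.
        m_hat = p.m / (1 - beta1 ** (step + 1))
        v_hat = p.v / (1 - beta2 ** (step + 1))
        # Per-parameter adaptive step: momentum scaled by gradient magnitude.
        p.data -= lr * m_hat / (v_hat ** 0.5 + eps)
        p.grad = 0.0  # reset for the next training step
```

Dividing by the root of the second moment gives each parameter its own effective step size, which is why Adam converges faster here than plain SGD.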
| Component | microgpt | Production (e.g. ChatGPT) |
|---|---|---|
| Data | 32K short names | Trillions of tokens of internet text, deduplicated and quality-filtered |
| Tokenizer | Single characters, vocab 27 | Subword BPE, vocab ~100K tokens |
| Autograd | Scalar Value objects in pure Python | Tensors on GPUs/TPUs via PyTorch, CUDA kernels like FlashAttention |
| Architecture | 4,192 params, 1 layer, 16-dim embeddings | Hundreds of billions of params, 100+ layers, 10,000+ dim; adds RoPE, GQA, gated activations, MoE |
| Training | 1 document per step, 1,000 steps | Millions of tokens per batch, mixed precision, thousands of GPUs for months |
| Optimization | Adam with linear LR decay | Extensive tuning guided by scaling laws (e.g. Chinchilla); wrong settings waste millions of dollars |
| Post-training | None | SFT on curated conversations, then RL from human/model feedback |
| Inference | Single token at a time in Python | Batching, KV cache paging (vLLM), speculative decoding, quantization (int8/int4), multi-GPU |
| File | What it adds |
|---|---|
| train0.py | Bigram count table, no neural net, no gradients |
| train1.py | MLP + manual gradients (numerical & analytic) + SGD |
| train2.py | Autograd (Value class) replaces manual gradients |
| train3.py | Position embeddings + single-head attention + rmsnorm + residuals |
| train4.py | Multi-head attention + layer loop, full GPT architecture |
| train5.py (= train.py) | Adam optimizer |
| Question | Answer |
|---|---|
| Does the model "understand" anything? | No magic: it's a math function mapping input tokens to a probability distribution over the next token. Whether this constitutes understanding is philosophical. |
| Why does it work? | Thousands of parameters are nudged each step to lower the loss. Over many steps they capture statistical regularities (consonant patterns, common letter pairs). No explicit rules, just a learned probability distribution. |
| How is this related to ChatGPT? | Same core loop (predict next token, sample, repeat) scaled up enormously with post-training to make it conversational. |
| What's the deal with hallucinations? | The model samples from statistical plausibility, not truth. microgpt hallucinating "karia" is the same phenomenon as ChatGPT stating a false fact. |
| Can I make it generate better names? | Train longer (increase num_steps), make the model bigger (n_embd, n_layer, n_head), or use a larger dataset. Same knobs that matter at scale. |