microgpt: A 200-Line Pure Python GPT


Build a complete GPT from scratch in 200 lines of pure Python, understanding every component from autograd to attention to Adam, and see exactly how each piece scales up to production LLMs.

What microgpt contains

| Component | Implementation |
| --- | --- |
| Dataset | 32,000 names from `names.txt`; each name is a document |
| Tokenizer | Character-level: 26 lowercase letters + 1 BOS token = vocab size 27 |
| Autograd | `Value` class wrapping scalars; tracks the computation graph and implements backpropagation via the chain rule |
| Architecture | GPT-2-like: `n_embd=16`, `n_head=4`, `n_layer=1`, `block_size=16`, totaling 4,192 parameters |
| Optimizer | Adam with linear learning-rate decay |
| Training | 1,000 steps, one document per step; loss drops from ~3.3 (random) to ~2.37 |
| Inference | Temperature-controlled sampling (default 0.5); generates plausible new names |
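The tokenizer row above can be made concrete with a minimal sketch. Assumptions not stated in the original: BOS is id 0 and the letters a–z map to ids 1–26 (microgpt's actual id assignment may differ).

```python
# Character-level tokenizer sketch: 26 lowercase letters + BOS = vocab size 27.
# Assumed id layout: BOS = 0, 'a'..'z' = 1..26.
BOS = 0
stoi = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
itos = {i: ch for ch, i in stoi.items()}

def encode(name):
    # Wrap the name with BOS on both sides, as the training loop does.
    return [BOS] + [stoi[ch] for ch in name] + [BOS]

def decode(tokens):
    return "".join(itos[t] for t in tokens if t != BOS)

print(encode("emma"))          # [0, 5, 13, 13, 1, 0]
print(decode(encode("emma")))  # emma
```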

How a token flows through the model

GPT forward pass (stateless function: token + position + KV cache → logits)
☐ Look up token embedding (wte) and position embedding (wpe), add them together
☐ RMSNorm the combined embedding
☐ Attention block: project to Q, K, V; append K,V to cache; compute scaled dot-product attention per head; concatenate heads; project through attn_wo; add residual
☐ MLP block: RMSNorm → linear to 4x dimension → ReLU → linear back down; add residual
☐ Project final hidden state through lm_head to produce 27 logits (one per vocab token)
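The attention step above can be sketched for a single head in pure Python. This is a simplified illustration using plain floats rather than microgpt's `Value` objects; the function name and signature are illustrative, not microgpt's actual code.

```python
import math

def attention_head(q, ks, vs):
    """One head of causal attention at the current position: the query q
    attends over all cached keys ks and values vs (positions 0..t)."""
    d = len(q)
    # Scaled dot-product score against every cached key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in ks]
    # Softmax over the scores (subtract the max for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output is the attention-weighted sum of cached values.
    return [sum(w * v[i] for w, v in zip(weights, vs)) for i in range(d)]

out = attention_head([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
```

Because the query matches the first key more strongly, the output sits closer to the first value than the second; the multi-head version runs this per head and concatenates the results.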

Autograd: how backpropagation works

graph TD
    A["Each Value wraps a scalar (.data)\nand tracks children + local gradients"] --> B["Math ops (add, mul, pow, log, exp, relu)\ncreate new Values recording inputs\nand local derivatives"]
    B --> C["Forward pass builds\ncomputation graph"]
    C --> D["backward() walks graph\nin reverse topological order"]
    D --> E["Chain rule: child.grad += local_grad * parent.grad\n(gradients accumulate via += when graph branches)"]
    E --> F["Every parameter gets .grad:\nhow the loss changes if that parameter is nudged"]

Autograd operations (lego blocks)

| Operation | Forward | Local gradients |
| --- | --- | --- |
| `a + b` | a + b | ∂/∂a = 1, ∂/∂b = 1 |
| `a * b` | a · b | ∂/∂a = b, ∂/∂b = a |
| `a ** n` | aⁿ | ∂/∂a = n · aⁿ⁻¹ |
| `log(a)` | ln(a) | ∂/∂a = 1/a |
| `exp(a)` | eᵃ | ∂/∂a = eᵃ |
| `relu(a)` | max(0, a) | ∂/∂a = 1 if a > 0, else 0 |
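The diagram and table above can be condensed into a minimal sketch of the `Value` class: each node records its children and the local gradients from the table, and `backward()` walks the graph in reverse topological order applying the chain rule with `+=` accumulation. microgpt's real class implements more ops (`pow`, `log`, `exp`, `relu`), but the shape is the same.

```python
class Value:
    """Minimal scalar autograd node (sketch, not microgpt's full class)."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # input Values
        self._local_grads = local_grads  # d(out)/d(child) for each child

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Build reverse topological order, then apply the chain rule:
        # child.grad += local_grad * parent.grad (+= handles branching).
        topo, visited = [], set()
        def build(v):
            if id(v) not in visited:
                visited.add(id(v))
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad

a, b = Value(2.0), Value(3.0)
loss = a * b + a   # d(loss)/da = b + 1 = 4, d(loss)/db = a = 2
loss.backward()
```

Note how `a` appears twice in the graph (`a * b` and `+ a`), so its gradient accumulates across both paths.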

Training loop logic

1. Pick a document, tokenize it, and wrap it with BOS on both sides (e.g. "emma" → [BOS, e, m, m, a, BOS])
2. Feed the tokens through the model one at a time, building the KV cache; at each position compute the cross-entropy loss: -log p(correct next token)
3. Average the per-position losses into a single scalar loss
4. loss.backward() backpropagates through the entire computation graph, giving every parameter a .grad
5. The Adam optimizer updates each parameter using momentum (m) and an adaptive learning rate (v), with bias correction and linear LR decay
6. Reset all gradients to 0 for the next step
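The Adam update and gradient reset at the end of the loop can be sketched as follows. The hyperparameter values here are common defaults, not necessarily microgpt's exact settings, and the `.data`/`.grad` attributes assume `Value`-like parameters.

```python
import math

def adam_step(params, lr, beta1=0.9, beta2=0.95, eps=1e-8, t=1):
    """One Adam update over parameters with .data and .grad (sketch)."""
    for p in params:
        # Lazily attach per-parameter optimizer state.
        if not hasattr(p, "m"):
            p.m, p.v = 0.0, 0.0
        p.m = beta1 * p.m + (1 - beta1) * p.grad        # momentum
        p.v = beta2 * p.v + (1 - beta2) * p.grad ** 2   # adaptive scale
        m_hat = p.m / (1 - beta1 ** t)                  # bias correction
        v_hat = p.v / (1 - beta2 ** t)
        p.data -= lr * m_hat / (math.sqrt(v_hat) + eps)
        p.grad = 0.0                                    # reset for next step
```

Linear LR decay then amounts to passing `lr = base_lr * (1 - step / num_steps)` at each step, so updates shrink smoothly toward zero by the end of training.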

Key architectural concepts

  • Attention is the only place a token at position t looks at tokens 0..t-1; it is a token communication mechanism. Query = "what am I looking for?", Key = "what do I contain?", Value = "what do I offer if selected?"
  • MLP is a two-layer feed-forward network (project up 4x, ReLU, project back) where the model does per-position "thinking"; unlike attention, fully local to time t
  • Residual connections (adding block output back to input) let gradients flow directly through the network, making deeper models trainable
  • The KV cache is used during training here (unusual but conceptually always present); cached keys/values are live Value nodes so backpropagation flows through them
  • Temperature controls sampling randomness: dividing logits by temperature before softmax. Lower sharpens distribution (more conservative), higher flattens it (more diverse)
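The temperature mechanism in the last bullet is a one-liner before softmax; a sketch of the whole sampling step (illustrative, not microgpt's exact code):

```python
import math, random

def sample_next(logits, temperature=0.5):
    """Sample a token id from temperature-scaled logits (sketch)."""
    # Dividing logits by temperature before softmax: lower T sharpens
    # the distribution, higher T flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw from the resulting categorical distribution.
    r = random.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1
```

At very low temperature this approaches greedy decoding (always the argmax token); at high temperature it approaches uniform sampling over the vocabulary.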

microgpt vs production LLMs

| Component | microgpt | Production (e.g. ChatGPT) |
| --- | --- | --- |
| Data | 32K short names | Trillions of tokens of internet text, deduplicated and quality-filtered |
| Tokenizer | Single characters, vocab 27 | Subword BPE, vocab ~100K tokens |
| Autograd | Scalar `Value` objects in pure Python | Tensors on GPUs/TPUs via PyTorch, CUDA kernels like FlashAttention |
| Architecture | 4,192 params, 1 layer, 16-dim embeddings | Hundreds of billions of params, 100+ layers, 10,000+ dims; adds RoPE, GQA, gated activations, MoE |
| Training | 1 document per step, 1,000 steps | Millions of tokens per batch, mixed precision, thousands of GPUs for months |
| Optimization | Adam with linear LR decay | Extensive tuning guided by scaling laws (e.g. Chinchilla); wrong settings waste millions of dollars |
| Post-training | None | SFT on curated conversations, then RL from human/model feedback |
| Inference | Single token at a time in Python | Batching, KV-cache paging (vLLM), speculative decoding, quantization (int8/int4), multi-GPU |

Progression of training files

| File | What it adds |
| --- | --- |
| `train0.py` | Bigram count table; no neural net, no gradients |
| `train1.py` | MLP + manual gradients (numerical & analytic) + SGD |
| `train2.py` | Autograd (`Value` class) replaces manual gradients |
| `train3.py` | Position embeddings + single-head attention + rmsnorm + residuals |
| `train4.py` | Multi-head attention + layer loop: the full GPT architecture |
| `train5.py` (= `train.py`) | Adam optimizer |

FAQ

| Question | Answer |
| --- | --- |
| Does the model "understand" anything? | No magic: it's a math function mapping input tokens to a probability distribution over the next token. Whether this constitutes understanding is philosophical. |
| Why does it work? | Thousands of parameters are nudged each step to lower the loss. Over many steps they capture statistical regularities (consonant patterns, common letter pairs). No explicit rules, just a learned probability distribution. |
| How is this related to ChatGPT? | Same core loop (predict next token, sample, repeat), scaled up enormously, with post-training to make it conversational. |
| What's the deal with hallucinations? | The model samples from statistical plausibility, not truth. microgpt hallucinating "karia" is the same phenomenon as ChatGPT stating a false fact. |
| Can I make it generate better names? | Train longer (increase `num_steps`), make the model bigger (`n_embd`, `n_layer`, `n_head`), or use a larger dataset. These are the same knobs that matter at scale. |