microgpt: A 200-Line Pure Python GPT


Build a complete GPT from scratch in 200 lines of pure Python, understanding every component from autograd to attention to Adam, and see exactly how each piece scales up to production LLMs.

What microgpt contains

| Component | Implementation |
| --- | --- |
| Dataset | 32,000 names from `names.txt`; each name is a document |
| Tokenizer | Character-level: 26 lowercase letters + 1 BOS token = vocab size 27 |
| Autograd | `Value` class wrapping scalars; tracks the computation graph and implements backpropagation via the chain rule |
| Architecture | GPT-2-like: `n_embd=16`, `n_head=4`, `n_layer=1`, `block_size=16`, totaling 4,192 parameters |
| Optimizer | Adam with linear learning-rate decay |
| Training | 1,000 steps, one document per step; loss drops from ~3.3 (random) to ~2.37 |
| Inference | Temperature-controlled sampling (default 0.5); generates plausible new names |
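The tokenizer row above can be made concrete with a minimal sketch. Assumptions not stated in the original: BOS is id 0 and the letters a–z map to ids 1–26 (microgpt's actual id assignment may differ).

```python
# Character-level tokenizer sketch: 26 lowercase letters + BOS = vocab size 27.
# Assumed id layout: BOS = 0, 'a'..'z' = 1..26.
BOS = 0
stoi = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
itos = {i: ch for ch, i in stoi.items()}

def encode(name):
    # Wrap the name with BOS on both sides, as the training loop does.
    return [BOS] + [stoi[ch] for ch in name] + [BOS]

def decode(tokens):
    return "".join(itos[t] for t in tokens if t != BOS)

print(encode("emma"))          # [0, 5, 13, 13, 1, 0]
print(decode(encode("emma")))  # emma
```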

How a token flows through the model

GPT forward pass (stateless function: token + position + KV cache → logits)
☐ Look up token embedding (wte) and position embedding (wpe), add them together
☐ RMSNorm the combined embedding
☐ Attention block: project to Q, K, V; append K,V to cache; compute scaled dot-product attention per head; concatenate heads; project through attn_wo; add residual
☐ MLP block: RMSNorm → linear to 4x dimension → ReLU → linear back down; add residual
☐ Project final hidden state through lm_head to produce 27 logits (one per vocab token)
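The attention step above can be sketched for a single head in pure Python. This is a simplified illustration using plain floats rather than microgpt's `Value` objects; the function name and signature are illustrative, not microgpt's actual code.

```python
import math

def attention_head(q, ks, vs):
    """One head of causal attention at the current position: the query q
    attends over all cached keys ks and values vs (positions 0..t)."""
    d = len(q)
    # Scaled dot-product score against every cached key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in ks]
    # Softmax over the scores (subtract the max for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output is the attention-weighted sum of cached values.
    return [sum(w * v[i] for w, v in zip(weights, vs)) for i in range(d)]

out = attention_head([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
```

Because the query matches the first key more strongly, the output sits closer to the first value than the second; the multi-head version runs this per head and concatenates the results.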

Autograd: how backpropagation works

graph TD
    A["Each Value wraps a scalar (.data)\nand tracks children + local gradients"] --> B["Math ops (add, mul, pow, log, exp, relu)\ncreate new Values recording inputs\nand local derivatives"]
    B --> C["Forward pass builds\ncomputation graph"]
    C --> D["backward() walks graph\nin reverse topological order"]
    D --> E["Chain rule: child.grad += local_grad * parent.grad\n(gradients accumulate via += when graph branches)"]
    E --> F["Every parameter gets .grad:\nhow the loss changes if that parameter is nudged"]

Autograd operations (lego blocks)

| Operation | Forward | Local gradients |
| --- | --- | --- |
| `a + b` | a + b | ∂/∂a = 1, ∂/∂b = 1 |
| `a * b` | a · b | ∂/∂a = b, ∂/∂b = a |
| `a ** n` | aⁿ | ∂/∂a = n · aⁿ⁻¹ |
| `log(a)` | ln(a) | ∂/∂a = 1/a |
| `exp(a)` | eᵃ | ∂/∂a = eᵃ |
| `relu(a)` | max(0, a) | ∂/∂a = 1 if a > 0, else 0 |
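The diagram and table above can be condensed into a minimal sketch of the `Value` class: each node records its children and the local gradients from the table, and `backward()` walks the graph in reverse topological order applying the chain rule with `+=` accumulation. microgpt's real class implements more ops (`pow`, `log`, `exp`, `relu`), but the shape is the same.

```python
class Value:
    """Minimal scalar autograd node (sketch, not microgpt's full class)."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # input Values
        self._local_grads = local_grads  # d(out)/d(child) for each child

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Build reverse topological order, then apply the chain rule:
        # child.grad += local_grad * parent.grad (+= handles branching).
        topo, visited = [], set()
        def build(v):
            if id(v) not in visited:
                visited.add(id(v))
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad

a, b = Value(2.0), Value(3.0)
loss = a * b + a   # d(loss)/da = b + 1 = 4, d(loss)/db = a = 2
loss.backward()
```

Note how `a` appears twice in the graph (`a * b` and `+ a`), so its gradient accumulates across both paths.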

Training loop logic

1. Pick a document, tokenize it, and wrap it with BOS on both sides (e.g. "emma" → [BOS, e, m, m, a, BOS])
2. Feed the tokens through the model one at a time, building the KV cache; at each position compute the cross-entropy loss: -log p(correct next token)
3. Average the per-position losses into a single scalar loss
4. loss.backward() backpropagates through the entire computation graph, giving every parameter a .grad
5. The Adam optimizer updates each parameter using momentum (m) and an adaptive learning rate (v), with bias correction and linear LR decay
6. Reset all gradients to 0 for the next step
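The Adam update and gradient reset at the end of the loop can be sketched as follows. The hyperparameter values here are common defaults, not necessarily microgpt's exact settings, and the `.data`/`.grad` attributes assume `Value`-like parameters.

```python
import math

def adam_step(params, lr, beta1=0.9, beta2=0.95, eps=1e-8, t=1):
    """One Adam update over parameters with .data and .grad (sketch)."""
    for p in params:
        # Lazily attach per-parameter optimizer state.
        if not hasattr(p, "m"):
            p.m, p.v = 0.0, 0.0
        p.m = beta1 * p.m + (1 - beta1) * p.grad        # momentum
        p.v = beta2 * p.v + (1 - beta2) * p.grad ** 2   # adaptive scale
        m_hat = p.m / (1 - beta1 ** t)                  # bias correction
        v_hat = p.v / (1 - beta2 ** t)
        p.data -= lr * m_hat / (math.sqrt(v_hat) + eps)
        p.grad = 0.0                                    # reset for next step
```

Linear LR decay then amounts to passing `lr = base_lr * (1 - step / num_steps)` at each step, so updates shrink smoothly toward zero by the end of training.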

Key architectural concepts

  • Attention is the only place a token at position t looks at tokens 0..t-1; it is a token communication mechanism. Query = "what am I looking for?", Key = "what do I contain?", Value = "what do I offer if selected?"
  • MLP is a two-layer feed-forward network (project up 4x, ReLU, project back) where the model does per-position "thinking"; unlike attention, fully local to time t
  • Residual connections (adding block output back to input) let gradients flow directly through the network, making deeper models trainable
  • The KV cache is used during training here (unusual but conceptually always present); cached keys/values are live Value nodes so backpropagation flows through them
  • Temperature controls sampling randomness: dividing logits by temperature before softmax. Lower sharpens distribution (more conservative), higher flattens it (more diverse)
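The temperature mechanism in the last bullet is a one-liner before softmax; a sketch of the whole sampling step (illustrative, not microgpt's exact code):

```python
import math, random

def sample_next(logits, temperature=0.5):
    """Sample a token id from temperature-scaled logits (sketch)."""
    # Dividing logits by temperature before softmax: lower T sharpens
    # the distribution, higher T flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw from the resulting categorical distribution.
    r = random.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1
```

At very low temperature this approaches greedy decoding (always the argmax token); at high temperature it approaches uniform sampling over the vocabulary.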

microgpt vs production LLMs

| Component | microgpt | Production (e.g. ChatGPT) |
| --- | --- | --- |
| Data | 32K short names | Trillions of tokens of internet text, deduplicated and quality-filtered |
| Tokenizer | Single characters, vocab 27 | Subword BPE, vocab ~100K tokens |
| Autograd | Scalar `Value` objects in pure Python | Tensors on GPUs/TPUs via PyTorch, CUDA kernels like FlashAttention |
| Architecture | 4,192 params, 1 layer, 16-dim embeddings | Hundreds of billions of params, 100+ layers, 10,000+ dims; adds RoPE, GQA, gated activations, MoE |
| Training | 1 document per step, 1,000 steps | Millions of tokens per batch, mixed precision, thousands of GPUs for months |
| Optimization | Adam with linear LR decay | Extensive tuning guided by scaling laws (e.g. Chinchilla); wrong settings waste millions of dollars |
| Post-training | None | SFT on curated conversations, then RL from human/model feedback |
| Inference | Single token at a time in Python | Batching, KV-cache paging (vLLM), speculative decoding, quantization (int8/int4), multi-GPU |

Progression of training files

| File | What it adds |
| --- | --- |
| `train0.py` | Bigram count table; no neural net, no gradients |
| `train1.py` | MLP + manual gradients (numerical & analytic) + SGD |
| `train2.py` | Autograd (`Value` class) replaces manual gradients |
| `train3.py` | Position embeddings + single-head attention + rmsnorm + residuals |
| `train4.py` | Multi-head attention + layer loop: the full GPT architecture |
| `train5.py` (= `train.py`) | Adam optimizer |

FAQ

| Question | Answer |
| --- | --- |
| Does the model "understand" anything? | No magic: it's a math function mapping input tokens to a probability distribution over the next token. Whether this constitutes understanding is philosophical. |
| Why does it work? | Thousands of parameters are nudged each step to lower the loss. Over many steps they capture statistical regularities (consonant patterns, common letter pairs). No explicit rules, just a learned probability distribution. |
| How is this related to ChatGPT? | Same core loop (predict next token, sample, repeat), scaled up enormously, with post-training to make it conversational. |
| What's the deal with hallucinations? | The model samples from statistical plausibility, not truth. microgpt hallucinating "karia" is the same phenomenon as ChatGPT stating a false fact. |
| Can I make it generate better names? | Train longer (increase `num_steps`), make the model bigger (`n_embd`, `n_layer`, `n_head`), or use a larger dataset. These are the same knobs that matter at scale. |