Patterns for Building LLM-based Systems & Products


Build reliable LLM applications by mastering prompting techniques, RAG pipelines, deterministic workflows, and rigorous evaluation strategies.

Seven Patterns Overview

| Pattern | Purpose | Spectrum Position |
| --- | --- | --- |
| Evals | Measure performance | Closer to data |
| RAG | Add recent, external knowledge | Closer to data |
| Fine-tuning | Get better at specific tasks | Closer to data |
| Caching | Reduce latency and cost | Cost/risk reduction |
| Guardrails | Ensure output quality | Cost/risk reduction |
| Defensive UX | Anticipate and handle errors gracefully | Closer to users |
| Collect user feedback | Build data flywheel | Closer to users |

Evals: Why and How

  • Evals measure system performance and detect regressions; without them teams fly blind or manually inspect outputs with each change
  • Context-dependent metrics are task-specific and require adjustment when repurposed; context-free metrics compare output against gold references and are task-agnostic
  • BLEU is precision-based (n-gram overlap) with a brevity penalty; used for machine translation
  • ROUGE is recall-oriented; used for summarization (variants: ROUGE-N, ROUGE-L, ROUGE-S)
  • BERTScore uses cosine similarity on embeddings and, unlike BLEU/ROUGE, accounts for synonyms
  • MoverScore enables many-to-one matching via constrained optimization
  • Major pitfall: the same model gets significantly different scores depending on the eval implementation. Hugging Face found the original MMLU, HELM, and EleutherAI implementations used different prompts for the same examples. The QLoRA author concluded 'do not work with/report or trust MMLU scores'
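To make the context-free metrics concrete, here is a minimal sketch of BLEU's core ingredient, modified n-gram precision. This is simplified for illustration: real BLEU combines several n-gram orders and applies the brevity penalty.

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of candidate n-grams that also appear in the reference,
    with clipped counts as in BLEU's modified precision."""
    cand, ref = candidate.split(), reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return overlap / total if total else 0.0
```

ROUGE-N is the recall-oriented mirror image: divide the same clipped overlap by the count of reference n-grams instead of candidate n-grams.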

LLM-Based Evaluation Methods

| Method | Approach | Key Finding |
| --- | --- | --- |
| G-Eval | LLM with Chain-of-Thought scores outputs 1-5; token probabilities normalize scores | GPT-4 as evaluator had Spearman correlation 0.514 with human judgments, outperforming traditional metrics |
| Vicuna | GPT-4 rates chatbot answers on helpfulness, relevance, accuracy, and detail across 8 categories | GPT-4 had higher agreement with humans (85%) than humans had among themselves (81%) |
| QLoRA | GPT-4 scores model pairs out of 10 with explanations; three-class rating including ties | Spearman rank correlation 0.55 at the model level vs Mechanical Turk, suggesting LLM evals could replace human evals |
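A G-Eval-style evaluator sends the model the scoring criteria, explicit chain-of-thought evaluation steps, and the item to rate. The prompt wording below is a hypothetical paraphrase for illustration; the actual G-Eval prompts differ.

```python
def build_geval_prompt(criteria, steps, source, summary):
    """Assemble a G-Eval-style prompt: scoring criteria, numbered
    chain-of-thought evaluation steps, then the item to score on 1-5."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
    return (
        f"You will rate a summary on a scale of 1-5 for: {criteria}\n\n"
        f"Evaluation steps:\n{numbered}\n\n"
        f"Source:\n{source}\n\nSummary:\n{summary}\n\nScore (1-5):"
    )

prompt = build_geval_prompt(
    criteria="coherence",
    steps=["Read the source.", "Check the summary's logical flow.", "Assign a score."],
    source="(source document)",
    summary="(candidate summary)",
)
```

The prompt would then be sent to the judge model; G-Eval additionally weights the 1-5 scores by their token probabilities to get finer-grained values.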

LLM Evaluation Biases

| Bias | Description | Mitigation |
| --- | --- | --- |
| Position bias | LLMs favor the first position | Evaluate the same pair twice with the order swapped; same response preferred both times = win, else tie |
| Verbosity bias | Favor longer responses | Ensure compared responses are of similar length |
| Self-enhancement bias | Slight bias toward own answers (GPT-4: 10% higher, Claude-v1: 25% higher) | Don't use the same LLM as the evaluator |
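The position-bias mitigation can be sketched directly: query the judge twice with the order swapped and only award a win when both orders agree. The judge here is a stand-in callable; in practice it wraps an LLM call.

```python
def judge_pair(judge, a, b):
    """Mitigate position bias: ask the judge twice with swapped order.
    Only count a win if the same response is preferred in both orders."""
    first = judge(a, b)   # judge returns "first" or "second"
    second = judge(b, a)
    if first == "first" and second == "second":
        return "a wins"
    if first == "second" and second == "first":
        return "b wins"
    return "tie"

# Stub judge that always prefers whichever response is shown first:
biased = lambda x, y: "first"
judge_pair(biased, "resp A", "resp B")  # -> "tie": the position bias is neutralized
```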

RAG: Core Mechanism

  • RAG reduces hallucination by grounding the model on retrieved context, increasing factuality
  • It is cheaper to update retrieval indices than to continuously pre-train LLMs, which enables access to recent data
  • Updating or removing biased/toxic documents is simpler than fine-tuning the LLM not to generate such content
  • Sequoia survey: 88% of respondents believe retrieval will be a key component of their stack

RAG Historical Development

Early
  • Meta paper: TF-IDF retrieval + BERT context improved open-domain QA
  • Dense Passage Retrieval (DPR): dense embeddings outperform BM25 (65.2% vs 42.9% top-5 accuracy). Two independent BERT encoders trained for dot-product similarity, indexed via FAISS
RAG
  • RAG paper: dense vector retrieval (non-parametric) + pre-trained LLM (parametric, BART 400M). RAG-Sequence uses same document for complete sequence generation. RAG-Token generates each token from different documents with per-token retrieval
FiD
  • Fusion-in-Decoder: processes passages independently in encoder (linear scaling, not quadratic), decoder attends concatenation of all retrieved passages
RETRO
  • Retrieval-Enhanced Transformer: retrieval throughout pre-training, not just inference. Splits input into 64-token chunks, retrieves based on previous chunk. Uses L2 distance on BERT embeddings (departure from cosine/dot product). SCaNN queries 2T token database in 10ms. RETRO-fitting existing models: train <10% weights for 7B model, surpass baseline
Internet-augmented
  • Off-the-shelf Google Search augmenting LLMs. Used Gopher (280B params). Product-of-Experts (PoE) consistently best selection method
HyDE
  • Hypothetical Document Embeddings: LLM generates hypothetical document from query, encoder embeds it, retrieves real documents by similarity. Reframes relevance modeling from representation learning to generation task
CodeT5+
  • RAG applied to code generation: retrieval-augmented mode (append top-1 code sample to encoder input) significantly outperforms generative-only mode
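Of the entries above, HyDE lends itself to a compact sketch. The generator and the bag-of-words "embedding" below are toy stand-ins for an LLM and a trained encoder.

```python
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a trained encoder."""
    return Counter(text.lower().split())

def dot(u, v):
    return sum(u[w] * v[w] for w in u)

def hyde_retrieve(query, docs, generate, k=1):
    """HyDE: generate a hypothetical answer document for the query, embed it,
    and retrieve the real documents most similar to that embedding."""
    q_vec = embed(generate(query))  # generate() is an LLM call in practice
    return sorted(docs, key=lambda d: -dot(embed(d), q_vec))[:k]

docs = ["paris is the capital of france", "bm25 is a ranking function"]
fake_llm = lambda q: "the capital of france is paris"  # stand-in generator
top = hyde_retrieve("what is france's capital?", docs, fake_llm)
```

The point of the reframing: the hypothetical document, even if factually shaky, lands near the right real documents in embedding space.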

RAG Retrieval: Sparse vs Dense vs Hybrid

| Keyword Search (BM25) | Embedding-based Search | Hybrid (Recommended) |
| --- | --- | --- |
| Models simple word frequencies only | Captures semantic/correlation information | Combines both approaches |
| Handles exact names, acronyms, and IDs well (Eugene, RAG, gpt-3.5-turbo) | Falls short on exact names, acronyms, and IDs | Handles both exact-match and semantic queries |
| No synonym/hypernym handling | Handles synonyms and paraphrasing | Full coverage |
| Enables metadata filtering (date, category, ratings) | Pure semantic similarity | Metadata filtering available for downstream ranking |
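One simple way to combine the two retrievers (a common choice, though not prescribed by the source) is reciprocal rank fusion over their ranked lists:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists (e.g. from BM25 and embedding search) by summing
    1 / (k + rank) per document; k=60 is a widely used default."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d2", "d3"]    # keyword-search ranking
dense_hits = ["d3", "d1", "d4"]   # embedding-search ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Because RRF only uses ranks, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.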

Embedding Models and ANN Indices

| Category | Option | Notes |
| --- | --- | --- |
| Embedding | FastText | Open-source, lightweight, 157 languages, no GPU needed; go-to for early proofs of concept |
| Embedding | sentence-transformers | Based on BERT/RoBERTa, 100+ languages, solid baseline |
| Embedding | Instructor models | SOTA: prepend task descriptions for task-specific embeddings. Custom prompts: 'Represent the [domain] [task_type] for [task_objective]:' |
| Embedding | E5 family | Prepend 'passage:' to documents and 'query:' to queries for retrieval; use 'query:' for both in symmetric tasks |
| Embedding | GTE (Alibaba DAMO) | Top of MTEB as of Aug 2023; half the size of the next best model (0.67GB vs 1.34GB) |
| ANN Index | LSH | Hash functions designed so similar items hash identically; supports adding new items without a full reindex |
| ANN Index | FAISS | Quantization + indexing; CPU/GPU, handles billions of vectors |
| ANN Index | HNSW | Hierarchical graph with coarse-to-fine search |
| ANN Index | ScaNN | Coarse quantization then fine-grained search; best recall/latency tradeoff observed, but requires a rebuild for new items |
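All of these ANN indices approximate the same ground truth: exhaustive nearest-neighbor search over the whole index. A brute-force baseline, useful for measuring an index's recall, might look like this sketch:

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def exact_top_k(query, index, k=2):
    """Exhaustive nearest-neighbor search: the exact answer that ANN indices
    (LSH, FAISS, HNSW, ScaNN) trade a little recall to approximate quickly."""
    return sorted(index, key=lambda item: -cosine(query, item[1]))[:k]

index = [("doc_a", [1.0, 0.0]), ("doc_b", [0.9, 0.1]), ("doc_c", [0.0, 1.0])]
nearest = [name for name, _ in exact_top_k([1.0, 0.05], index)]
```

Recall@k for an ANN index is then simply the overlap between its results and this exhaustive top-k.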

Fine-tuning Taxonomy

```mermaid
mindmap
  root((Fine-tuning))
    Continued pre-training
      Same pre-training regime with domain-specific data
    Instruction fine-tuning
      Instruction-output pair examples
      InstructGPT: 13k samples SFT, 33k comparisons reward model, 31k prompts RLHF
    Single-task fine-tuning
      Narrow specific tasks
      Avoids alignment tax
    RLHF
      Human preference pairwise comparisons
      Reward model + PPO
```

Fine-tuning Techniques

| Technique | Mechanism | Efficiency |
| --- | --- | --- |
| Soft prompt tuning | Prepends a trainable tensor to the input embeddings, learned via backpropagation | Trains only the soft prompt parameters |
| Prefix tuning | Prepends trainable parameters to the hidden states of all transformer blocks, freezing the original LM parameters | 0.1% of parameters; outperformed full fine-tuning with limited data and on extrapolation to new topics |
| Adapter | Adds fully connected layers twice per transformer block (after attention, after the FFN) | 3.6% of parameters per task, within 0.4% of full fine-tuning on GLUE |
| LoRA | Learns two low-rank matrices whose product acts as an adapter; based on the finding that pre-trained LMs have low intrinsic dimension | Outperformed full fine-tuning (implicit regularization from the reduced rank) |
| QLoRA | LoRA on a 4-bit quantized model. Innovations: 4-bit NormalFloat, double quantization, paged optimizers | Reduced 65B model fine-tuning from >780GB to 48GB without degrading performance |
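The LoRA mechanism reduces to a small amount of arithmetic: the frozen weight W is augmented by a low-rank product B @ A, and only A and B are trained. A toy sketch with plain lists (a real implementation would use tensors and merge the update into W for inference):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_forward(x, W, A, B, alpha=1.0):
    """LoRA: keep the frozen weight W and add a low-rank update B @ A, so the
    effective weight is W + alpha * (B @ A). Only A and B are trained.
    Shapes: W is d_out x d_in, A is r x d_in, B is d_out x r, x is d_in x 1."""
    delta = matmul(B, A)  # rank-r update, far fewer parameters than W
    W_eff = [[w + alpha * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(W, delta)]
    return matmul(W_eff, x)

# Rank-1 update on a frozen 2x2 identity weight:
y = lora_forward([[2], [3]], [[1, 0], [0, 1]], A=[[1, 0]], B=[[0], [1]])
```

With d_out = d_in = d, full fine-tuning touches d² parameters while LoRA touches only 2rd, which is where the memory savings come from.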

Transfer Learning to Fine-tuning Evolution

ULMFit
  • Established self-supervised pre-training then fine-tuning protocol. AWD-LSTM pre-trained on wikitext-103 (103M words), then LM fine-tuned on task domain, then classifier fine-tuned
BERT
  • Encoder-only. Pre-trained: masked language modeling + next sentence prediction on Wikipedia + BooksCorpus. Fine-tuned with task-specific heads for classification, tagging, QA
GPT
  • Decoder-only. Pre-trained on BooksCorpus via next token prediction. Including language modeling as auxiliary objective helped generalize and converge faster
T5
  • Encoder-decoder. Pre-trained on C4 with denoising objective. All downstream tasks as text-to-text with prefix prompts ('Translate English to German:', 'Summarize:'). Single fine-tuned model across variety of tasks
InstructGPT
  • Expanded single-task to instruction fine-tuning. SFT on demonstrations, reward model on comparisons, PPO optimization. Alignment tax: RLHF led to performance regressions on SQuAD, HellaSwag, and WMT vs the GPT-3 base

Caching Strategy Selection

```mermaid
flowchart TD
    A[New request received] --> B[Generate embedding]
    B --> C{Similar to cached request?}
    C -->|Yes| D[Serve cached response]
    C -->|No| E[Send to LLM]
    E --> F[Serve and cache response]
    G[Choose cache key strategy] --> H{Usage pattern?}
    H -->|Power law distribution| I[Caching effective]
    H -->|Uniformly random| J[Cache ineffective: frequent updates negate benefits]
    I --> K{Input type?}
    K -->|Item IDs| L[Pre-compute per item: product review summaries]
    K -->|Item ID pairs| M[Pre-compute pairs: movie comparisons for popular combinations]
    K -->|Constrained inputs| N[Pre-compute from variables: genre, director, actor]
    K -->|Semantic similarity| O[Risky: 'Mission Impossible 2' may match 'Mission Impossible 3']
```
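A minimal exact-match cache along the lines of the flowchart's safer path; hashing a normalized prompt sidesteps the semantic-similarity risk noted above. Class and function names here are illustrative.

```python
import hashlib

class PromptCache:
    """Exact-match cache keyed on a hash of the normalized prompt. Safer than
    semantic-similarity keys, which can conflate near-duplicates
    ('Mission Impossible 2' vs '3')."""
    def __init__(self):
        self.store = {}

    def _key(self, prompt):
        return hashlib.sha256(prompt.strip().lower().encode()).hexdigest()

    def get_or_call(self, prompt, llm):
        key = self._key(prompt)
        if key not in self.store:
            self.store[key] = llm(prompt)  # only hit the model on a miss
        return self.store[key]

cache = PromptCache()
calls = []
fake_llm = lambda p: calls.append(p) or f"answer to: {p}"  # records each model call
cache.get_or_call("Summarize item 42", fake_llm)
cache.get_or_call("summarize item 42", fake_llm)  # normalization -> cache hit
```

This works best for item-ID-keyed use cases like pre-computed product review summaries, where the same key recurs under a power-law distribution.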

Guardrails Categories

| Category | What It Checks | Examples |
| --- | --- | --- |
| Structural guidance | Output conforms to a specific format | Microsoft Guidance injects structure tokens instead of relying on the LLM to generate the correct format; token healing rewinds one token to avoid tokenization-boundary bugs |
| Syntactic | Output values are within valid ranges | Categorical output within acceptable choice sets; SQL free of syntax errors with columns matching the schema; generated code validity |
| Content safety | No harmful/inappropriate content | String-list comparison, profanity detection models, moderation classifiers on output |
| Semantic/factuality | Output is relevant and accurate to the input | Cosine similarity or fuzzy matching against a reference document; an LLM verifying the summary represents the source |
| Input guardrails | Limit the input types the model responds to | Moderation classifier on input, string matching; Midjourney returns errors for NSFW requests |
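A syntactic guardrail can be as small as a parse-and-validate step. This sketch (the `label` field name and labels are hypothetical) rejects output that is not valid JSON or whose label falls outside the acceptable set; the caller would then retry or fall back.

```python
import json

def check_output(raw, allowed_labels):
    """Syntactic guardrail: output must be valid JSON with a 'label' field
    drawn from an acceptable set; otherwise return None to signal a retry."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None  # structurally invalid: not even parseable JSON
    return parsed if parsed.get("label") in allowed_labels else None

check_output('{"label": "refund"}', {"refund", "exchange"})  # -> accepted
check_output('{"label": "banana"}', {"refund", "exchange"})  # -> None
```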

Defensive UX: Three Guidelines Compared

| Pattern | Microsoft | Google | Apple |
| --- | --- | --- | --- |
| Set expectations | Make clear how well the system does what it does | Be transparent about what it is and isn't capable of | Describe limitations in marketing/feature context |
| Enable dismissal | Support efficient dismissal of undesired AI services (G8) | | |
| Provide attribution | Make clear why the system did what it did | Add context from human sources to help users appraise recommendations | Consider attributions that distinguish results |
| Anchor familiarity | | Anchor on familiarity when onboarding | |
| Overall emphasis | Mental models (HCI academic study) | Training data/model development (engineering culture) | Seamless UX (cultural values/principles) |

Defensive UX: Chat as Interface

  • Higher user effort (chat, search) leads to higher expectations that are harder to meet. Netflix found users have higher recommendation expectations from explicit actions (search) vs passive (scrolling, clicking)
  • Chat offers flexibility but demands effort and lacks adjustment signifiers. Familiar, constrained UI makes navigation easier; chat should be secondary or tertiary option

User Feedback Collection: Explicit vs Implicit

```mermaid
flowchart TD
    A[User Feedback] --> B[Explicit]
    A --> C[Implicit]
    B --> D[Thumbs up/down on responses]
    B --> E[Regenerate response = negative]
    B --> F[Selection from options]
    C --> G[Copilot: accept suggestion = strong positive]
    C --> H[Copilot: accept with tweaks = positive]
    C --> I[Copilot: ignore suggestion = neutral/negative]
    C --> J[Midjourney: generate new images = negative]
    C --> K[Midjourney: tweak variation = positive]
    C --> L[Midjourney: upscale/download = strong positive]
    D --> M[Human preference data for fine-tuning]
    G --> M
    J --> N[Rich comparison data on outputs]
    K --> N
    L --> N
```
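Turning logged signals into preference data might look like the following sketch; the signal names and scores are illustrative, not from the source.

```python
# Hypothetical signal-to-score mapping echoing the Copilot/Midjourney examples
SIGNAL_SCORES = {
    "accept": 2, "accept_with_tweaks": 1, "ignore": 0,  # code suggestions
    "upscale": 2, "vary": 1, "regenerate": -1,          # image generations
}

def preference_pairs(events):
    """Turn logged (output, signal) events into (chosen, rejected) pairs,
    usable as human-preference comparison data for fine-tuning."""
    scored = [(SIGNAL_SCORES[signal], output) for output, signal in events]
    scored.sort(reverse=True)  # highest-scoring outputs first
    return [(scored[i][1], scored[j][1])
            for i in range(len(scored)) for j in range(i + 1, len(scored))
            if scored[i][0] > scored[j][0]]  # strict preference only

events = [("img_a", "upscale"), ("img_b", "regenerate"), ("img_c", "vary")]
pairs = preference_pairs(events)
```

Each pair says "the user preferred this output over that one", exactly the shape a reward model consumes.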

Additional ML Patterns and Community Insights

| Pattern | Description |
| --- | --- |
| Data flywheel | Continuous data collection improves models; better UX drives more usage, which provides more data for evals and fine-tuning: a virtuous cycle |
| Cascade | Break complex tasks apart so the LLM only handles what it excels at (reasoning, eloquent communication); augment with external knowledge for retrieval/ranking |
| Monitoring | Demonstrates the value AI adds, or the lack of it. Example: an LLM-based customer support feature was discontinued after two weeks in production because an A/B test showed 12x the losses vs the human support team |
| Task decomposition | Distinct prompts for subtasks; chaining helps attention and reliability but hurts latency. Splitting rigid output structure from variable response content resolved reliability issues |
| Security concerns | Cache poisoning, input validation, prompt injection, training data provenance, malicious input to AI agent tools, denial of service via LLM stress tests |
| Output consistency | Standardized format (JSON), self-consistency sampling, ensembling multiple model outputs, offloading to proven specialist models |
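Self-consistency sampling, one of the output-consistency tactics above, is just sample-and-majority-vote. The sampler below is a stub standing in for repeated LLM calls:

```python
from collections import Counter

def self_consistent_answer(sample, n=5):
    """Self-consistency: sample the model several times and majority-vote the
    answers, trading extra latency and cost for more stable output."""
    answers = [sample() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Stub sampler cycling through noisy outputs; a real sampler would call the LLM
outputs = iter(["42", "42", "41", "42", "40"])
best = self_consistent_answer(lambda: next(outputs))  # -> "42"
```

In practice the samples come from the same prompt at a nonzero temperature, and voting happens over the final parsed answer rather than the raw text.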