Patterns for Building LLM-based Systems & Products


Build reliable LLM applications by mastering prompting techniques, RAG pipelines, deterministic workflows, and rigorous evaluation strategies.

Seven Patterns Overview

| Pattern | Purpose | Spectrum Position |
| --- | --- | --- |
| Evals | Measure performance | Closer to data |
| RAG | Add recent, external knowledge | Closer to data |
| Fine-tuning | Get better at specific tasks | Closer to data |
| Caching | Reduce latency and cost | Cost/risk reduction |
| Guardrails | Ensure output quality | Cost/risk reduction |
| Defensive UX | Anticipate and handle errors gracefully | Closer to users |
| Collect user feedback | Build data flywheel | Closer to users |

Evals: Why and How

  • Evals measure system performance and detect regressions; without them teams fly blind or manually inspect outputs with each change
  • Context-dependent metrics are task-specific and require adjustment when repurposed; context-free metrics compare output against gold references and are task-agnostic
  • BLEU is precision-based (n-gram overlap) with a brevity penalty; used for machine translation
  • ROUGE is recall-oriented; used for summarization (variants: ROUGE-N, ROUGE-L, ROUGE-S)
  • BERTScore uses cosine similarity on embeddings and, unlike BLEU/ROUGE, accounts for synonyms
  • MoverScore enables many-to-one matching via constrained optimization
  • Major pitfall: the same model gets significantly different scores depending on the eval implementation. Hugging Face found the original MMLU, HELM, and EleutherAI implementations used different prompts for the same examples. The QLoRA author concluded 'do not work with/report or trust MMLU scores'
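To make the context-free metrics concrete, here is a minimal sketch of BLEU's core ingredient, modified n-gram precision. This is simplified for illustration: real BLEU combines several n-gram orders and applies the brevity penalty.

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of candidate n-grams that also appear in the reference,
    with clipped counts as in BLEU's modified precision."""
    cand, ref = candidate.split(), reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return overlap / total if total else 0.0
```

ROUGE-N is the recall-oriented mirror image: divide the same clipped overlap by the count of reference n-grams instead of candidate n-grams.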

LLM-Based Evaluation Methods

| Method | Approach | Key Finding |
| --- | --- | --- |
| G-Eval | LLM with Chain-of-Thought scores outputs 1-5; token probabilities normalize scores | GPT-4 as evaluator had Spearman correlation 0.514 with human judgments, outperforming traditional metrics |
| Vicuna | GPT-4 rates chatbot answers on helpfulness, relevance, accuracy, and detail across 8 categories | GPT-4 had higher agreement with humans (85%) than humans had among themselves (81%) |
| QLoRA | GPT-4 scores model pairs out of 10 with explanations; three-class rating including ties | Spearman rank correlation 0.55 at the model level vs Mechanical Turk, suggesting LLM evals could replace human evals |
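A G-Eval-style evaluator sends the model the scoring criteria, explicit chain-of-thought evaluation steps, and the item to rate. The prompt wording below is a hypothetical paraphrase for illustration; the actual G-Eval prompts differ.

```python
def build_geval_prompt(criteria, steps, source, summary):
    """Assemble a G-Eval-style prompt: scoring criteria, numbered
    chain-of-thought evaluation steps, then the item to score on 1-5."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
    return (
        f"You will rate a summary on a scale of 1-5 for: {criteria}\n\n"
        f"Evaluation steps:\n{numbered}\n\n"
        f"Source:\n{source}\n\nSummary:\n{summary}\n\nScore (1-5):"
    )

prompt = build_geval_prompt(
    criteria="coherence",
    steps=["Read the source.", "Check the summary's logical flow.", "Assign a score."],
    source="(source document)",
    summary="(candidate summary)",
)
```

The prompt would then be sent to the judge model; G-Eval additionally weights the 1-5 scores by their token probabilities to get finer-grained values.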

LLM Evaluation Biases

| Bias | Description | Mitigation |
| --- | --- | --- |
| Position bias | LLMs favor the first position | Evaluate the same pair twice with the order swapped; same response preferred both times = win, else tie |
| Verbosity bias | Favor longer responses | Ensure compared responses are of similar length |
| Self-enhancement bias | Slight bias toward own answers (GPT-4: 10% higher, Claude-v1: 25% higher) | Don't use the same LLM as the evaluator |
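The position-bias mitigation can be sketched directly: query the judge twice with the order swapped and only award a win when both orders agree. The judge here is a stand-in callable; in practice it wraps an LLM call.

```python
def judge_pair(judge, a, b):
    """Mitigate position bias: ask the judge twice with swapped order.
    Only count a win if the same response is preferred in both orders."""
    first = judge(a, b)   # judge returns "first" or "second"
    second = judge(b, a)
    if first == "first" and second == "second":
        return "a wins"
    if first == "second" and second == "first":
        return "b wins"
    return "tie"

# Stub judge that always prefers whichever response is shown first:
biased = lambda x, y: "first"
judge_pair(biased, "resp A", "resp B")  # -> "tie": the position bias is neutralized
```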

RAG: Core Mechanism

  • RAG reduces hallucination by grounding the model on retrieved context, increasing factuality
  • It is cheaper to update retrieval indices than to continuously pre-train LLMs, which enables access to recent data
  • Updating or removing biased/toxic documents is simpler than fine-tuning the LLM not to generate such content
  • Sequoia survey: 88% of respondents believe retrieval will be a key component of their stack

RAG Historical Development

Early
  • Meta paper: TF-IDF retrieval + BERT context improved open-domain QA
  • Dense Passage Retrieval (DPR): dense embeddings outperform BM25 (65.2% vs 42.9% top-5 accuracy). Two independent BERT encoders trained for dot-product similarity, indexed via FAISS
RAG
  • RAG paper: dense vector retrieval (non-parametric) + pre-trained LLM (parametric, BART 400M). RAG-Sequence uses same document for complete sequence generation. RAG-Token generates each token from different documents with per-token retrieval
FiD
  • Fusion-in-Decoder: processes passages independently in encoder (linear scaling, not quadratic), decoder attends concatenation of all retrieved passages
RETRO
  • Retrieval-Enhanced Transformer: retrieval throughout pre-training, not just inference. Splits input into 64-token chunks, retrieves based on previous chunk. Uses L2 distance on BERT embeddings (departure from cosine/dot product). SCaNN queries 2T token database in 10ms. RETRO-fitting existing models: train <10% weights for 7B model, surpass baseline
Internet-augmented
  • Off-the-shelf Google Search augmenting LLMs. Used Gopher (280B params). Product-of-Experts (PoE) consistently best selection method
HyDE
  • Hypothetical Document Embeddings: LLM generates hypothetical document from query, encoder embeds it, retrieves real documents by similarity. Reframes relevance modeling from representation learning to generation task
CodeT5+
  • RAG applied to code generation: retrieval-augmented mode (append top-1 code sample to encoder input) significantly outperforms generative-only mode
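Of the entries above, HyDE lends itself to a compact sketch. The generator and the bag-of-words "embedding" below are toy stand-ins for an LLM and a trained encoder.

```python
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a trained encoder."""
    return Counter(text.lower().split())

def dot(u, v):
    return sum(u[w] * v[w] for w in u)

def hyde_retrieve(query, docs, generate, k=1):
    """HyDE: generate a hypothetical answer document for the query, embed it,
    and retrieve the real documents most similar to that embedding."""
    q_vec = embed(generate(query))  # generate() is an LLM call in practice
    return sorted(docs, key=lambda d: -dot(embed(d), q_vec))[:k]

docs = ["paris is the capital of france", "bm25 is a ranking function"]
fake_llm = lambda q: "the capital of france is paris"  # stand-in generator
top = hyde_retrieve("what is france's capital?", docs, fake_llm)
```

The point of the reframing: the hypothetical document, even if factually shaky, lands near the right real documents in embedding space.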

RAG Retrieval: Sparse vs Dense vs Hybrid

| Keyword Search (BM25) | Embedding-based Search | Hybrid (Recommended) |
| --- | --- | --- |
| Models simple word frequencies only | Captures semantic/correlation information | Combines both approaches |
| Handles exact names, acronyms, and IDs well (Eugene, RAG, gpt-3.5-turbo) | Falls short on exact names, acronyms, and IDs | Handles both exact-match and semantic queries |
| No synonym/hypernym handling | Handles synonyms and paraphrasing | Full coverage |
| Enables metadata filtering (date, category, ratings) | Pure semantic similarity | Metadata filtering available for downstream ranking |
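One simple way to combine the two retrievers (a common choice, though not prescribed by the source) is reciprocal rank fusion over their ranked lists:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists (e.g. from BM25 and embedding search) by summing
    1 / (k + rank) per document; k=60 is a widely used default."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d2", "d3"]    # keyword-search ranking
dense_hits = ["d3", "d1", "d4"]   # embedding-search ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Because RRF only uses ranks, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.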

Embedding Models and ANN Indices

| Category | Option | Notes |
| --- | --- | --- |
| Embedding | FastText | Open-source, lightweight, 157 languages, no GPU needed; go-to for early proofs of concept |
| Embedding | sentence-transformers | Based on BERT/RoBERTa, 100+ languages, solid baseline |
| Embedding | Instructor models | SOTA: prepend task descriptions for task-specific embeddings. Custom prompts: 'Represent the [domain] [task_type] for [task_objective]:' |
| Embedding | E5 family | Prepend 'passage:' to documents and 'query:' to queries for retrieval; use 'query:' for both in symmetric tasks |
| Embedding | GTE (Alibaba DAMO) | Top of MTEB as of Aug 2023; half the size of the next best model (0.67GB vs 1.34GB) |
| ANN Index | LSH | Hash functions designed so similar items hash identically; supports adding new items without a full reindex |
| ANN Index | FAISS | Quantization + indexing; CPU/GPU, handles billions of vectors |
| ANN Index | HNSW | Hierarchical graph with coarse-to-fine search |
| ANN Index | ScaNN | Coarse quantization then fine-grained search; best recall/latency tradeoff observed, but requires a rebuild for new items |
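All of these ANN indices approximate the same ground truth: exhaustive nearest-neighbor search over the whole index. A brute-force baseline, useful for measuring an index's recall, might look like this sketch:

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def exact_top_k(query, index, k=2):
    """Exhaustive nearest-neighbor search: the exact answer that ANN indices
    (LSH, FAISS, HNSW, ScaNN) trade a little recall to approximate quickly."""
    return sorted(index, key=lambda item: -cosine(query, item[1]))[:k]

index = [("doc_a", [1.0, 0.0]), ("doc_b", [0.9, 0.1]), ("doc_c", [0.0, 1.0])]
nearest = [name for name, _ in exact_top_k([1.0, 0.05], index)]
```

Recall@k for an ANN index is then simply the overlap between its results and this exhaustive top-k.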

Fine-tuning Taxonomy

```mermaid
mindmap
  root((Fine-tuning))
    Continued pre-training
      Same pre-training regime with domain-specific data
    Instruction fine-tuning
      Instruction-output pair examples
      InstructGPT: 13k samples SFT, 33k comparisons reward model, 31k prompts RLHF
    Single-task fine-tuning
      Narrow specific tasks
      Avoids alignment tax
    RLHF
      Human preference pairwise comparisons
      Reward model + PPO
```

Fine-tuning Techniques

| Technique | Mechanism | Efficiency |
| --- | --- | --- |
| Soft prompt tuning | Prepends a trainable tensor to the input embeddings, learned via backpropagation | Trains only the soft prompt parameters |
| Prefix tuning | Prepends trainable parameters to the hidden states of all transformer blocks, freezing the original LM parameters | 0.1% of parameters; outperformed full fine-tuning with limited data and on extrapolation to new topics |
| Adapter | Adds fully connected layers twice per transformer block (after attention, after the FFN) | 3.6% of parameters per task, within 0.4% of full fine-tuning on GLUE |
| LoRA | Learns two low-rank matrices whose product acts as an adapter; based on the finding that pre-trained LMs have low intrinsic dimension | Outperformed full fine-tuning (implicit regularization from the reduced rank) |
| QLoRA | LoRA on a 4-bit quantized model. Innovations: 4-bit NormalFloat, double quantization, paged optimizers | Reduced 65B model fine-tuning from >780GB to 48GB without degrading performance |
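The LoRA mechanism reduces to a small amount of arithmetic: the frozen weight W is augmented by a low-rank product B @ A, and only A and B are trained. A toy sketch with plain lists (a real implementation would use tensors and merge the update into W for inference):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_forward(x, W, A, B, alpha=1.0):
    """LoRA: keep the frozen weight W and add a low-rank update B @ A, so the
    effective weight is W + alpha * (B @ A). Only A and B are trained.
    Shapes: W is d_out x d_in, A is r x d_in, B is d_out x r, x is d_in x 1."""
    delta = matmul(B, A)  # rank-r update, far fewer parameters than W
    W_eff = [[w + alpha * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(W, delta)]
    return matmul(W_eff, x)

# Rank-1 update on a frozen 2x2 identity weight:
y = lora_forward([[2], [3]], [[1, 0], [0, 1]], A=[[1, 0]], B=[[0], [1]])
```

With d_out = d_in = d, full fine-tuning touches d² parameters while LoRA touches only 2rd, which is where the memory savings come from.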

Transfer Learning to Fine-tuning Evolution

ULMFit
  • Established self-supervised pre-training then fine-tuning protocol. AWD-LSTM pre-trained on wikitext-103 (103M words), then LM fine-tuned on task domain, then classifier fine-tuned
BERT
  • Encoder-only. Pre-trained: masked language modeling + next sentence prediction on Wikipedia + BooksCorpus. Fine-tuned with task-specific heads for classification, tagging, QA
GPT
  • Decoder-only. Pre-trained on BooksCorpus via next token prediction. Including language modeling as auxiliary objective helped generalize and converge faster
T5
  • Encoder-decoder. Pre-trained on C4 with denoising objective. All downstream tasks as text-to-text with prefix prompts ('Translate English to German:', 'Summarize:'). Single fine-tuned model across variety of tasks
InstructGPT
  • Expanded single-task to instruction fine-tuning. SFT on demonstrations, reward model on comparisons, PPO optimization. Alignment tax: RLHF led to performance regressions on SQuAD, HellaSwag, and WMT vs the GPT-3 base

Caching Strategy Selection

```mermaid
flowchart TD
    A[New request received] --> B[Generate embedding]
    B --> C{Similar to cached request?}
    C -->|Yes| D[Serve cached response]
    C -->|No| E[Send to LLM]
    E --> F[Serve and cache response]
    G[Choose cache key strategy] --> H{Usage pattern?}
    H -->|Power law distribution| I[Caching effective]
    H -->|Uniformly random| J[Cache ineffective: frequent updates negate benefits]
    I --> K{Input type?}
    K -->|Item IDs| L[Pre-compute per item: product review summaries]
    K -->|Item ID pairs| M[Pre-compute pairs: movie comparisons for popular combinations]
    K -->|Constrained inputs| N[Pre-compute from variables: genre, director, actor]
    K -->|Semantic similarity| O[Risky: 'Mission Impossible 2' may match 'Mission Impossible 3']
```
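A minimal exact-match cache along the lines of the flowchart's safer path; hashing a normalized prompt sidesteps the semantic-similarity risk noted above. Class and function names here are illustrative.

```python
import hashlib

class PromptCache:
    """Exact-match cache keyed on a hash of the normalized prompt. Safer than
    semantic-similarity keys, which can conflate near-duplicates
    ('Mission Impossible 2' vs '3')."""
    def __init__(self):
        self.store = {}

    def _key(self, prompt):
        return hashlib.sha256(prompt.strip().lower().encode()).hexdigest()

    def get_or_call(self, prompt, llm):
        key = self._key(prompt)
        if key not in self.store:
            self.store[key] = llm(prompt)  # only hit the model on a miss
        return self.store[key]

cache = PromptCache()
calls = []
fake_llm = lambda p: calls.append(p) or f"answer to: {p}"  # records each model call
cache.get_or_call("Summarize item 42", fake_llm)
cache.get_or_call("summarize item 42", fake_llm)  # normalization -> cache hit
```

This works best for item-ID-keyed use cases like pre-computed product review summaries, where the same key recurs under a power-law distribution.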

Guardrails Categories

| Category | What It Checks | Examples |
| --- | --- | --- |
| Structural guidance | Output conforms to a specific format | Microsoft Guidance injects structure tokens instead of relying on the LLM to generate the correct format; token healing rewinds one token to avoid tokenization-boundary bugs |
| Syntactic | Output values are within valid ranges | Categorical output within acceptable choice sets; SQL free of syntax errors with columns matching the schema; generated code validity |
| Content safety | No harmful/inappropriate content | String-list comparison, profanity detection models, moderation classifiers on output |
| Semantic/factuality | Output is relevant and accurate to the input | Cosine similarity or fuzzy matching against a reference document; an LLM verifying the summary represents the source |
| Input guardrails | Limit the input types the model responds to | Moderation classifier on input, string matching; Midjourney returns errors for NSFW requests |
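A syntactic guardrail can be as small as a parse-and-validate step. This sketch (the `label` field name and labels are hypothetical) rejects output that is not valid JSON or whose label falls outside the acceptable set; the caller would then retry or fall back.

```python
import json

def check_output(raw, allowed_labels):
    """Syntactic guardrail: output must be valid JSON with a 'label' field
    drawn from an acceptable set; otherwise return None to signal a retry."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None  # structurally invalid: not even parseable JSON
    return parsed if parsed.get("label") in allowed_labels else None

check_output('{"label": "refund"}', {"refund", "exchange"})  # -> accepted
check_output('{"label": "banana"}', {"refund", "exchange"})  # -> None
```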

Defensive UX: Three Guidelines Compared

| Pattern | Microsoft | Google | Apple |
| --- | --- | --- | --- |
| Set expectations | Make clear how well the system does what it does | Be transparent about what it is and isn't capable of | Describe limitations in marketing/feature context |
| Enable dismissal | Support efficient dismissal of undesired AI services (G8) | | |
| Provide attribution | Make clear why the system did what it did | Add context from human sources to help users appraise recommendations | Consider attributions that distinguish results |
| Anchor familiarity | | Anchor on familiarity when onboarding | |
| Overall emphasis | Mental models (HCI academic study) | Training data/model development (engineering culture) | Seamless UX (cultural values/principles) |

Defensive UX: Chat as Interface

  • Higher user effort (chat, search) leads to higher expectations that are harder to meet. Netflix found users have higher recommendation expectations from explicit actions (search) vs passive (scrolling, clicking)
  • Chat offers flexibility but demands effort and lacks adjustment signifiers. Familiar, constrained UI makes navigation easier; chat should be secondary or tertiary option

User Feedback Collection: Explicit vs Implicit

```mermaid
flowchart TD
    A[User Feedback] --> B[Explicit]
    A --> C[Implicit]
    B --> D[Thumbs up/down on responses]
    B --> E[Regenerate response = negative]
    B --> F[Selection from options]
    C --> G[Copilot: accept suggestion = strong positive]
    C --> H[Copilot: accept with tweaks = positive]
    C --> I[Copilot: ignore suggestion = neutral/negative]
    C --> J[Midjourney: generate new images = negative]
    C --> K[Midjourney: tweak variation = positive]
    C --> L[Midjourney: upscale/download = strong positive]
    D --> M[Human preference data for fine-tuning]
    G --> M
    J --> N[Rich comparison data on outputs]
    K --> N
    L --> N
```
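Turning logged signals into preference data might look like the following sketch; the signal names and scores are illustrative, not from the source.

```python
# Hypothetical signal-to-score mapping echoing the Copilot/Midjourney examples
SIGNAL_SCORES = {
    "accept": 2, "accept_with_tweaks": 1, "ignore": 0,  # code suggestions
    "upscale": 2, "vary": 1, "regenerate": -1,          # image generations
}

def preference_pairs(events):
    """Turn logged (output, signal) events into (chosen, rejected) pairs,
    usable as human-preference comparison data for fine-tuning."""
    scored = [(SIGNAL_SCORES[signal], output) for output, signal in events]
    scored.sort(reverse=True)  # highest-scoring outputs first
    return [(scored[i][1], scored[j][1])
            for i in range(len(scored)) for j in range(i + 1, len(scored))
            if scored[i][0] > scored[j][0]]  # strict preference only

events = [("img_a", "upscale"), ("img_b", "regenerate"), ("img_c", "vary")]
pairs = preference_pairs(events)
```

Each pair says "the user preferred this output over that one", exactly the shape a reward model consumes.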

Additional ML Patterns and Community Insights

| Pattern | Description |
| --- | --- |
| Data flywheel | Continuous data collection improves models; better UX drives more usage, which provides more data for evals and fine-tuning: a virtuous cycle |
| Cascade | Break complex tasks apart so the LLM only handles what it excels at (reasoning, eloquent communication); augment with external knowledge for retrieval/ranking |
| Monitoring | Demonstrates the value AI adds, or the lack of it. Example: an LLM-based customer support feature was discontinued after two weeks in production because an A/B test showed 12x the losses vs the human support team |
| Task decomposition | Distinct prompts for subtasks; chaining helps attention and reliability but hurts latency. Splitting rigid output structure from variable response content resolved reliability issues |
| Security concerns | Cache poisoning, input validation, prompt injection, training data provenance, malicious input to AI agent tools, denial of service via LLM stress tests |
| Output consistency | Standardized format (JSON), self-consistency sampling, ensembling multiple model outputs, offloading to proven specialist models |
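Self-consistency sampling, one of the output-consistency tactics above, is just sample-and-majority-vote. The sampler below is a stub standing in for repeated LLM calls:

```python
from collections import Counter

def self_consistent_answer(sample, n=5):
    """Self-consistency: sample the model several times and majority-vote the
    answers, trading extra latency and cost for more stable output."""
    answers = [sample() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Stub sampler cycling through noisy outputs; a real sampler would call the LLM
outputs = iter(["42", "42", "41", "42", "40"])
best = self_consistent_answer(lambda: next(outputs))  # -> "42"
```

In practice the samples come from the same prompt at a nonzero temperature, and voting happens over the final parsed answer rather than the raw text.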