Patterns for Building LLM-based Systems & Products
Build reliable LLM applications by mastering prompting techniques, RAG pipelines, deterministic workflows, and rigorous evaluation strategies.
Seven Patterns Overview
| Pattern | Purpose | Spectrum Position |
|---|---|---|
| Evals | Measure performance | Closer to data |
| RAG | Add recent, external knowledge | Closer to data |
| Fine-tuning | Get better at specific tasks | Closer to data |
| Caching | Reduce latency and cost | Cost/risk reduction |
| Guardrails | Ensure output quality | Cost/risk reduction |
| Defensive UX | Anticipate and handle errors gracefully | Closer to users |
| Collect user feedback | Build data flywheel | Closer to users |
Evals: Why and How
- Evals measure system performance and catch regressions; without them, teams fly blind or must manually inspect outputs after every change
- Context-dependent metrics are task-specific and require adjustment when repurposed; context-free metrics compare output against gold references and are task-agnostic
- BLEU is precision-based (n-gram overlap with a brevity penalty) and is used for machine translation. ROUGE is recall-oriented and used for summarization (variants: ROUGE-N, ROUGE-L, ROUGE-S). BERTScore uses cosine similarity on token embeddings and, unlike BLEU/ROUGE, accounts for synonyms. MoverScore enables many-to-one matching via constrained optimization
- Major pitfall: the same model can get significantly different scores depending on the eval implementation. Hugging Face found that the original MMLU implementation, HELM, and EleutherAI's harness used different prompts for the same examples. The QLoRA author concluded: 'do not work with/report or trust MMLU scores'
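The precision/recall distinction between BLEU and ROUGE can be made concrete with a minimal sketch; this omits BLEU's brevity penalty and multi-n-gram averaging, so it illustrates the idea rather than reproducing either metric exactly.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, with counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_overlap(candidate, reference, n=1):
    """Clipped n-gram overlap between candidate and reference."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    return sum(min(count, ref[g]) for g, count in cand.items())

def bleu_precision(candidate, reference, n=1):
    """BLEU-style precision: overlap divided by candidate n-gram count."""
    total = max(len(candidate.split()) - n + 1, 1)
    return ngram_overlap(candidate, reference, n) / total

def rouge_recall(candidate, reference, n=1):
    """ROUGE-N-style recall: overlap divided by reference n-gram count."""
    total = max(len(reference.split()) - n + 1, 1)
    return ngram_overlap(candidate, reference, n) / total
```

A short candidate that copies part of the reference scores perfect precision but low recall, which is why precision suits translation and recall suits summarization coverage.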
LLM-Based Evaluation Methods
| Method | Approach | Key Finding |
|---|---|---|
| G-Eval | LLM with Chain-of-Thought scores outputs 1-5, token probabilities normalize scores | GPT-4 as evaluator had Spearman correlation 0.514 with human judgments, outperforming traditional metrics |
| Vicuna | GPT-4 rates chatbot answers on helpfulness, relevance, accuracy, detail across 8 categories | GPT-4 had higher agreement with humans (85%) than humans had among themselves (81%) |
| QLoRA | GPT-4 scores model pairs out of 10 with explanations, three-class rating including ties | Spearman rank correlation 0.55 at model level vs Mechanical Turk, suggesting LLM evals could replace human evals |
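G-Eval's score normalization can be sketched in a few lines: rather than taking the single most likely rating token, each candidate rating is weighted by the probability the evaluator assigned to it. The `token_probs` dict (rating to probability) is a hypothetical stand-in for the evaluator's log-probabilities.

```python
def geval_score(token_probs):
    """Probability-weighted rating, as in G-Eval: weight each candidate
    rating (e.g. 1-5) by the probability the evaluator LLM assigned to
    its token, yielding a finer-grained continuous score."""
    total = sum(token_probs.values())
    return sum(score * p for score, p in token_probs.items()) / total
```

For example, an evaluator that puts most of its mass on 4 but some on 3 and 5 yields a score near 4 rather than a hard 4, which reduces ties between outputs.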
LLM Evaluation Biases
| Bias | Description | Mitigation |
|---|---|---|
| Position bias | LLMs favor first position | Evaluate same pair twice with swapped order; same preferred both times = win, else tie |
| Verbosity bias | Favor longer responses | Ensure comparison responses are similar length |
| Self-enhancement bias | Slight bias toward own answers (GPT-4: 10% higher, Claude-v1: 25% higher) | Don't use same LLM for evaluation |
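The position-bias mitigation from the table can be sketched as a small wrapper; `judge(x, y)` is an assumed pairwise-judge callable returning `'first'`, `'second'`, or `'tie'`.

```python
def debiased_preference(judge, a, b):
    """Run the pairwise judge twice with the order swapped.
    Only declare a winner if the same response is preferred in
    both orderings; otherwise call it a tie."""
    first_pass = judge(a, b)    # a shown first
    second_pass = judge(b, a)   # b shown first
    if first_pass == "first" and second_pass == "second":
        return "a"
    if first_pass == "second" and second_pass == "first":
        return "b"
    return "tie"
```

A purely position-biased judge (always picks whatever is shown first) collapses to ties under this scheme, while a judge with a genuine preference keeps it.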
RAG: Core Mechanism
- RAG reduces hallucination by grounding the model on retrieved context, increasing factuality
- It is cheaper to keep retrieval indices up to date than to continuously pre-train LLMs, enabling access to recent data
- Updating or removing biased/toxic documents is simpler than fine-tuning an LLM not to generate such content
- Sequoia survey: 88% of respondents believe retrieval will be a key component of the stack
RAG Historical Development
Early
- Meta paper: TF-IDF retrieval + BERT context improved open-domain QA
- Dense Passage Retrieval (DPR): dense embeddings outperform BM25 (65.2% vs 42.9% top-5 accuracy). Two independent BERT encoders trained for dot-product similarity, indexed via FAISS
RAG
- RAG paper: dense vector retrieval (non-parametric) + pre-trained LLM (parametric, BART 400M). RAG-Sequence uses same document for complete sequence generation. RAG-Token generates each token from different documents with per-token retrieval
FiD
- Fusion-in-Decoder: processes passages independently in encoder (linear scaling, not quadratic), decoder attends concatenation of all retrieved passages
RETRO
- Retrieval-Enhanced Transformer: retrieval throughout pre-training, not just inference. Splits input into 64-token chunks, retrieves based on previous chunk. Uses L2 distance on BERT embeddings (departure from cosine/dot product). SCaNN queries 2T token database in 10ms. RETRO-fitting existing models: train <10% weights for 7B model, surpass baseline
Internet-augmented
- Off-the-shelf Google Search augmenting LLMs. Used Gopher (280B params). Product-of-Experts (PoE) consistently best selection method
HyDE
- Hypothetical Document Embeddings: LLM generates hypothetical document from query, encoder embeds it, retrieves real documents by similarity. Reframes relevance modeling from representation learning to generation task
CodeT5+
- RAG applied to code generation: retrieval-augmented mode (append top-1 code sample to encoder input) significantly outperforms generative-only mode
RAG Retrieval: Sparse vs Dense vs Hybrid
| Keyword Search (BM25) | Embedding-based Search | Hybrid (Recommended) |
|---|---|---|
| Models simple word frequencies only | Captures semantic/correlation information | Combines both approaches |
| Handles exact names, acronyms, IDs well (Eugene, RAG, gpt-3.5-turbo) | Falls short on exact names, acronyms, IDs | Handles both exact-match and semantic queries |
| No synonym/hypernym handling | Handles synonyms and paraphrasing | Full coverage |
| Enables metadata filtering (date, category, ratings) | Pure semantic similarity | Metadata filtering available for downstream ranking |
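One common way to combine keyword and embedding rankings, reciprocal rank fusion, can be sketched in a few lines. RRF is an assumption here (the article recommends hybrid retrieval without prescribing a fusion method); each document scores the sum of 1/(k + rank) across the input rankings, with k = 60 being the conventional constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc IDs (e.g. one from BM25,
    one from embedding search) into a single ranking. Documents that
    rank highly in several lists float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, it sidesteps the problem of BM25 scores and cosine similarities living on incomparable scales.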
Embedding Models and ANN Indices
| Category | Option | Notes |
|---|---|---|
| Embedding | FastText | Open-source, lightweight, 157 languages, no GPU needed. Go-to for early proofs of concept |
| Embedding | sentence-transformers | Based on BERT/RoBERTa, 100+ languages, solid baseline |
| Embedding | Instructor models | SOTA: prepend task descriptions for task-specific embeddings. Custom prompts: 'Represent the [domain] [task_type] for [task_objective]:' |
| Embedding | E5 family | Prepend 'passage:' to documents and 'query:' to queries for retrieval; 'query:' for symmetric tasks |
| Embedding | GTE (Alibaba DAMO) | Top of MTEB as of Aug 1; half the size of the next best model (0.67GB vs 1.34GB) |
| ANN Index | LSH | Hash functions where similar items hash identically. Supports adding new items without a full reindex |
| ANN Index | FAISS | Quantization + indexing. CPU/GPU, handles billions of vectors |
| ANN Index | HNSW | Hierarchical graph with coarse-to-fine search |
| ANN Index | ScaNN | Coarse quantization then fine-grained search. Best recall/latency tradeoff observed, but requires rebuild for new items |
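The LSH row can be made concrete with a random-hyperplane sketch: the hash of a vector is the sign pattern of its dot products with a set of random hyperplanes, so vectors with high cosine similarity tend to land in the same bucket. This is a toy illustration, not any library's actual index.

```python
import random

def make_lsh_hasher(dim, n_planes, seed=0):
    """Build a random-hyperplane LSH function for `dim`-dimensional
    vectors. Each of the n_planes hyperplanes contributes one bit:
    which side of the plane the vector falls on."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

    def hash_vector(v):
        return tuple(
            1 if sum(p_i * v_i for p_i, v_i in zip(plane, v)) >= 0 else 0
            for plane in planes
        )
    return hash_vector
```

Scaling a vector never changes its bucket (signs are scale-invariant), which is exactly the cosine-similarity behavior you want; new items can be hashed and inserted without touching existing buckets, matching the table's note about LSH supporting incremental adds.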
Fine-tuning Taxonomy
mindmap
  root((Fine-tuning))
    Continued pre-training
      Same pre-training regime with domain-specific data
    Instruction fine-tuning
      Instruction-output pair examples
      InstructGPT: 13k samples SFT, 33k comparisons reward model, 31k prompts RLHF
    Single-task fine-tuning
      Narrow specific tasks
      Avoids alignment tax
    RLHF
      Human preference pairwise comparisons
      Reward model + PPO
Fine-tuning Techniques
| Technique | Mechanism | Efficiency |
|---|---|---|
| Soft prompt tuning | Prepends trainable tensor to input embeddings, learned via backpropagation | Trains only soft prompt parameters |
| Prefix tuning | Prepends trainable parameters to all transformer block hidden states, freezes original LM params | 0.1% of parameters. Outperformed full fine-tuning in limited data and new topic extrapolation |
| Adapter | Adds fully connected layers twice per transformer block (after attention, after FFN) | 3.6% parameters per task, within 0.4% of full fine-tuning on GLUE |
| LoRA | Two low-rank matrices product as adapters. Based on finding that pre-trained LMs have low intrinsic dimension | Outperformed full fine-tuning (implicit regularization from reduced rank) |
| QLoRA | LoRA on 4-bit quantized model. Innovations: 4-bit NormalFloat, double quantization, paged optimizers | Reduced 65B model fine-tuning from >780GB to 48GB without degrading performance |
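The LoRA row reduces to one equation: y = xW + (alpha/r)(xA)B, where W is the frozen pre-trained weight and only the low-rank factors A (d_in x r) and B (r x d_out) are trained. A minimal sketch with plain lists (real implementations use GPU tensors):

```python
def matmul_vec(x, M):
    """Row vector x (length m) times matrix M (m x n) -> length n."""
    return [sum(x[i] * M[i][j] for i in range(len(x)))
            for j in range(len(M[0]))]

def lora_forward(x, W, A, B, alpha=16):
    """LoRA forward pass: frozen base projection xW plus the trained
    low-rank update (xA)B, scaled by alpha / r where r is the rank."""
    r = len(B)                                   # inner rank of the update
    base = matmul_vec(x, W)                      # frozen path
    update = matmul_vec(matmul_vec(x, A), B)     # low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]
```

Because A and B together hold only r * (d_in + d_out) parameters instead of d_in * d_out, the trainable footprint shrinks dramatically while W never changes, which is also why the adapters can be merged into W after training.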
Transfer Learning to Fine-tuning Evolution
ULMFit
- Established self-supervised pre-training then fine-tuning protocol. AWD-LSTM pre-trained on wikitext-103 (103M words), then LM fine-tuned on task domain, then classifier fine-tuned
BERT
- Encoder-only. Pre-trained: masked language modeling + next sentence prediction on Wikipedia + BooksCorpus. Fine-tuned with task-specific heads for classification, tagging, QA
GPT
- Decoder-only. Pre-trained on BooksCorpus via next-token prediction. Including language modeling as an auxiliary objective during fine-tuning helped the model generalize and converge faster
T5
- Encoder-decoder. Pre-trained on C4 with denoising objective. All downstream tasks as text-to-text with prefix prompts ('Translate English to German:', 'Summarize:'). Single fine-tuned model across variety of tasks
InstructGPT
- Expanded single-task to instruction fine-tuning. SFT on demonstrations, reward model on comparisons, PPO optimization. Alignment tax: RLHF led to performance regressions on SQuAD, HellaSwag, and WMT relative to the GPT-3 base model
Caching Strategy Selection
flowchart TD
A[New request received] --> B[Generate embedding]
B --> C{Similar to cached request?}
C -->|Yes| D[Serve cached response]
C -->|No| E[Send to LLM]
E --> F[Serve and cache response]
G[Choose cache key strategy] --> H{Usage pattern?}
H -->|Power law distribution| I[Caching effective]
H -->|Uniformly random| J[Cache ineffective: frequent updates negate benefits]
I --> K{Input type?}
K -->|Item IDs| L[Pre-compute per item: product review summaries]
K -->|Item ID pairs| M[Pre-compute pairs: movie comparisons for popular combinations]
K -->|Constrained inputs| N[Pre-compute from variables: genre, director, actor]
K -->|Semantic similarity| O[Risky: 'Mission Impossible 2' may match 'Mission Impossible 3']
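The semantic-caching branch of the flowchart can be sketched as follows. The bag-of-words `embed` is a toy stand-in for a real embedding model, and the threshold encodes the risk the flowchart flags: set it too low and 'Mission Impossible 2' matches 'Mission Impossible 3'.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Serve a cached response when a new request is similar enough to
    a previously seen one; otherwise call the model and cache the result."""
    def __init__(self, llm, threshold=0.9):
        self.llm, self.threshold, self.entries = llm, threshold, []

    def query(self, text):
        emb = embed(text)
        for cached_emb, response in self.entries:
            if cosine(emb, cached_emb) >= self.threshold:
                return response, True        # cache hit
        response = self.llm(text)
        self.entries.append((emb, response))
        return response, False               # cache miss
```

This only pays off under the power-law usage pattern from the flowchart; with uniformly random inputs, the hit rate stays near zero and every request still reaches the LLM.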
Guardrails Categories
| Category | What It Checks | Examples |
|---|---|---|
| Structural guidance | Output conforms to specific format | Microsoft Guidance injects structure tokens instead of relying on LLM to generate correct format. Token healing rewinds one token to avoid tokenization boundary bugs |
| Syntactic | Output values within valid ranges | Categorical output in acceptable choice sets, SQL syntax error-free with columns matching schema, generated code validity |
| Content safety | No harmful/inappropriate content | String list comparison, profanity detection models, moderation classifiers on output |
| Semantic/factuality | Output relevant and accurate to input | Cosine similarity or fuzzy matching against reference document, LLM verifying summary represents source |
| Input guardrails | Limit input types model responds to | Moderation classifier on input, string matching. Midjourney returns errors for NSFW requests |
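The syntactic and content-safety rows boil down to cheap deterministic checks run on model output before it ships; a minimal sketch of three such checks (the function names and key lists are illustrative, not from any library):

```python
import json

def check_categorical(output, allowed):
    """Syntactic guardrail: a categorical output must be in the
    acceptable choice set."""
    return output.strip() in allowed

def check_json_schema(output, required_keys):
    """Structural guardrail: output must parse as a JSON object and
    contain every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in required_keys)

def check_blocklist(output, blocked_terms):
    """Naive content-safety guardrail via string-list comparison; real
    systems add profanity models or moderation classifiers on top."""
    lowered = output.lower()
    return not any(term in lowered for term in blocked_terms)
```

On failure, the usual options are to retry the generation, fall back to a safe canned response, or route to a human, depending on how costly a bad output is.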
Defensive UX: Three Guidelines Compared
| Pattern | Microsoft | Google | Apple |
|---|---|---|---|
| Set expectations | Make clear how well the system can do what it can do | Be transparent about what it is and isn't capable of | Describe limitations in marketing/feature context |
| Enable dismissal | Support efficient dismissal of undesired AI services (G8) | — | — |
| Provide attribution | Make clear why the system did what it did | Add human source context to help users appraise recommendations | Consider attributions distinguishing results |
| Anchor familiarity | — | Anchor on familiarity when onboarding | — |
| Overall emphasis | Mental models (HCI academic study) | Training data/model development (engineering culture) | Seamless UX (cultural values/principles) |
Defensive UX: Chat as Interface
- Higher user effort (chat, search) leads to higher expectations that are harder to meet. Netflix found users have higher recommendation expectations from explicit actions (search) vs passive (scrolling, clicking)
- Chat offers flexibility but demands effort and lacks adjustment signifiers. Familiar, constrained UI makes navigation easier; chat should be secondary or tertiary option
User Feedback Collection: Explicit vs Implicit
flowchart TD
A[User Feedback] --> B[Explicit]
A --> C[Implicit]
B --> D[Thumbs up/down on responses]
B --> E[Regenerate response = negative]
B --> F[Selection from options]
C --> G[Copilot: accept suggestion = strong positive]
C --> H[Copilot: accept with tweaks = positive]
C --> I[Copilot: ignore suggestion = neutral/negative]
C --> J[Midjourney: generate new images = negative]
C --> K[Midjourney: tweak variation = positive]
C --> L[Midjourney: upscale/download = strong positive]
D --> M[Human preference data for fine-tuning]
G --> M
J --> N[Rich comparison data on outputs]
K --> N
L --> N
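The implicit branches of the diagram only become useful once actions are turned into numbers; a sketch of that mapping, where the specific weights are illustrative assumptions rather than values from the article:

```python
# Illustrative weights for Copilot-style implicit feedback signals;
# the exact values are assumptions, chosen to preserve the ordering
# strong positive > positive > neutral/negative > negative.
SIGNAL_WEIGHTS = {
    "accept": 1.0,               # suggestion accepted as-is
    "accept_with_tweaks": 0.5,   # accepted after edits
    "ignore": -0.25,             # suggestion skipped
    "regenerate": -1.0,          # explicit do-over
}

def score_response(events):
    """Aggregate a response's feedback events into one numeric label,
    usable as weak preference data for evals or fine-tuning."""
    return sum(SIGNAL_WEIGHTS.get(event, 0.0) for event in events)
```

Pairs of responses with different aggregate scores are exactly the comparison data the flywheel feeds back into reward modeling or fine-tuning.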
Additional ML Patterns and Community Insights
| Pattern | Description |
|---|---|
| Data flywheel | Continuous data collection improves models; better UX drives more usage, which provides more data for evals and fine-tuning: a virtuous cycle |
| Cascade | Break complex tasks apart so the LLM handles only what it excels at (reasoning, eloquent communication); augment with external knowledge for retrieval/ranking |
| Monitoring | Demonstrates the value AI adds, or the lack of it. Example: an LLM-based customer support feature was discontinued after two weeks in production because an A/B test showed 12x the losses vs the human support team |
| Task decomposition | Distinct prompts for subtasks; chaining helps attention and reliability but hurts latency. Splitting rigid output structure from variable response content resolved reliability issues |
| Security concerns | Cache poisoning, input validation, prompt injection, training data provenance, malicious input to AI agent tools, denial of service via LLM stress tests |
| Output consistency | Standardized format (JSON), self-consistency sampling, ensembling multiple model outputs, offloading to proven specialist models |
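Self-consistency sampling from the last row can be sketched as a majority vote over repeated generations; `sample_fn` is an assumed callable that returns one model answer per call (in practice, sampled at nonzero temperature).

```python
from collections import Counter

def self_consistency(sample_fn, n=5):
    """Draw n independent samples and return the majority answer,
    trading extra inference cost for more consistent output."""
    samples = [sample_fn() for _ in range(n)]
    answer, _ = Counter(samples).most_common(1)[0]
    return answer
```

This works best when answers are normalized to a canonical form first (e.g. the parsed JSON field rather than the raw string), so superficially different generations can actually vote together.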