---
name: llm-serving-patterns
description: LLM inference infrastructure, serving frameworks (vLLM, TGI, TensorRT-LLM), quantization techniques, batching strategies, and streaming response patterns. Use when designing LLM serving infrastructure, optimizing inference latency, or scaling LLM deployments.
allowed-tools: Read, Glob, Grep
---
# LLM Serving Patterns
## When to Use This Skill
Use this skill when:
- Designing LLM inference infrastructure
- Choosing between serving frameworks (vLLM, TGI, TensorRT-LLM)
- Implementing quantization for production deployment
- Optimizing batching and throughput
- Building streaming response systems
- Scaling LLM deployments cost-effectively
**Keywords:** LLM serving, inference, vLLM, TGI, TensorRT-LLM, quantization, INT8, INT4, FP16, batching, continuous batching, streaming, SSE, WebSocket, KV cache, PagedAttention, speculative decoding
## LLM Serving Architecture Overview
```text
┌─────────────────────────────────────────────────────────────────────┐
│ LLM Serving Stack │
├─────────────────────────────────────────────────────────────────────┤
│ Clients (API, Chat UI, Agents) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Load Balancer / API Gateway │ │
│ │ • Rate limiting • Authentication • Request routing │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Inference Server │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ Request │ │ Batching │ │ KV Cache │ │ │
│ │ │ Queue │──▶│ Engine │──▶│ Management │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ Model Execution Engine │ │ │
│ │ │ • Tensor operations • Attention • Token sampling │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ GPU/TPU Cluster │ │
│ │ • Model sharding • Tensor parallelism • Pipeline parallel │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
```
## Serving Framework Comparison
| Framework | Strengths | Best For | Considerations |
| --------- | --------- | -------- | -------------- |
| **vLLM** | PagedAttention, high throughput, continuous batching | General LLM serving, high concurrency | Python-native, active community |
| **TGI (Text Generation Inference)** | Production-ready, Hugging Face integration | Enterprise deployment, HF models | Rust backend, Docker-first |
| **TensorRT-LLM** | NVIDIA optimization, lowest latency | NVIDIA GPUs, latency-critical | NVIDIA-only, complex setup |
| **Triton Inference Server** | Multi-model, multi-framework | Heterogeneous model serving | Enterprise complexity |
| **Ollama** | Simple local deployment | Development, edge deployment | Limited scaling features |
| **llama.cpp** | CPU inference, quantization | Resource-constrained, edge | C++ integration required |
### Framework Selection Decision Tree
```text
Need lowest latency on NVIDIA GPUs?
├── Yes → TensorRT-LLM
└── No
└── Need high throughput with many concurrent users?
├── Yes → vLLM (PagedAttention)
└── No
└── Need enterprise features + HF integration?
├── Yes → TGI
└── No
└── Simple local/edge deployment?
├── Yes → Ollama or llama.cpp
└── No → vLLM (general purpose)
```
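As a concrete starting point, the sketch below uses vLLM's offline `LLM` API (the general-purpose branch of the tree). The model ID and engine arguments are illustrative assumptions; exact argument names and good values vary by vLLM version and hardware.
```python
# Minimal offline-inference sketch with vLLM (assumes `pip install vllm` and GPU access).
# Model name and sampling values are illustrative, not prescriptive.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any HF model id; illustrative assumption
    gpu_memory_utilization=0.90,               # fraction of GPU memory for weights + KV cache
    max_num_seqs=64,                           # upper bound on sequences batched together
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)

# vLLM schedules these prompts with continuous batching internally.
outputs = llm.generate(["Explain PagedAttention in one sentence."], sampling)
for out in outputs:
    print(out.outputs[0].text)
```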
## Quantization Techniques
### Precision Levels
| Precision | Bits | Memory Reduction | Quality Impact | Use Case |
| --------- | ---- | ---------------- | -------------- | -------- |
| FP32 | 32 | Baseline | None | Training, reference |
| FP16/BF16 | 16 | 2x | Minimal | Standard serving |
| INT8 | 8 | 4x | Low | Production serving |
| INT4 | 4 | 8x | Moderate | Resource-constrained |
| INT2 | 2 | 16x | Significant | Experimental |
### Quantization Methods
| Method | Description | Quality | Speed |
| ------ | ----------- | ------- | ----- |
| **PTQ (Post-Training Quantization)** | Quantize after training, no retraining | Good | Fast to apply |
| **QAT (Quantization-Aware Training)** | Simulate quantization during training | Better | Requires training |
| **GPTQ** | One-shot weight quantization | Very good | Moderate |
| **AWQ (Activation-aware Weight Quantization)** | Preserves salient weights | Excellent | Moderate |
| **GGUF/GGML** | llama.cpp format, CPU-optimized | Good | Very fast inference |
| **SmoothQuant** | Migrates difficulty to weights | Excellent | Moderate |
### Quantization Selection
```text
Quality vs. Efficiency Trade-off:
Quality ────────────────────────────────────────────▶ Efficiency
│ │
│ FP32 FP16 INT8+AWQ INT8+GPTQ INT4 INT2 │
│ ○───────○────────○──────────○──────────○──────○ │
│ │ │ │ │ │ │ │
│ Best Great Good Good Fair Poor │
│ │
```
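For example, vLLM can serve pre-quantized GPTQ or AWQ checkpoints directly. A minimal sketch, assuming a published AWQ checkpoint on the Hugging Face Hub (the repo ID below is illustrative):
```python
# Loading a pre-quantized (AWQ) checkpoint with vLLM; GPTQ checkpoints work the
# same way with quantization="gptq". Repo id is an illustrative assumption.
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-13B-chat-AWQ",  # illustrative repo id
    quantization="awq",                     # quantization scheme the checkpoint was exported with
    dtype="float16",                        # activations stay in FP16 (weight-only quantization)
)
```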
## Batching Strategies
### Static Batching
```text
Request 1: [tokens: 100] ─┐
Request 2: [tokens: 50] ─┼──▶ [Batch: pad to 100] ──▶ Process ──▶ All complete
Request 3: [tokens: 80] ─┘
Problem: Short requests wait for long ones (head-of-line blocking)
```
### Continuous Batching (Preferred)
```text
Time ──────────────────────────────────────────────────────────▶
Req 1: [████████████████████████████████] ──▶ Complete
Req 2: [████████████] ──▶ Complete ──▶ Req 4 starts [████████████████]
Req 3: [████████████████████] ──▶ Complete ──▶ Req 5 starts [████████]
• New requests join batch as others complete
• No padding waste
• Optimal GPU utilization
```
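A toy simulation of the difference, assuming each decode step costs one time unit regardless of batch size (roughly true while the GPU is memory-bandwidth bound):
```python
# Toy comparison of static vs. continuous batching (pure Python, no GPU).
def static_batching(lengths):
    """All requests enter together; the whole batch runs until the longest finishes."""
    return {i: max(lengths) for i in range(len(lengths))}

def continuous_batching(lengths):
    """Each request leaves the batch the moment it emits its last token."""
    return {i: n for i, n in enumerate(lengths)}

lengths = [100, 50, 80]              # output tokens per request
print(static_batching(lengths))      # {0: 100, 1: 100, 2: 100} -> short requests wait
print(continuous_batching(lengths))  # {0: 100, 1: 50, 2: 80}   -> no head-of-line blocking
```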
### Batching Parameters
| Parameter | Description | Trade-off |
| --------- | ----------- | --------- |
| `max_batch_size` | Maximum concurrent requests | Memory vs. throughput |
| `max_waiting_tokens` | Tokens before forcing batch | Latency vs. throughput |
| `max_num_seqs` | Maximum sequences in batch | Memory vs. concurrency |
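
In vLLM, the closest equivalents of these knobs are engine arguments such as `max_num_seqs` and `max_num_batched_tokens`; a minimal sketch with illustrative values:
```python
# vLLM equivalents of the batching knobs above (a sketch; argument names and
# sensible values depend on the vLLM version and the GPU).
from vllm import LLM

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # illustrative model id
    max_num_seqs=128,               # max sequences scheduled per iteration (memory vs. concurrency)
    max_num_batched_tokens=8192,    # cap on tokens processed per scheduler step (latency vs. throughput)
)
```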
## KV Cache Management
### The KV Cache Problem
```text
Attention: softmax(Q × K^T / √d) × V
For each new token generated:
 • Attention must be computed against the K and V of ALL previous tokens
 • Caching K and V avoids recomputing them, but the cache grows with sequence length
• Memory: O(batch_size × seq_len × num_layers × hidden_dim)
Example (70B model, 4K context):
• KV cache per request: ~8GB
• 10 concurrent requests: ~80GB GPU memory
```
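A back-of-envelope estimator for this memory, assuming an FP16 cache and Llama-2-70B-like dimensions without grouped-query attention; the ~8 GB figure above is the same kind of rough estimate:
```python
# KV cache size estimator (a sketch; real servers add allocator and paging overhead).
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    # 2x for K and V, stored per layer, per head, per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# 70B-class model WITHOUT grouped-query attention (64 KV heads), FP16 cache, 4K context:
per_request = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128,
                             seq_len=4096, batch_size=1)
print(f"{per_request / 1e9:.1f} GB per request")                 # ~10.7 GB with these assumptions
print(f"{10 * per_request / 1e9:.0f} GB for 10 concurrent requests")
```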
### PagedAttention (vLLM Innovation)
```text
Traditional KV Cache:
┌──────────────────────────────────────────┐
│ Request 1 KV Cache (contiguous, fixed) │ ← Wastes memory
├──────────────────────────────────────────┤
│ Request 2 KV Cache (contiguous, fixed) │
├──────────────────────────────────────────┤
│ FRAGMENTED/WASTED SPACE │
└──────────────────────────────────────────┘
PagedAttention:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ R1 │ R2 │ R1 │ R3 │ R2 │ R1 │ R3 │ R2 │ ← Pages allocated on demand
└────┴────┴────┴────┴────┴────┴────┴────┘
• Non-contiguous memory allocation
• Near-zero memory waste
• 2-4x higher throughput
```
### KV Cache Optimization Strategies
| Strategy | Description | Memory Savings |
| -------- | ----------- | -------------- |
| **PagedAttention** | Virtual memory for KV cache | ~50% reduction |
| **Prefix Caching** | Reuse KV cache for common prefixes | System prompt: 100% |
| **Quantized KV Cache** | INT8/FP8 for KV values | 50-75% reduction |
| **Sliding Window** | Limited attention context | Linear memory |
| **MQA/GQA** | Grouped query attention | Architecture-dependent |
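
For instance, prefix caching can be enabled directly in vLLM; a minimal sketch, assuming a shared system prompt and an illustrative model ID (flag name per recent vLLM versions):
```python
# Automatic prefix caching: the KV cache for a shared prefix is computed once and reused.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

system = "You are a support assistant for ACME Corp. Answer concisely.\n\n"
prompts = [system + "How do I reset my password?",
           system + "Where can I download invoices?"]

# Only the suffix after the shared prefix pays full prefill cost on the second request.
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
```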
## Streaming Response Patterns
### Server-Sent Events (SSE)
```text
Client Server
│ │
│──── GET /v1/chat/completions ──────▶│
│ (stream: true) │
│ │
│◀──── HTTP 200 OK ───────────────────│
│ Content-Type: text/event-stream│
│ │
│◀──── data: {"token": "Hello"} ──────│
│◀──── data: {"token": " world"} ─────│
│◀──── data: {"token": "!"} ──────────│
│◀──── data: [DONE] ──────────────────│
│ │
```
**SSE Benefits:**
- HTTP/1.1 compatible
- Auto-reconnection support
- Simple to implement
- Wide client support
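
A minimal Python consumer for an OpenAI-compatible SSE stream, assuming a local server at an illustrative URL and the usual `data:` / `[DONE]` framing shown above:
```python
# SSE client sketch for an OpenAI-compatible /v1/chat/completions endpoint.
import json
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",   # illustrative endpoint
    json={"model": "my-model", "stream": True,
          "messages": [{"role": "user", "content": "Hello"}]},
    stream=True,
)

for raw in resp.iter_lines():
    if not raw:
        continue
    line = raw.decode("utf-8")
    if not line.startswith("data: "):
        continue
    payload = line[len("data: "):]
    if payload == "[DONE]":                         # end-of-stream sentinel
        break
    chunk = json.loads(payload)
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```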
### WebSocket Streaming
```text
Client Server
│ │
│──── WebSocket Upgrade ─────────────▶│
│◀──── 101 Switching Protocols ───────│
│ │
│──── {"prompt": "Hello"} ───────────▶│
│ │
│◀──── {"token": "Hi"} ───────────────│
│◀──── {"token": " there"} ───────────│
│◀──── {"token": "!"} ────────────────│
│◀──── {"done": true} ────────────────│
│ │
```
**WebSocket Benefits:**
- Bidirectional communication
- Lower latency
- Better for chat applications
- Connection persistence
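
A sketch of a WebSocket streaming endpoint with FastAPI, where `generate_tokens` is a hypothetical stand-in for the real inference engine:
```python
# WebSocket token streaming with FastAPI (a sketch; generate_tokens is a placeholder).
from fastapi import FastAPI, WebSocket

app = FastAPI()

async def generate_tokens(prompt: str):
    # Placeholder: yield tokens from the real inference engine here.
    for tok in ["Hi", " there", "!"]:
        yield tok

@app.websocket("/generate")
async def generate(ws: WebSocket):
    await ws.accept()
    request = await ws.receive_json()            # e.g. {"prompt": "Hello"}
    async for token in generate_tokens(request["prompt"]):
        await ws.send_json({"token": token})
    await ws.send_json({"done": True})
```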
### Streaming Implementation Considerations
| Aspect | SSE | WebSocket |
| ------ | --- | --------- |
| **Reconnection** | Built-in | Manual |
| **Scalability** | Per-request | Connection pool |
| **Load Balancing** | Standard HTTP | Sticky sessions |
| **Firewall/Proxy** | Usually works | May need config |
| **Best For** | One-way streaming | Interactive chat |
## Speculative Decoding
### Concept
```text
Standard Decoding:
Large Model: [T1] → [T2] → [T3] → [T4] → [T5]
10ms 10ms 10ms 10ms 10ms = 50ms total
Speculative Decoding:
Draft Model: [T1, T2, T3, T4, T5] (parallel, 5ms)
│
▼
Large Model: [Verify T1-T5 in one pass] (15ms)
Accept: T1, T2, T3 ✓ Reject: T4, T5 ✗
│
▼
[Generate T4, T5 correctly]
Total: ~25ms (2x speedup if 60% acceptance)
```
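A simplified model of the expected speedup, using the illustrative costs from the diagram (5 ms draft batch, 15 ms verification pass, 10 ms per token for the target model alone):
```python
# Rough speculative-decoding speedup estimate (a sketch of the arithmetic above).
def speculative_speedup(t_target, t_draft_batch, t_verify, k, acceptance):
    tokens_per_round = k * acceptance + 1      # accepted tokens plus one from the verify pass
    cost_per_round = t_draft_batch + t_verify  # draft k tokens, then one verification pass
    baseline = tokens_per_round * t_target     # what the target model alone would spend
    return baseline / cost_per_round

print(speculative_speedup(t_target=10, t_draft_batch=5, t_verify=15,
                          k=5, acceptance=0.6))  # 2.0x under these assumptions
```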
### Speculative Decoding Trade-offs
| Factor | Impact |
| ------ | ------ |
| **Draft model quality** | Higher match rate = more speedup |
| **Draft model size** | Larger = better quality, slower |
| **Speculation depth** | More tokens = higher risk/reward |
| **Verification cost** | Must be < sequential generation |
## Scaling Strategies
### Horizontal Scaling
```text
┌─────────────────────────────────────────────────────────┐
│ Load Balancer │
│ (Round-robin, Least-connections) │
└─────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ vLLM │ │ vLLM │ │ vLLM │
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ (GPU×4) │ │ (GPU×4) │ │ (GPU×4) │
└─────────┘ └─────────┘ └─────────┘
```
### Model Parallelism
| Strategy | Description | Use Case |
| -------- | ----------- | -------- |
| **Tensor Parallelism** | Split layers across GPUs | Single large model |
| **Pipeline Parallelism** | Different layers on different GPUs | Very large models |
| **Data Parallelism** | Same model, different batches | High throughput |
```text
Tensor Parallelism (TP=4):
┌─────────────────────────────────────────┐
│ Layer N │
│ GPU0 │ GPU1 │ GPU2 │ GPU3 │
│ 25% │ 25% │ 25% │ 25% │
└─────────────────────────────────────────┘
Pipeline Parallelism (PP=4):
GPU0: Layers 0-7
GPU1: Layers 8-15
GPU2: Layers 16-23
GPU3: Layers 24-31
```
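In vLLM, tensor parallelism is a single engine argument; a minimal sketch, assuming 4 visible GPUs and an illustrative 70B model:
```python
# Sharding one model across 4 GPUs with tensor parallelism in vLLM.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model id
    tensor_parallel_size=4,                     # split each layer's weights across 4 GPUs
)
```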
## Latency Optimization Checklist
### Pre-deployment
- [ ] Choose appropriate quantization (INT8 for production)
- [ ] Enable continuous batching
- [ ] Configure KV cache size appropriately
- [ ] Set optimal batch size for hardware
- [ ] Enable prefix caching for system prompts
### Runtime
- [ ] Monitor GPU memory utilization
- [ ] Track p50/p95/p99 latencies
- [ ] Measure time-to-first-token (TTFT)
- [ ] Monitor tokens-per-second (TPS)
- [ ] Set appropriate timeouts
### Infrastructure
- [ ] Use fastest available interconnect (NVLink, InfiniBand)
- [ ] Minimize network hops
- [ ] Place inference close to users (edge)
- [ ] Consider dedicated inference hardware
## Cost Optimization
### Cost Drivers
| Factor | Impact | Optimization |
| ------ | ------ | ------------ |
| **GPU hours** | Highest | Quantization, batching |
| **Memory** | High | PagedAttention, KV cache optimization |
| **Network** | Medium | Response compression, edge deployment |
| **Storage** | Low | Model deduplication |
### Cost Estimation Formula
```text
Monthly Cost =
(Requests/month) × (Avg tokens/request) × (GPU-seconds/token) × ($/GPU-hour)
─────────────────────────────────────────────────────────────────────────────
3600
Example:
• 10M requests/month
• 500 tokens average
• 0.001 GPU-seconds/token (optimized)
• $2/GPU-hour
Cost = (10M × 500 × 0.001 × 2) / 3600 = $2,778/month
```
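The same formula as a small calculator, using the example's illustrative inputs:
```python
# Monthly GPU cost estimate (pure arithmetic; reproduces the example above).
def monthly_cost(requests_per_month, avg_tokens, gpu_seconds_per_token, dollars_per_gpu_hour):
    gpu_seconds = requests_per_month * avg_tokens * gpu_seconds_per_token
    return gpu_seconds / 3600 * dollars_per_gpu_hour

print(f"${monthly_cost(10_000_000, 500, 0.001, 2.0):,.0f}/month")  # ~$2,778/month
```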
## Common Patterns
### Multi-model Routing
```text
┌─────────────────────────────────────────────────────────┐
│ Router │
│ • Classify request complexity │
│ • Route to appropriate model │
└─────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Small │ │ Medium │ │ Large │
│ Model │ │ Model │ │ Model │
│ (7B) │ │ (13B) │ │ (70B) │
│ Fast │ │ Balanced│ │ Quality │
└─────────┘ └─────────┘ └─────────┘
```
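A toy complexity-based router; the heuristic and endpoints are hypothetical placeholders for a real classifier and real deployments:
```python
# Illustrative request router: classify complexity, then pick a model endpoint.
MODEL_ENDPOINTS = {
    "small": "http://llm-7b.internal/v1",    # hypothetical deployments
    "medium": "http://llm-13b.internal/v1",
    "large": "http://llm-70b.internal/v1",
}

def classify(prompt: str) -> str:
    # Stand-in heuristic: a production router would use a trained classifier or
    # rules over task type, required context length, and quality requirements.
    if len(prompt) < 200 and "?" in prompt:
        return "small"
    if len(prompt) < 2000:
        return "medium"
    return "large"

def route(prompt: str) -> str:
    return MODEL_ENDPOINTS[classify(prompt)]

print(route("What is the capital of France?"))  # -> small-model endpoint
```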
### Caching Strategies
| Cache Type | What to Cache | TTL |
| ---------- | ------------- | --- |
| **Prompt cache** | Common system prompts | Long |
| **KV cache** | Prefix tokens | Session |
| **Response cache** | Exact query matches | Varies |
| **Embedding cache** | Document embeddings | Long |
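
As one concrete example of the response-cache row, a minimal exact-match cache keyed on prompt plus sampling parameters (a sketch; production systems typically put this in Redis or similar with a TTL):
```python
# Exact-match response cache sketch.
import hashlib

_cache = {}

def cache_key(prompt: str, temperature: float) -> str:
    return hashlib.sha256(f"{temperature}:{prompt}".encode()).hexdigest()

def cached_generate(prompt: str, temperature: float, generate) -> str:
    key = cache_key(prompt, temperature)
    if key not in _cache:
        _cache[key] = generate(prompt, temperature)  # call the real model only on a miss
    return _cache[key]
```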
## Related Skills
- `ml-system-design` - End-to-end ML pipeline design
- `rag-architecture` - Retrieval-augmented generation patterns
- `vector-databases` - Vector search for LLM context
- `ml-inference-optimization` - General inference optimization
- `estimation-techniques` - Capacity planning for LLM systems
## Version History
- v1.0.0 (2025-12-26): Initial release - LLM serving patterns for systems design interviews
---
## Last Updated
**Date:** 2025-12-26