---
name: llm-serving-patterns
description: LLM inference infrastructure, serving frameworks (vLLM, TGI, TensorRT-LLM), quantization techniques, batching strategies, and streaming response patterns. Use when designing LLM serving infrastructure, optimizing inference latency, or scaling LLM deployments.
allowed-tools: Read, Glob, Grep
---

# LLM Serving Patterns

## When to Use This Skill

Use this skill when:

- Designing LLM inference infrastructure
- Choosing between serving frameworks (vLLM, TGI, TensorRT-LLM)
- Implementing quantization for production deployment
- Optimizing batching and throughput
- Building streaming response systems
- Scaling LLM deployments cost-effectively

**Keywords:** LLM serving, inference, vLLM, TGI, TensorRT-LLM, quantization, INT8, INT4, FP16, batching, continuous batching, streaming, SSE, WebSocket, KV cache, PagedAttention, speculative decoding

## LLM Serving Architecture Overview

```text
┌──────────────────────────────────────────────────────────────────┐
│                        LLM Serving Stack                         │
├──────────────────────────────────────────────────────────────────┤
│  Clients (API, Chat UI, Agents)                                  │
│       │                                                          │
│       ▼                                                          │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                Load Balancer / API Gateway                 │  │
│  │  • Rate limiting  • Authentication  • Request routing      │  │
│  └────────────────────────────────────────────────────────────┘  │
│       │                                                          │
│       ▼                                                          │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                      Inference Server                      │  │
│  │  ┌─────────────┐   ┌─────────────┐   ┌──────────────────┐  │  │
│  │  │  Request    │──▶│  Batching   │──▶│  KV Cache        │  │  │
│  │  │  Queue      │   │  Engine     │   │  Management      │  │  │
│  │  └─────────────┘   └─────────────┘   └──────────────────┘  │  │
│  │         │                                     │            │  │
│  │         ▼                                     ▼            │  │
│  │  ┌──────────────────────────────────────────────────────┐  │  │
│  │  │              Model Execution Engine                  │  │  │
│  │  │  • Tensor operations  • Attention  • Token sampling  │  │  │
│  │  └──────────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────────────┘  │
│       │                                                          │
│       ▼                                                          │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                      GPU/TPU Cluster                       │  │
│  │  • Model sharding • Tensor parallelism • Pipeline parallel │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
```

## Serving Framework Comparison

| Framework | Strengths | Best For | Considerations |
| --------- | --------- | -------- | -------------- |
| **vLLM** | PagedAttention, high throughput, continuous batching | General LLM serving, high concurrency | Python-native, active community |
| **TGI (Text Generation Inference)** | Production-ready, Hugging Face integration | Enterprise deployment, HF models | Rust backend, Docker-first |
| **TensorRT-LLM** | NVIDIA optimization, lowest latency | NVIDIA GPUs, latency-critical | NVIDIA-only, complex setup |
| **Triton Inference Server** | Multi-model, multi-framework | Heterogeneous model serving | Enterprise complexity |
| **Ollama** | Simple local deployment | Development, edge deployment | Limited scaling features |
| **llama.cpp** | CPU inference, quantization | Resource-constrained, edge | C++ integration required |

### Framework Selection Decision Tree

```text
Need lowest latency on NVIDIA GPUs?
├── Yes → TensorRT-LLM
└── No
    └── Need high throughput with many concurrent users?
        ├── Yes → vLLM (PagedAttention)
        └── No
            └── Need enterprise features + HF integration?
                ├── Yes → TGI
                └── No
                    └── Simple local/edge deployment?
                        ├── Yes → Ollama or llama.cpp
                        └── No → vLLM (general purpose)
```
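
For reference, a minimal offline-inference sketch using vLLM's Python API (the model name is a placeholder; verify the API against the vLLM version you install):

```python
# Minimal vLLM offline-inference sketch (pip install vllm).
# The model name is a placeholder -- substitute any HF model you can access.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

# generate() batches prompts internally via continuous batching
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```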

## Quantization Techniques

### Precision Levels

| Precision | Bits | Memory Reduction | Quality Impact | Use Case |
| --------- | ---- | ---------------- | -------------- | -------- |
| FP32 | 32 | Baseline | None | Training, reference |
| FP16/BF16 | 16 | 2x | Minimal | Standard serving |
| INT8 | 8 | 4x | Low | Production serving |
| INT4 | 4 | 8x | Moderate | Resource-constrained |
| INT2 | 2 | 16x | Significant | Experimental |

### Quantization Methods

| Method | Description | Quality | Speed |
| ------ | ----------- | ------- | ----- |
| **PTQ (Post-Training Quantization)** | Quantize after training, no retraining | Good | Fast to apply |
| **QAT (Quantization-Aware Training)** | Simulate quantization during training | Better | Requires training |
| **GPTQ** | One-shot weight quantization | Very good | Moderate |
| **AWQ (Activation-aware Weight Quantization)** | Preserves salient weights | Excellent | Moderate |
| **GGUF/GGML** | llama.cpp format, CPU-optimized | Good | Very fast inference |
| **SmoothQuant** | Migrates difficulty to weights | Excellent | Moderate |

### Quantization Selection

```text
Quality vs. Efficiency Trade-off:

Quality ────────────────────────────────────────────▶ Efficiency

  FP32      FP16      INT8     INT4+AWQ    INT4 (naive)    INT2
   ○─────────○─────────○──────────○─────────────○───────────○
  Best      Great     Good       Good          Fair        Poor

(AWQ/GPTQ calibration recovers most of the quality lost by naive INT4 rounding)
```
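
Serving a pre-quantized checkpoint is usually a configuration change rather than a code change. A sketch with vLLM (the checkpoint name is illustrative; any AWQ-quantized model on the HF Hub works):

```python
# Sketch: serving an AWQ-quantized (4-bit weight) checkpoint with vLLM.
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-13B-AWQ",  # illustrative AWQ checkpoint
    quantization="awq",                # select vLLM's AWQ kernels
    dtype="half",                      # activations stay in FP16 (W4A16)
)
```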

## Batching Strategies

### Static Batching

```text
Request 1: [tokens: 100] ─┐
Request 2: [tokens: 50]  ─┼──▶ [Batch: pad to 100] ──▶ Process ──▶ All complete
Request 3: [tokens: 80]  ─┘

Problem: Short requests wait for long ones (head-of-line blocking)
```

### Continuous Batching (Preferred)

```text
Time ──────────────────────────────────────────────────────────▶

Req 1: [████████████████████████████████] ──▶ Complete
Req 2: [████████████] ──▶ Complete ──▶ Req 4 starts [████████████████]
Req 3: [████████████████████] ──▶ Complete ──▶ Req 5 starts [████████]

• New requests join batch as others complete
• No padding waste
• Optimal GPU utilization
```

### Batching Parameters

| Parameter | Description | Trade-off |
| --------- | ----------- | --------- |
| `max_batch_size` | Maximum concurrent requests | Memory vs. throughput |
| `max_waiting_tokens` | Tokens before forcing batch | Latency vs. throughput |
| `max_num_seqs` | Maximum sequences in batch | Memory vs. concurrency |
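
As a sketch, here is how these knobs map onto vLLM's engine arguments (`max_waiting_tokens` in the table is a TGI server flag; the names below are vLLM's):

```python
# Sketch: continuous-batching limits as vLLM engine arguments.
# Model name is a placeholder; tune the numbers to your GPU.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_num_seqs=64,              # max sequences batched together per step
    max_num_batched_tokens=8192,  # max total tokens processed per step
    gpu_memory_utilization=0.90,  # VRAM fraction for weights + KV cache
)
```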

## KV Cache Management

### The KV Cache Problem

```text
Attention(Q, K, V) = softmax(QK^T / √d_k) × V

For each token generated:
• The new token's query must attend to the K and V of ALL previous tokens
• Caching K and V avoids recomputing them, but the cache grows with sequence length
• Memory: O(batch_size × seq_len × num_layers × hidden_dim)

Example (70B model, FP16, full multi-head attention, 4K context):
• KV cache per request: ~8-10GB (grouped-query attention shrinks this substantially)
• 10 concurrent requests: ~80-100GB GPU memory
```
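
A back-of-envelope sizing helper for the numbers above, assuming FP16 and full multi-head attention (the 80-layer, 8192-hidden dimensions are Llama-2-70B-style illustrations):

```python
# Back-of-envelope KV cache sizing under FP16 + full multi-head attention.
def kv_cache_bytes(num_layers, hidden_dim, seq_len, batch_size=1, bytes_per_val=2):
    # 2x accounts for storing both K and V at every layer
    return 2 * num_layers * hidden_dim * seq_len * batch_size * bytes_per_val

per_request = kv_cache_bytes(num_layers=80, hidden_dim=8192, seq_len=4096)
print(f"{per_request / 1e9:.1f} GB per request")          # ~10.7 GB
print(f"{10 * per_request / 1e9:.0f} GB for 10 requests")
# Grouped-query attention (e.g. 8 KV heads instead of 64) divides this by 8.
```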

### PagedAttention (vLLM Innovation)

```text
Traditional KV Cache:
┌──────────────────────────────────────────┐
│ Request 1 KV Cache (contiguous, fixed)   │ ← Wastes memory
├──────────────────────────────────────────┤
│ Request 2 KV Cache (contiguous, fixed)   │
├──────────────────────────────────────────┤
│ FRAGMENTED/WASTED SPACE                  │
└──────────────────────────────────────────┘

PagedAttention:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ R1 │ R2 │ R1 │ R3 │ R2 │ R1 │ R3 │ R2 │  ← Pages allocated on demand
└────┴────┴────┴────┴────┴────┴────┴────┘
• Non-contiguous memory allocation
• Near-zero memory waste
• 2-4x higher throughput
```

### KV Cache Optimization Strategies

| Strategy | Description | Memory Savings |
| -------- | ----------- | -------------- |
| **Paged Attention** | Virtual memory for KV cache | ~50% reduction |
| **Prefix Caching** | Reuse KV cache for shared prefixes (e.g. system prompts) | Up to 100% of the prefix |
| **Quantized KV Cache** | INT8/FP8 for KV values | 50-75% reduction |
| **Sliding Window** | Limited attention context | Linear memory |
| **MQA/GQA** | Grouped query attention | Architecture-dependent |
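
A sketch of enabling automatic prefix caching in vLLM, so the KV cache for a shared system prompt is computed once and reused (model name is a placeholder):

```python
# Sketch: prefix caching in vLLM -- the shared system prompt's KV cache
# is computed on the first request and reused afterwards.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
)

system = "You are a helpful assistant.\n\n"  # shared prefix
params = SamplingParams(max_tokens=64)
llm.generate([system + "Question 1", system + "Question 2"], params)
```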

## Streaming Response Patterns

### Server-Sent Events (SSE)

```text
Client                                Server
   │                                     │
   │──── POST /v1/chat/completions ─────▶│
   │      (stream: true)                 │
   │                                     │
   │◀──── HTTP 200 OK ───────────────────│
   │      Content-Type: text/event-stream│
   │                                     │
   │◀──── data: {"token": "Hello"} ──────│
   │◀──── data: {"token": " world"} ─────│
   │◀──── data: {"token": "!"} ──────────│
   │◀──── data: [DONE] ──────────────────│
   │                                     │
```

**SSE Benefits:**

- HTTP/1.1 compatible
- Auto-reconnection support
- Simple to implement
- Wide client support
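
A minimal SSE consumer, assuming an OpenAI-compatible endpoint (the URL and model name are placeholders; the chunk fields follow the OpenAI chat-completions streaming schema):

```python
# Minimal SSE streaming client (pip install httpx).
import json
import httpx

payload = {
    "model": "my-model",  # placeholder
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
}
with httpx.stream("POST", "http://localhost:8000/v1/chat/completions",
                  json=payload, timeout=None) as resp:
    for line in resp.iter_lines():
        if not line.startswith("data: "):
            continue                      # skip keep-alives / blank lines
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        print(delta.get("content", ""), end="", flush=True)
```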

### WebSocket Streaming

```text
Client                                Server
   │                                     │
   │──── WebSocket Upgrade ─────────────▶│
   │◀──── 101 Switching Protocols ───────│
   │                                     │
   │──── {"prompt": "Hello"} ───────────▶│
   │                                     │
   │◀──── {"token": "Hi"} ───────────────│
   │◀──── {"token": " there"} ───────────│
   │◀──── {"token": "!"} ────────────────│
   │◀──── {"done": true} ────────────────│
   │                                     │
```

**WebSocket Benefits:**

- Bidirectional communication
- Lower latency
- Better for chat applications
- Connection persistence
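
A client-side sketch using the hypothetical message schema from the diagram above (real servers define their own protocols):

```python
# WebSocket token-stream client sketch (pip install websockets).
# Endpoint and message schema mirror the diagram; they are hypothetical.
import asyncio
import json
import websockets

async def chat() -> None:
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        await ws.send(json.dumps({"prompt": "Hello"}))
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("done"):
                break
            print(msg["token"], end="", flush=True)

asyncio.run(chat())
```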

### Streaming Implementation Considerations

| Aspect | SSE | WebSocket |
| ------ | --- | --------- |
| **Reconnection** | Built-in | Manual |
| **Scalability** | Per-request | Connection pool |
| **Load Balancing** | Standard HTTP | Sticky sessions |
| **Firewall/Proxy** | Usually works | May need config |
| **Best For** | One-way streaming | Interactive chat |

## Speculative Decoding

### Concept

```text
Standard Decoding:
Large Model: [T1] → [T2] → [T3] → [T4] → [T5]
             10ms   10ms   10ms   10ms   10ms = 50ms total

Speculative Decoding:
Draft Model: [T1, T2, T3, T4, T5] (parallel, 5ms)
                      │
                      ▼
Large Model: [Verify T1-T5 in one pass] (15ms)
             Accept: T1, T2, T3 ✓  Reject: T4, T5 ✗
                      │
                      ▼
             [Generate T4, T5 correctly]

Total: ~25ms (2x speedup if 60% acceptance)
```

### Speculative Decoding Trade-offs

| Factor | Impact |
| ------ | ------ |
| **Draft model quality** | Higher match rate = more speedup |
| **Draft model size** | Larger = better quality, slower |
| **Speculation depth** | More tokens = higher risk/reward |
| **Verification cost** | Must be < sequential generation |
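
To make the 2x figure concrete, a toy latency model, assuming the verification pass itself emits the large model's token at the first rejection (so each round yields accepted drafts + 1 tokens):

```python
# Toy arithmetic for the example above: 5 ms draft round, 15 ms verify,
# 3 accepted drafts + 1 corrected token per round, vs. 10 ms/token alone.
def spec_ms_per_token(accepted=3, t_draft_round=5.0, t_verify=15.0):
    tokens_per_round = accepted + 1     # +1 from the verify pass itself
    return (t_draft_round + t_verify) / tokens_per_round

seq_ms = 10.0                           # large model alone, ms/token
spec_ms = spec_ms_per_token()           # 20 ms / 4 tokens = 5 ms/token
print(f"speedup: {seq_ms / spec_ms:.1f}x")  # 2.0x, matching the diagram
```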

## Scaling Strategies

### Horizontal Scaling

```text
┌─────────────────────────────────────────────────────────┐
│                    Load Balancer                        │
│         (Round-robin, Least-connections)                │
└─────────────────────────────────────────────────────────┘
         │              │              │
         ▼              ▼              ▼
    ┌─────────┐    ┌─────────┐    ┌─────────┐
    │ vLLM    │    │ vLLM    │    │ vLLM    │
    │ Node 1  │    │ Node 2  │    │ Node 3  │
    │ (GPU×4) │    │ (GPU×4) │    │ (GPU×4) │
    └─────────┘    └─────────┘    └─────────┘
```

### Model Parallelism

| Strategy | Description | Use Case |
| -------- | ----------- | -------- |
| **Tensor Parallelism** | Split each layer's weights across GPUs | Single large model |
| **Pipeline Parallelism** | Different layers on different GPUs | Very large models |
| **Data Parallelism** | Same model, different batches | High throughput |

```text
Tensor Parallelism (TP=4):
┌─────────────────────────────────────────┐
│              Layer N                     │
│  GPU0   │   GPU1   │   GPU2   │   GPU3  │
│  25%    │   25%    │   25%    │   25%   │
└─────────────────────────────────────────┘

Pipeline Parallelism (PP=4):
GPU0: Layers 0-7
GPU1: Layers 8-15
GPU2: Layers 16-23
GPU3: Layers 24-31
```
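
A sketch of requesting these layouts in vLLM (the model name is a placeholder; `tensor_parallel_size` and `pipeline_parallel_size` are vLLM engine arguments):

```python
# Sketch: shard each layer's weights across 4 GPUs (TP=4), matching the
# diagram above; pipeline parallelism uses pipeline_parallel_size instead.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder large model
    tensor_parallel_size=4,
)
```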

## Latency Optimization Checklist

### Pre-deployment

- [ ] Choose appropriate quantization (INT8 for production)
- [ ] Enable continuous batching
- [ ] Configure KV cache size appropriately
- [ ] Set optimal batch size for hardware
- [ ] Enable prefix caching for system prompts

### Runtime

- [ ] Monitor GPU memory utilization
- [ ] Track p50/p95/p99 latencies
- [ ] Measure time-to-first-token (TTFT)
- [ ] Monitor tokens-per-second (TPS)
- [ ] Set appropriate timeouts

### Infrastructure

- [ ] Use fastest available interconnect (NVLink, InfiniBand)
- [ ] Minimize network hops
- [ ] Place inference close to users (edge)
- [ ] Consider dedicated inference hardware

## Cost Optimization

### Cost Drivers

| Factor | Impact | Optimization |
| ------ | ------ | ------------ |
| **GPU hours** | Highest | Quantization, batching |
| **Memory** | High | PagedAttention, KV cache optimization |
| **Network** | Medium | Response compression, edge deployment |
| **Storage** | Low | Model deduplication |

### Cost Estimation Formula

```text
Monthly Cost =
  (Requests/month) × (Avg tokens/request) × (GPU-seconds/token) × ($/GPU-hour)
  ─────────────────────────────────────────────────────────────────────────────
                                    3600

Example:
• 10M requests/month
• 500 tokens average
• 0.001 GPU-seconds/token (optimized)
• $2/GPU-hour

Cost = (10M × 500 × 0.001 × 2) / 3600 = $2,778/month
```
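
The same formula as a helper function, reproducing the worked example:

```python
# Cost formula from above as a function.
def monthly_cost(requests, avg_tokens, gpu_sec_per_token, usd_per_gpu_hour):
    gpu_hours = requests * avg_tokens * gpu_sec_per_token / 3600
    return gpu_hours * usd_per_gpu_hour

print(f"${monthly_cost(10_000_000, 500, 0.001, 2.0):,.0f}/month")  # $2,778/month
```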

## Common Patterns

### Multi-model Routing

```text
┌─────────────────────────────────────────────────────────┐
│                     Router                              │
│  • Classify request complexity                          │
│  • Route to appropriate model                           │
└─────────────────────────────────────────────────────────┘
         │              │              │
         ▼              ▼              ▼
    ┌─────────┐    ┌─────────┐    ┌─────────┐
    │ Small   │    │ Medium  │    │ Large   │
    │ Model   │    │ Model   │    │ Model   │
    │ (7B)    │    │ (13B)   │    │ (70B)   │
    │ Fast    │    │ Balanced│    │ Quality │
    └─────────┘    └─────────┘    └─────────┘
```
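
A sketch of the routing layer; the length-based classifier and the endpoints are stand-ins (production routers typically use a small classifier model or logged difficulty signals):

```python
# Complexity-based router sketch. Tiers and heuristic are illustrative.
MODEL_TIERS = {
    "small": "http://small-7b:8000",    # hypothetical endpoints
    "medium": "http://medium-13b:8000",
    "large": "http://large-70b:8000",
}

def route(prompt: str) -> str:
    """Pick a model tier from a crude prompt-complexity heuristic."""
    if len(prompt) < 200:
        return MODEL_TIERS["small"]
    if len(prompt) < 1000:
        return MODEL_TIERS["medium"]
    return MODEL_TIERS["large"]
```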

### Caching Strategies

| Cache Type | What to Cache | TTL |
| ---------- | ------------- | --- |
| **Prompt cache** | Common system prompts | Long |
| **KV cache** | Prefix tokens | Session |
| **Response cache** | Exact query matches | Varies |
| **Embedding cache** | Document embeddings | Long |

## Related Skills

- `ml-system-design` - End-to-end ML pipeline design
- `rag-architecture` - Retrieval-augmented generation patterns
- `vector-databases` - Vector search for LLM context
- `ml-inference-optimization` - General inference optimization
- `estimation-techniques` - Capacity planning for LLM systems

## Version History

- v1.0.0 (2025-12-26): Initial release - LLM serving patterns for systems design interviews

---

## Last Updated

**Date:** 2025-12-26