model-serving skill

This skill helps deploy and optimize LLM and ML model serving for production using vLLM, BentoML, Triton, LangChain, and RAG workflows.

npx playbooks add skill ancoleman/ai-design-components --skill model-serving

---
name: model-serving
description: LLM and ML model deployment for inference. Use when serving models in production, building AI APIs, or optimizing inference. Covers vLLM (LLM serving), TensorRT-LLM (GPU optimization), Ollama (local), BentoML (ML deployment), Triton (multi-model), LangChain (orchestration), LlamaIndex (RAG), and streaming patterns.
---

# Model Serving

## Purpose

Deploy LLM and ML models for production inference with optimized serving engines, streaming response patterns, and orchestration frameworks. Focuses on self-hosted model serving, GPU optimization, and integration with frontend applications.

## When to Use

- Deploying LLMs for production (self-hosted Llama, Mistral, Qwen)
- Building AI APIs with streaming responses
- Serving traditional ML models (scikit-learn, XGBoost, PyTorch)
- Implementing RAG pipelines with vector databases
- Optimizing inference throughput and latency
- Integrating LLM serving with frontend chat interfaces

## Model Serving Selection

### LLM Serving Engines

**vLLM (Recommended Primary)**
- PagedAttention memory management (up to ~24x higher throughput than naive Hugging Face Transformers serving, per the vLLM authors)
- Continuous batching for dynamic request handling
- OpenAI-compatible API endpoints
- Use for: Most self-hosted LLM deployments

**TensorRT-LLM**
- Maximum GPU efficiency on NVIDIA hardware; often faster than vLLM, with the margin depending on model, batch size, and quantization
- Requires model conversion and optimization
- Use for: Production workloads needing absolute maximum throughput

**Ollama**
- Local development without GPUs
- Simple CLI interface
- Use for: Prototyping, laptop development, educational purposes
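
For quick local iteration, Ollama exposes an OpenAI-compatible endpoint on port 11434, so client code written against vLLM also works locally. A minimal sketch, assuming the `llama3.1` model has already been pulled with `ollama pull llama3.1`:

```python
# Sketch: query a locally running Ollama server via its OpenAI-compatible API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is required but ignored
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(response.choices[0].message.content)
```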

**Decision Framework:**
```
Self-hosted LLM deployment needed?
├─ Yes, need maximum throughput → vLLM
├─ Yes, need absolute max GPU efficiency → TensorRT-LLM
├─ Yes, local development only → Ollama
└─ No, use managed API (OpenAI, Anthropic) → No serving layer needed
```

### ML Model Serving (Non-LLM)

**BentoML (Recommended)**
- Python-native, easy deployment
- Adaptive batching for throughput
- Multi-framework support (scikit-learn, PyTorch, XGBoost)
- Use for: Most traditional ML model deployments

**Triton Inference Server**
- Multi-model serving on same GPU
- Model ensembles (chain multiple models)
- Use for: NVIDIA GPU optimization, serving 10+ models

### LLM Orchestration

**LangChain**
- General-purpose workflows, agents, RAG
- 100+ integrations (LLMs, vector DBs, tools)
- Use for: Most RAG and agent applications

**LlamaIndex**
- RAG-focused with advanced retrieval strategies
- 100+ data connectors (PDF, Notion, web)
- Use for: Applications where RAG is the primary use case (see the sketch below)
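
For comparison with the LangChain pipeline later in this skill, a minimal LlamaIndex sketch, assuming documents in a local `./docs` folder and a configured LLM/embedding provider (OpenAI by default):

```python
# Build an index over local files and query it (LlamaIndex 0.10+ module layout)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(similarity_top_k=3)
print(query_engine.query("What is PagedAttention?"))
```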

## Quick Start Examples

### vLLM Server Setup

```bash
# Install
pip install vllm

# Serve a model (OpenAI-compatible API)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype auto \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9 \
  --port 8000
```

**Key Parameters:**
- `--dtype`: Model precision (auto, float16, bfloat16)
- `--max-model-len`: Context window size
- `--gpu-memory-utilization`: GPU memory fraction (0.8-0.95)
- `--tensor-parallel-size`: Number of GPUs for model parallelism

### Streaming Responses (SSE Pattern)

**Backend (FastAPI):**
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
from pydantic import BaseModel
import json

app = FastAPI()
# Async client so streaming does not block the event loop
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

class ChatRequest(BaseModel):
    message: str

@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    async def generate():
        stream = await client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=[{"role": "user", "content": request.message}],
            stream=True,
            max_tokens=512
        )

        async for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                token = chunk.choices[0].delta.content
                yield f"data: {json.dumps({'token': token})}\n\n"

        yield f"data: {json.dumps({'done': True})}\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache"}
    )
```

**Frontend (React):**
```typescript
// Integration with the ai-chat skill; setResponse is the chat UI's React state setter
const sendMessage = async (message: string) => {
  const response = await fetch('/chat/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message })
  })

  const reader = response.body!.getReader()
  const decoder = new TextDecoder()

  while (true) {
    const { done, value } = await reader.read()
    if (done) break

    // Note: production code should buffer partial SSE events across reads
    const chunk = decoder.decode(value, { stream: true })
    const lines = chunk.split('\n\n')

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = JSON.parse(line.slice(6))
        if (data.token) {
          setResponse(prev => prev + data.token)
        }
      }
    }
  }
}
```

### BentoML Service

```python
import bentoml
import numpy as np

@bentoml.service(
    resources={"cpu": "2", "memory": "4Gi"},
    traffic={"timeout": 10}
)
class IrisClassifier:
    model_ref = bentoml.models.get("iris_classifier:latest")

    def __init__(self):
        self.model = bentoml.sklearn.load_model(self.model_ref)

    @bentoml.api(batchable=True, max_batch_size=32)
    def classify(self, features: list[dict]) -> list[str]:
        labels = ['setosa', 'versicolor', 'virginica']
        X = np.array([[f['sepal_length'], f['sepal_width'],
                       f['petal_length'], f['petal_width']] for f in features])
        predictions = self.model.predict(X)
        return [labels[int(p)] for p in predictions]
```
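
The service above resolves `iris_classifier:latest` from the local BentoML model store, so the model must be registered first. A one-time setup sketch (the classifier choice is illustrative):

```python
# Train a classifier and save it to the local BentoML model store
import bentoml
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier().fit(X, y)

# Stored under the tag "iris_classifier:<version>"; the service picks up ":latest"
bentoml.sklearn.save_model("iris_classifier", clf)
```

Serve it locally with `bentoml serve service:IrisClassifier`, assuming the service class lives in `service.py`.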

### LangChain RAG Pipeline

```python
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Qdrant
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split previously loaded documents (a list of LangChain Document objects) into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Qdrant.from_documents(
    chunks,
    embeddings,
    url="http://localhost:6333",
    collection_name="docs"
)

# Create retrieval chain
llm = ChatOpenAI(model="gpt-4o")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

# Query
result = qa_chain.invoke({"query": "What is PagedAttention?"})
```
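
To run the same chain fully self-hosted, point `ChatOpenAI` at the vLLM server started earlier instead of the managed API (a sketch; `OpenAIEmbeddings` would likewise need to be swapped for a local embedding model to avoid OpenAI entirely):

```python
# Swap the managed API for the local vLLM OpenAI-compatible endpoint
llm = ChatOpenAI(
    model="meta-llama/Llama-3.1-8B-Instruct",
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)
```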

## Performance Optimization

### GPU Memory Estimation

**Rule of thumb for LLM weight memory (the 1.2 factor covers runtime overhead; the KV cache needs additional headroom):**
```
GPU Memory (GB) = Model Parameters (B) × Precision (bytes) × 1.2
```

**Examples:**
- Llama-3.1-8B (FP16): 8B × 2 bytes × 1.2 = 19.2 GB
- Llama-3.1-70B (FP16): 70B × 2 bytes × 1.2 = 168 GB (≈ 4× A100 80 GB with tensor parallelism)

**Quantization reduces memory:**
- FP16: 2 bytes per parameter
- INT8: 1 byte per parameter (2x memory reduction)
- INT4: 0.5 bytes per parameter (4x memory reduction)
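
The rule of thumb and the quantization factors translate into a small helper, handy for sizing GPUs before downloading anything (a sketch; real deployments also need KV-cache headroom):

```python
def estimate_gpu_memory_gb(params_billions: float, bytes_per_param: float,
                           overhead: float = 1.2) -> float:
    """Rough weight-memory estimate; KV cache and activations need extra headroom."""
    return params_billions * bytes_per_param * overhead

print(estimate_gpu_memory_gb(8, 2))     # ~19.2 GB  (Llama-3.1-8B, FP16)
print(estimate_gpu_memory_gb(70, 2))    # ~168 GB   (Llama-3.1-70B, FP16)
print(estimate_gpu_memory_gb(70, 0.5))  # ~42 GB    (Llama-3.1-70B, INT4)
```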

### vLLM Optimization

```bash
# Enable quantization (AWQ, 4-bit); requires an AWQ-quantized checkpoint
vllm serve <org>/Llama-3.1-8B-Instruct-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.9

# Multi-GPU deployment (tensor parallelism)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9
```

### Batching Strategies

**Continuous batching (vLLM default):**
- Dynamically adds/removes requests from batch
- Higher throughput than static batching
- No configuration needed

**Adaptive batching (BentoML):**
```python
@bentoml.api(
    batchable=True,
    max_batch_size=32,
    max_latency_ms=1000  # Wait max 1s to fill batch
)
def predict(self, inputs: list[np.ndarray]) -> list[float]:
    # BentoML automatically batches requests
    return self.model.predict(np.array(inputs)).tolist()
```

## Production Deployment

### Kubernetes Deployment

See `examples/k8s-vllm-deployment/` for complete YAML manifests.

**Key considerations:**
- GPU resource requests: `nvidia.com/gpu: 1`
- Health checks: `/health` endpoint
- Horizontal Pod Autoscaling based on queue depth
- Persistent volume for model caching

### API Gateway Pattern

For production, add rate limiting, authentication, and monitoring:

**Kong Configuration:**
```yaml
services:
  - name: vllm-service
    url: http://vllm-llama-8b:8000
    plugins:
      - name: rate-limiting
        config:
          minute: 60  # 60 requests per minute per API key
      - name: key-auth
      - name: prometheus
```

### Monitoring Metrics

**Essential LLM metrics:**
- Tokens per second (throughput)
- Time to first token (TTFT)
- Inter-token latency
- GPU utilization and memory
- Queue depth

**Prometheus instrumentation:**
```python
import time

from prometheus_client import Counter, Histogram

requests_total = Counter('llm_requests_total', 'Total requests')
tokens_generated = Counter('llm_tokens_generated', 'Total tokens')
request_duration = Histogram('llm_request_duration_seconds', 'Request duration')

@app.post("/chat")
async def chat(request):
    requests_total.inc()
    start = time.time()
    response = await generate(request)  # generate() / response.tokens: your app's inference call
    tokens_generated.inc(len(response.tokens))
    request_duration.observe(time.time() - start)
    return response
```
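
Time to first token and inter-token latency are easiest to measure from a streaming client. A minimal sketch against the OpenAI-compatible endpoint used above:

```python
# Measure TTFT and mean inter-token latency for a single streamed request
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
first_token_at = None
token_times = []

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one paragraph."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        now = time.perf_counter()
        first_token_at = first_token_at or now
        token_times.append(now)

ttft = first_token_at - start
inter_token = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
print(f"TTFT: {ttft:.3f}s, mean inter-token latency: {inter_token * 1000:.1f} ms")
```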

## Integration Patterns

### Frontend (ai-chat) Integration

This skill provides the backend serving layer for the `ai-chat` skill.

**Flow:**
```
Frontend (React) → API Gateway → vLLM Server → GPU Inference
     ↑                                                  ↓
     └─────────── SSE Stream (tokens) ─────────────────┘
```

See `references/streaming-sse.md` for complete implementation patterns.

### RAG with Vector Databases

**Architecture:**
```
User Query → LangChain
              ├─> Vector DB (Qdrant) for retrieval
              ├─> Combine context + query
              └─> LLM (vLLM) for generation
```

See `references/langchain-orchestration.md` and `examples/langchain-rag-qdrant/` for complete patterns.

### Async Inference Queue

For batch processing or non-real-time inference:

```
Client → API → Message Queue (Celery) → Workers (vLLM) → Results DB
```

Useful for:
- Batch document processing
- Background summarization
- Non-interactive workflows
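
A minimal sketch of this pattern with Celery and a Redis broker (the broker URLs, task body, and model name are assumptions, not part of the bundled examples):

```python
# Queue pattern: API enqueues work, Celery workers call the shared vLLM instance
from celery import Celery
from openai import OpenAI

celery_app = Celery("inference", broker="redis://localhost:6379/0",
                    backend="redis://localhost:6379/1")
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

@celery_app.task
def summarize(document: str) -> str:
    # Results land in the Celery result backend for later retrieval
    response = llm.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": f"Summarize:\n\n{document}"}],
        max_tokens=256,
    )
    return response.choices[0].message.content

# Client side: summarize.delay(text) enqueues; .get() collects the result later
```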

## Benchmarking

Use `scripts/benchmark_inference.py` to measure a deployment's throughput and latency:

```bash
python scripts/benchmark_inference.py \
  --endpoint http://localhost:8000/v1/chat/completions \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --concurrency 32 \
  --requests 1000
```

**Outputs:**
- Requests per second
- P50/P95/P99 latency
- Tokens per second
- GPU memory usage

## Bundled Resources

**Detailed Guides:**
- `references/vllm.md` - vLLM setup, PagedAttention, optimization
- `references/tgi.md` - Text Generation Inference patterns
- `references/bentoml.md` - BentoML deployment patterns
- `references/langchain-orchestration.md` - LangChain RAG and agents
- `references/inference-optimization.md` - Quantization, batching, GPU tuning

**Working Examples:**
- `examples/vllm-serving/` - Complete vLLM + FastAPI streaming setup
- `examples/ollama-local/` - Local development with Ollama
- `examples/langchain-agents/` - LangChain agent patterns

**Utility Scripts:**
- `scripts/benchmark_inference.py` - Throughput and latency benchmarking
- `scripts/validate_model_config.py` - Validate deployment configurations

## Common Patterns

### Migration from OpenAI API

vLLM provides OpenAI-compatible endpoints for easy migration:

```python
# Before (OpenAI)
from openai import OpenAI
client = OpenAI(api_key="sk-...")

# After (vLLM)
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Same API calls work!
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}]
)
```

### Multi-Model Serving

Route requests to different models based on task:

```python
MODEL_ROUTING = {
    "small": "meta-llama/Llama-3.1-8B-Instruct",  # Fast, cheap
    "large": "meta-llama/Llama-3.1-70B-Instruct", # Accurate, expensive
    "code": "codellama/CodeLlama-34b-Instruct"    # Code-specific
}

@app.post("/chat")
async def chat(message: str, task: str = "small"):
    model = MODEL_ROUTING[task]
    # Route to appropriate vLLM instance
```
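
One way to complete the routing is a client per vLLM instance; the endpoint URLs below are hypothetical service names, not part of this skill's examples:

```python
# Hypothetical per-model vLLM endpoints, each running its own `vllm serve` process
from openai import AsyncOpenAI

VLLM_ENDPOINTS = {
    "meta-llama/Llama-3.1-8B-Instruct": "http://vllm-small:8000/v1",
    "meta-llama/Llama-3.1-70B-Instruct": "http://vllm-large:8000/v1",
    "codellama/CodeLlama-34b-Instruct": "http://vllm-code:8000/v1",
}

async def route_chat(message: str, task: str = "small") -> str:
    model = MODEL_ROUTING[task]
    client = AsyncOpenAI(base_url=VLLM_ENDPOINTS[model], api_key="not-needed")
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": message}],
    )
    return response.choices[0].message.content
```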

### Cost Optimization

**Track token usage:**
```python
import tiktoken

def estimate_cost(text: str, model: str, price_per_1k: float):
    encoding = tiktoken.encoding_for_model(model)
    tokens = len(encoding.encode(text))
    return (tokens / 1000) * price_per_1k

# Compare costs
openai_cost = estimate_cost(text, "gpt-4o", 0.005)  # ~$5 per 1M input tokens
self_hosted_marginal_cost = 0.0  # near-zero per-token cost; you pay a fixed GPU/infra cost instead
```
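
Self-hosted serving is not literally free: the marginal token cost depends on GPU price and sustained throughput. A rough comparison helper (the $2/hour and 1,000 tok/s figures are illustrative assumptions):

```python
def self_hosted_cost_per_1k_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    # Amortize the fixed GPU cost over the tokens it can generate in an hour
    tokens_per_hour = tokens_per_second * 3600
    return (gpu_hourly_usd / tokens_per_hour) * 1000

# e.g. one A100 at ~$2/hour sustaining ~1,000 tok/s ≈ $0.00056 per 1K tokens
print(self_hosted_cost_per_1k_tokens(2.0, 1000))
```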

## Troubleshooting

**Out of GPU memory:**
- Reduce `--max-model-len`
- Lower `--gpu-memory-utilization` (try 0.8)
- Enable quantization (`--quantization awq`)
- Use smaller model variant

**Low throughput:**
- Increase `--gpu-memory-utilization` (try 0.95)
- Enable continuous batching (vLLM default)
- Check GPU utilization (should be >80%)
- Consider tensor parallelism for multi-GPU

**High latency:**
- Reduce batch size if using static batching
- Check network latency to GPU server
- Profile with `scripts/benchmark_inference.py`

## Next Steps

1. **Local Development**: Start with `examples/ollama-local/` for GPU-free testing
2. **Production Setup**: Deploy vLLM with `examples/vllm-serving/`
3. **RAG Integration**: Add vector DB with `examples/langchain-rag-qdrant/`
4. **Kubernetes**: Scale with `examples/k8s-vllm-deployment/`
5. **Monitoring**: Add metrics with Prometheus and Grafana

Overview

This skill provides practical guidance and ready-to-run patterns for deploying LLMs and traditional ML models in production. It covers LLM serving engines (vLLM, TensorRT-LLM, Ollama), ML servers (BentoML, Triton), orchestration tools (LangChain, LlamaIndex), and streaming/monitoring patterns to optimize inference. Use it to design reliable, low-latency AI APIs and integrate them with frontend chat experiences.

How this skill works

The skill explains which serving engine to choose based on throughput, GPU efficiency, and development stage, and provides concrete setup examples for vLLM, TensorRT-LLM, Ollama, BentoML, and Triton. It includes streaming patterns (SSE/Web streams) for token-by-token delivery, batching strategies (continuous and adaptive), GPU memory estimation, quantization tips, and production deployment considerations like Kubernetes manifests, API gateway configuration, and monitoring metrics.

When to use it

  • Self-host LLMs for production inference and OpenAI-compatible APIs
  • Build AI APIs that stream tokens to frontend chat interfaces
  • Serve traditional ML models with batching and resource controls
  • Implement RAG pipelines with vector DBs and LangChain/LlamaIndex
  • Maximize GPU throughput or minimize inference latency

Best practices

  • Start with vLLM for most self-hosted LLM needs; use TensorRT-LLM only when absolute GPU efficiency is required
  • Enable streaming (SSE) to improve perceived latency and deliver tokens incrementally to UIs
  • Use quantization (INT8/4-bit) to reduce memory and enable larger models on limited GPUs
  • Instrument tokens/sec, time-to-first-token, inter-token latency, and GPU memory with Prometheus
  • Apply adaptive or continuous batching to increase throughput while bounding latency

Example use cases

  • Deploy a production chat API with vLLM, FastAPI backend, and React SSE client for streaming responses
  • Run high-throughput inference for latency-sensitive services using TensorRT-LLM and multi-GPU tensor parallelism
  • Prototype locally on laptop with Ollama, then migrate to vLLM for production
  • Serve classical ML models (scikit-learn, XGBoost) via BentoML with built-in batching and resource constraints
  • Build a LangChain RAG pipeline that retrieves from Qdrant and generates answers with a self-hosted LLM

FAQ

Which serving engine should I choose for general self-hosted LLMs?

Use vLLM as the default: it offers PagedAttention, continuous batching, and OpenAI-compatible endpoints for most deployments.

When should I use TensorRT-LLM?

Choose TensorRT-LLM when you need absolute maximum GPU throughput and are willing to run model conversion and optimization steps.

How do I reduce out-of-memory errors?

Lower the context length, enable quantization, reduce GPU memory utilization, or switch to a smaller model; consider tensor parallelism for very large models.