
llm-gateway-routing skill

/skills/llm-gateway-routing

This skill helps configure multi-model access, routing, fallbacks, and A/B testing for LLMs using OpenRouter and LiteLLM to improve reliability and control costs.

npx playbooks add skill phrazzld/claude-config --skill llm-gateway-routing

Review the files below or copy the command above to add this skill to your agents.

Files (8)
SKILL.md
10.4 KB
---
name: llm-gateway-routing
description: |
  LLM gateway and routing configuration using OpenRouter and LiteLLM.
  Invoke when:
  - Setting up multi-model access (OpenRouter, LiteLLM)
  - Configuring model fallbacks and reliability
  - Implementing cost-based or latency-based routing
  - A/B testing different models
  - Self-hosting an LLM proxy
  Keywords: openrouter, litellm, llm gateway, model routing, fallback, A/B testing
effort: high
---

# LLM Gateway & Routing

Configure multi-model access, fallbacks, cost optimization, and A/B testing.

## Why Use a Gateway?

**Without gateway:**
- Vendor lock-in (one provider)
- No fallbacks (provider down = app down)
- Hard to A/B test models
- Scattered API keys and configs

**With gateway:**
- Single API for 400+ models
- Automatic fallbacks
- Easy model switching
- Unified cost tracking

## Quick Decision

| Need | Solution |
|------|----------|
| Fastest setup, multi-model | **OpenRouter** |
| Full control, self-hosted | **LiteLLM** |
| Observability + routing | **Helicone** |
| Enterprise, guardrails | **Portkey** |

## OpenRouter (Recommended)

### Why OpenRouter

- **400+ models**: OpenAI, Anthropic, Google, Meta, Mistral, and more
- **Single API**: One key for all providers
- **Automatic fallbacks**: Built-in reliability
- **A/B testing**: Easy model comparison
- **Cost tracking**: Unified billing dashboard
- **Free credits**: $1 free to start

### Setup

```bash
# 1. Sign up at openrouter.ai
# 2. Get API key from dashboard
# 3. Add to .env:
OPENROUTER_API_KEY=sk-or-v1-...
```

### Basic Usage

```typescript
// Using fetch
const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.OPENROUTER_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'anthropic/claude-3-5-sonnet',
    messages: [{ role: 'user', content: 'Hello!' }],
  }),
});
```
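
Continuing the example above: the response body follows the OpenAI-compatible chat completions shape, so the generated text can be read from `choices[0].message.content`.

```typescript
// OpenRouter returns an OpenAI-compatible payload
const data = await response.json();
const text = data.choices[0].message.content;
console.log(text);
```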

### With Vercel AI SDK (Recommended)

```typescript
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

const openrouter = createOpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

const { text } = await generateText({
  model: openrouter("anthropic/claude-3-5-sonnet"),
  prompt: "Explain quantum computing",
});
```
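
For chat UIs, the same provider works with the AI SDK's `streamText`. A minimal sketch, assuming AI SDK v4+ (where `streamText` returns its result synchronously) and the `openrouter` provider defined above:

```typescript
import { streamText } from "ai";

// Stream tokens as they arrive instead of waiting for the full completion
const result = streamText({
  model: openrouter("anthropic/claude-3-5-sonnet"),
  prompt: "Explain quantum computing",
});

for await (const chunk of result.textStream) {
  process.stdout.write(chunk);
}
```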

### Model IDs

```typescript
// Format: provider/model-name
const models = {
  // Anthropic
  claude35Sonnet: "anthropic/claude-3-5-sonnet",
  claudeHaiku: "anthropic/claude-3-5-haiku",

  // OpenAI
  gpt4o: "openai/gpt-4o",
  gpt4oMini: "openai/gpt-4o-mini",

  // Google
  geminiPro: "google/gemini-pro-1.5",
  geminiFlash: "google/gemini-flash-1.5",

  // Meta
  llama3: "meta-llama/llama-3.1-70b-instruct",

  // Auto (OpenRouter picks best)
  auto: "openrouter/auto",
};
```

### Fallback Chains

```typescript
// Define fallback order
const modelChain = [
  "anthropic/claude-3-5-sonnet",   // Primary
  "openai/gpt-4o",                  // Fallback 1
  "google/gemini-pro-1.5",          // Fallback 2
];

async function callWithFallback(prompt: string) {
  for (const model of modelChain) {
    try {
      // Reuses the `openrouter` provider and `generateText` from the AI SDK example above
      return await generateText({ model: openrouter(model), prompt });
    } catch (error) {
      console.warn(`${model} failed, trying next...`);
    }
  }
  throw new Error("All models failed");
}
```
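
OpenRouter can also handle fallbacks server-side. A minimal sketch, assuming the request-level `models` parameter from OpenRouter's model-routing docs, which tries the listed models in order:

```typescript
// Server-side fallback: OpenRouter attempts each listed model until one succeeds
const response = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    models: ["anthropic/claude-3-5-sonnet", "openai/gpt-4o", "google/gemini-pro-1.5"],
    messages: [{ role: "user", content: "Hello!" }],
  }),
});
```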

### Cost Routing

```typescript
// Route based on query complexity
function selectModel(query: string): string {
  const complexity = analyzeComplexity(query);

  if (complexity === "simple") {
    // Simple queries → cheap model
    return "openai/gpt-4o-mini";  // ~$0.15/1M tokens
  } else if (complexity === "medium") {
    // Medium → balanced
    return "google/gemini-flash-1.5";  // ~$0.075/1M tokens
  } else {
    // Complex → best quality
    return "anthropic/claude-3-5-sonnet";  // ~$3/1M tokens
  }
}

function analyzeComplexity(query: string): "simple" | "medium" | "complex" {
  // Simple heuristics
  if (query.length < 50) return "simple";
  if (query.includes("explain") || query.includes("analyze")) return "complex";
  return "medium";
}
```
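
Wiring this into the AI SDK setup from earlier might look like the sketch below; `answer` is a hypothetical helper that returns the chosen model alongside the text so it can be logged.

```typescript
// Pick a model per query, then call it through the `openrouter` provider defined above
async function answer(query: string) {
  const model = selectModel(query);
  const { text } = await generateText({
    model: openrouter(model),
    prompt: query,
  });
  return { model, text };
}
```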

### A/B Testing

```typescript
// Deterministic assignment: hash the user ID so each user always gets the same model
function hashUser(userId: string): number {
  let hash = 0;
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return hash;
}

function getModel(userId: string): string {
  return hashUser(userId) % 100 < 50
    ? "anthropic/claude-3-5-sonnet"  // ~50% of users
    : "openai/gpt-4o";               // ~50% of users
}

// Track which model was used (analytics client is app-specific)
const model = getModel(userId);
const start = Date.now();
const { text, usage } = await generateText({ model: openrouter(model), prompt });
await analytics.track("llm_call", { model, userId, latencyMs: Date.now() - start, usage });
```
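
A weighted split is often more useful than 50/50, for example to canary a new model on a small slice of traffic. A sketch reusing the `hashUser` helper above; the weights and the `experiment` name are illustrative:

```typescript
// Weighted assignment: 90% control, 10% canary (illustrative weights)
const experiment = [
  { model: "anthropic/claude-3-5-sonnet", weight: 0.9 },
  { model: "openai/gpt-4o", weight: 0.1 },
];

function getWeightedModel(userId: string): string {
  const bucket = (hashUser(userId) % 1000) / 1000;  // deterministic value in [0, 1)
  let cumulative = 0;
  for (const arm of experiment) {
    cumulative += arm.weight;
    if (bucket < cumulative) return arm.model;
  }
  return experiment[experiment.length - 1].model;
}
```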

## LiteLLM (Self-Hosted)

### Why LiteLLM

- **Self-hosted**: Full control over data
- **100+ providers**: Same coverage as OpenRouter
- **Load balancing**: Distribute across providers
- **Cost tracking**: Built-in spend management
- **Caching**: Redis or in-memory
- **Rate limiting**: Per-user limits

### Setup

```bash
# Install
pip install 'litellm[proxy]'

# Run proxy
litellm --config config.yaml

# Use as OpenAI-compatible endpoint
export OPENAI_API_BASE=http://localhost:4000
```
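
A quick way to verify the proxy is up is to hit its OpenAI-compatible endpoint directly. A sketch; the master key and model alias come from your config.yaml:

```bash
# Smoke-test the proxy with an OpenAI-compatible request
curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-master-..." \
  -H "Content-Type: application/json" \
  -d '{"model": "claude-sonnet", "messages": [{"role": "user", "content": "Hello!"}]}'
```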

### Configuration

```yaml
# config.yaml
model_list:
  # Claude models
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
      api_key: sk-ant-...

  # OpenAI models
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-...

  # Load balanced: repeat the same model_name across providers and
  # the router distributes requests between the deployments
  - model_name: balanced
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
      api_key: sk-ant-...
  - model_name: balanced
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-...

# General settings
general_settings:
  master_key: sk-master-...
  database_url: postgresql://...

# Routing
router_settings:
  routing_strategy: simple-shuffle  # or latency-based-routing
  num_retries: 3
  timeout: 30

# Budget controls
litellm_settings:
  max_budget: 100  # $100/month
  budget_duration: monthly
```

### Fallbacks in LiteLLM

```yaml
model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
      api_key: sk-ant-...
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-...
  - model_name: gemini-pro
    litellm_params:
      model: gemini/gemini-1.5-pro
      api_key: ...

# Fallbacks map a primary model alias to an ordered list of backups
router_settings:
  fallbacks: [{"claude-sonnet": ["gpt-4o", "gemini-pro"]}]
```

### Usage

```typescript
// Use like OpenAI SDK
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:4000",
  apiKey: "sk-master-...",
});

const response = await client.chat.completions.create({
  model: "claude-sonnet",  // Maps to configured model
  messages: [{ role: "user", content: "Hello!" }],
});
```

## Routing Strategies

### 1. Cost-Based Routing

```typescript
const costTiers = {
  cheap: ["openai/gpt-4o-mini", "google/gemini-flash-1.5"],
  balanced: ["anthropic/claude-3-5-haiku", "openai/gpt-4o"],
  premium: ["anthropic/claude-3-5-sonnet", "openai/o1-preview"],
};

function routeByCost(budget: "cheap" | "balanced" | "premium"): string {
  const models = costTiers[budget];
  return models[Math.floor(Math.random() * models.length)];
}
```

### 2. Latency-Based Routing

```typescript
// Track latency per model
const latencyStats: Record<string, number[]> = {};

function routeByLatency(): string {
  const avgLatencies = Object.entries(latencyStats)
    .map(([model, times]) => ({
      model,
      avg: times.reduce((a, b) => a + b, 0) / times.length,
    }))
    .sort((a, b) => a.avg - b.avg);

  // No samples yet: fall back to a cheap default until stats accumulate
  if (avgLatencies.length === 0) return "openai/gpt-4o-mini";
  return avgLatencies[0].model;
}

// Update after each call
function recordLatency(model: string, latencyMs: number) {
  if (!latencyStats[model]) latencyStats[model] = [];
  latencyStats[model].push(latencyMs);
  // Keep last 100 samples
  if (latencyStats[model].length > 100) {
    latencyStats[model].shift();
  }
}
```
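
To keep the stats fresh, wrap each call so latency is measured and fed back into `recordLatency`. A sketch reusing the `openrouter` provider and `generateText` from the OpenRouter section:

```typescript
// Route by current latency stats, then record this call's latency
async function timedGenerate(prompt: string): Promise<string> {
  const model = routeByLatency();
  const start = Date.now();
  try {
    const { text } = await generateText({ model: openrouter(model), prompt });
    return text;
  } finally {
    recordLatency(model, Date.now() - start);
  }
}
```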

### 3. Task-Based Routing

```typescript
const taskModels = {
  coding: "anthropic/claude-3-5-sonnet",  // Best for code
  reasoning: "openai/o1-preview",          // Best for logic
  creative: "anthropic/claude-3-5-sonnet", // Best for writing
  simple: "openai/gpt-4o-mini",            // Cheap and fast
  multimodal: "google/gemini-pro-1.5",     // Vision + text
};

function routeByTask(task: keyof typeof taskModels): string {
  return taskModels[task];
}
```

### 4. Hybrid Routing

```typescript
interface ModelInfo {
  id: string;
  cost: number;        // $ per 1M input tokens
  avgLatency: number;  // rolling average, in ms
}

interface RoutingConfig {
  task: string;
  maxCost: number;
  maxLatency: number;
}

// `models` is your model catalog; `getTaskScore` is an app-specific quality score
declare const models: ModelInfo[];
declare function getTaskScore(modelId: string, task: string): number;

function hybridRoute(config: RoutingConfig): string {
  // Filter by cost
  const affordable = models.filter(m => m.cost <= config.maxCost);

  // Filter by latency
  const fast = affordable.filter(m => m.avgLatency <= config.maxLatency);

  // Score the remaining candidates for the task and pick the best
  const ranked = fast
    .map(m => ({ model: m.id, score: getTaskScore(m.id, config.task) }))
    .sort((a, b) => b.score - a.score);

  if (ranked.length === 0) throw new Error("No model satisfies the constraints");
  return ranked[0].model;
}
```
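
Usage might look like the following, reusing the AI SDK provider from the OpenRouter section; the cost and latency numbers are illustrative constraints, not benchmarks:

```typescript
// Interactive coding assistant: cap cost at $5/1M tokens and latency at 2s
const model = hybridRoute({ task: "coding", maxCost: 5, maxLatency: 2000 });
const { text } = await generateText({ model: openrouter(model), prompt: "Refactor this function" });
```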

## Best Practices

### 1. Always Have Fallbacks

```typescript
// Bad: Single point of failure (one provider, no fallback)
const response = await openai.chat({ model: "gpt-4o", messages });

// Good: Fallback chain through the gateway
async function chatWithFallback(messages: Message[]) {
  const models = ["gpt-4o", "claude-3-5-sonnet", "gemini-pro"];
  for (const model of models) {
    try {
      return await gateway.chat({ model, messages });
    } catch (e) {
      continue;  // try the next model
    }
  }
  throw new Error("All models failed");
}
```

### 2. Pin Model Versions

```typescript
// Bad: Model can change
const model = "gpt-4";

// Good: Pinned version
const model = "openai/gpt-4-0125-preview";
```

### 3. Track Costs

```typescript
// Log every call
async function trackedCall(model: string, messages: Message[]) {
  const start = Date.now();
  const response = await gateway.chat({ model, messages });
  const latency = Date.now() - start;

  await analytics.track("llm_call", {
    model,
    inputTokens: response.usage.prompt_tokens,
    outputTokens: response.usage.completion_tokens,
    cost: calculateCost(model, response.usage),
    latency,
  });

  return response;
}
```
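
`calculateCost` is app-specific. A minimal sketch using a hard-coded price table; the per-1M-token rates are illustrative (based on the rough figures earlier in this document) and real pricing varies by model and over time:

```typescript
// Illustrative per-1M-token rates, keyed by model ID
const pricing: Record<string, { input: number; output: number }> = {
  "openai/gpt-4o-mini": { input: 0.15, output: 0.6 },
  "anthropic/claude-3-5-sonnet": { input: 3, output: 15 },
};

function calculateCost(
  model: string,
  usage: { prompt_tokens: number; completion_tokens: number },
): number {
  const rates = pricing[model];
  if (!rates) return 0;  // unknown model: skip cost attribution
  return (
    (usage.prompt_tokens / 1_000_000) * rates.input +
    (usage.completion_tokens / 1_000_000) * rates.output
  );
}
```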

### 4. Set Token Limits

```typescript
// Prevent runaway costs
const response = await gateway.chat({
  model,
  messages,
  max_tokens: 500,  // Limit output length
});
```

### 5. Use Caching

```yaml
# LiteLLM caching (config.yaml)
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: localhost
    port: 6379
    ttl: 3600  # 1 hour
```

## References

- `references/openrouter-guide.md` - OpenRouter deep dive
- `references/litellm-guide.md` - LiteLLM self-hosting
- `references/routing-strategies.md` - Advanced routing patterns
- `references/alternatives.md` - Helicone, Portkey, etc.

## Templates

- `templates/openrouter-config.ts` - TypeScript OpenRouter setup
- `templates/litellm-config.yaml` - LiteLLM proxy config
- `templates/fallback-chain.ts` - Fallback implementation

Overview

This skill configures an LLM gateway and model-routing layer using OpenRouter and LiteLLM. It unifies multi-model access and implements fallbacks, cost- or latency-based routing, A/B testing, and optional self-hosted proxying. The goal is reliable, observable, and cost-aware LLM access for production applications.

How this skill works

The skill wires a single gateway API in front of many models (OpenAI, Anthropic, Google, Meta, etc.) and implements routing policies that select a model per request. It supports OpenRouter for fast multi-provider setup and LiteLLM for self-hosted control, including fallback chains, load balancing, caching, rate limits, and analytics hooks. Routing strategies include cost-based, latency-based, task-based, and hybrid decision logic.

When to use it

  • You need unified access to many model providers without vendor lock-in
  • You want automatic fallbacks and higher availability for production LLM calls
  • You are optimizing for cost or latency across different query types
  • You want to run a self-hosted LLM proxy for data control and auditability
  • You plan to A/B test models or route by user/task for experimentation

Best practices

  • Always configure fallback chains to avoid single points of failure
  • Pin model versions to avoid unexpected behavior when providers update models
  • Track call-level metrics (tokens, latency, cost) and ingest into analytics
  • Set token and response limits to prevent runaway costs
  • Use caching and rate limiting in the gateway to reduce spend and improve latency

Example use cases

  • Route short, simple queries to a cheap fast model and complex prompts to a premium model
  • Deploy a self-hosted LiteLLM proxy with Redis caching and per-user rate limits
  • Use OpenRouter to A/B test Claude vs GPT variants and capture model-specific metrics
  • Implement latency-based routing to pick the lowest-latency provider for interactive apps
  • Combine task-based and cost constraints in a hybrid router for production workloads

FAQ

When should I pick OpenRouter vs LiteLLM?

Choose OpenRouter for fastest multi-provider setup and centralized billing. Choose LiteLLM when you need self-hosting, full data control, or custom deployment and routing logic.

How do I ensure calls stay within budget?

Implement cost-based routing, set token limits, track per-call costs in analytics, and enforce monthly budgets or quotas at the gateway level.