
llm-gateway-routing skill

/skills/llm-gateway-routing

This skill helps configure multi-model access, routing, fallbacks, and A/B testing for LLMs using OpenRouter and LiteLLM to improve reliability and control costs.

npx playbooks add skill phrazzld/claude-config --skill llm-gateway-routing

Review the files below or copy the command above to add this skill to your agents.

Files (8)
SKILL.md
10.4 KB
---
name: llm-gateway-routing
description: |
  LLM gateway and routing configuration using OpenRouter and LiteLLM.
  Invoke when:
  - Setting up multi-model access (OpenRouter, LiteLLM)
  - Configuring model fallbacks and reliability
  - Implementing cost-based or latency-based routing
  - A/B testing different models
  - Self-hosting an LLM proxy
  Keywords: openrouter, litellm, llm gateway, model routing, fallback, A/B testing
effort: high
---

# LLM Gateway & Routing

Configure multi-model access, fallbacks, cost optimization, and A/B testing.

## Why Use a Gateway?

**Without gateway:**
- Vendor lock-in (one provider)
- No fallbacks (provider down = app down)
- Hard to A/B test models
- Scattered API keys and configs

**With gateway:**
- Single API for 400+ models
- Automatic fallbacks
- Easy model switching
- Unified cost tracking

## Quick Decision

| Need | Solution |
|------|----------|
| Fastest setup, multi-model | **OpenRouter** |
| Full control, self-hosted | **LiteLLM** |
| Observability + routing | **Helicone** |
| Enterprise, guardrails | **Portkey** |

## OpenRouter (Recommended)

### Why OpenRouter

- **400+ models**: OpenAI, Anthropic, Google, Meta, Mistral, and more
- **Single API**: One key for all providers
- **Automatic fallbacks**: Built-in reliability
- **A/B testing**: Easy model comparison
- **Cost tracking**: Unified billing dashboard
- **Free credits**: $1 free to start

### Setup

```bash
# 1. Sign up at openrouter.ai
# 2. Get API key from dashboard
# 3. Add to .env:
OPENROUTER_API_KEY=sk-or-v1-...
```

### Basic Usage

```typescript
// Using fetch
const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.OPENROUTER_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'anthropic/claude-3-5-sonnet',
    messages: [{ role: 'user', content: 'Hello!' }],
  }),
});
```
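
Continuing the example above: the response body follows the OpenAI-compatible chat completions shape, so the generated text can be read from `choices[0].message.content`.

```typescript
// OpenRouter returns an OpenAI-compatible payload
const data = await response.json();
const text = data.choices[0].message.content;
console.log(text);
```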

### With Vercel AI SDK (Recommended)

```typescript
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

const openrouter = createOpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

const { text } = await generateText({
  model: openrouter("anthropic/claude-3-5-sonnet"),
  prompt: "Explain quantum computing",
});
```
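
For chat UIs, the same provider works with the AI SDK's `streamText`. A minimal sketch, assuming AI SDK v4+ (where `streamText` returns its result synchronously) and the `openrouter` provider defined above:

```typescript
import { streamText } from "ai";

// Stream tokens as they arrive instead of waiting for the full completion
const result = streamText({
  model: openrouter("anthropic/claude-3-5-sonnet"),
  prompt: "Explain quantum computing",
});

for await (const chunk of result.textStream) {
  process.stdout.write(chunk);
}
```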

### Model IDs

```typescript
// Format: provider/model-name
const models = {
  // Anthropic
  claude35Sonnet: "anthropic/claude-3-5-sonnet",
  claudeHaiku: "anthropic/claude-3-5-haiku",

  // OpenAI
  gpt4o: "openai/gpt-4o",
  gpt4oMini: "openai/gpt-4o-mini",

  // Google
  geminiPro: "google/gemini-pro-1.5",
  geminiFlash: "google/gemini-flash-1.5",

  // Meta
  llama3: "meta-llama/llama-3.1-70b-instruct",

  // Auto (OpenRouter picks best)
  auto: "openrouter/auto",
};
```

### Fallback Chains

```typescript
// Define fallback order
const modelChain = [
  "anthropic/claude-3-5-sonnet",   // Primary
  "openai/gpt-4o",                  // Fallback 1
  "google/gemini-pro-1.5",          // Fallback 2
];

async function callWithFallback(prompt: string) {
  for (const model of modelChain) {
    try {
      // Reuses the `openrouter` provider and `generateText` from the AI SDK example above
      return await generateText({ model: openrouter(model), prompt });
    } catch (error) {
      console.warn(`${model} failed, trying next...`);
    }
  }
  throw new Error("All models failed");
}
```
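
OpenRouter can also handle fallbacks server-side. A minimal sketch, assuming the request-level `models` parameter from OpenRouter's model-routing docs, which tries the listed models in order:

```typescript
// Server-side fallback: OpenRouter attempts each listed model until one succeeds
const response = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    models: ["anthropic/claude-3-5-sonnet", "openai/gpt-4o", "google/gemini-pro-1.5"],
    messages: [{ role: "user", content: "Hello!" }],
  }),
});
```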

### Cost Routing

```typescript
// Route based on query complexity
function selectModel(query: string): string {
  const complexity = analyzeComplexity(query);

  if (complexity === "simple") {
    // Simple queries → cheap model
    return "openai/gpt-4o-mini";  // ~$0.15/1M tokens
  } else if (complexity === "medium") {
    // Medium → balanced
    return "google/gemini-flash-1.5";  // ~$0.075/1M tokens
  } else {
    // Complex → best quality
    return "anthropic/claude-3-5-sonnet";  // ~$3/1M tokens
  }
}

function analyzeComplexity(query: string): "simple" | "medium" | "complex" {
  // Simple heuristics
  if (query.length < 50) return "simple";
  if (query.includes("explain") || query.includes("analyze")) return "complex";
  return "medium";
}
```
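
Wiring this into the AI SDK setup from earlier might look like the sketch below; `answer` is a hypothetical helper that returns the chosen model alongside the text so it can be logged.

```typescript
// Pick a model per query, then call it through the `openrouter` provider defined above
async function answer(query: string) {
  const model = selectModel(query);
  const { text } = await generateText({
    model: openrouter(model),
    prompt: query,
  });
  return { model, text };
}
```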

### A/B Testing

```typescript
// Deterministic assignment: hash the user ID so each user always gets the same model
function hashUser(userId: string): number {
  let hash = 0;
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return hash;
}

function getModel(userId: string): string {
  return hashUser(userId) % 100 < 50
    ? "anthropic/claude-3-5-sonnet"  // ~50% of users
    : "openai/gpt-4o";               // ~50% of users
}

// Track which model was used (analytics client is app-specific)
const model = getModel(userId);
const start = Date.now();
const { text, usage } = await generateText({ model: openrouter(model), prompt });
await analytics.track("llm_call", { model, userId, latencyMs: Date.now() - start, usage });
```
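
A weighted split is often more useful than 50/50, for example to canary a new model on a small slice of traffic. A sketch reusing the `hashUser` helper above; the weights and the `experiment` name are illustrative:

```typescript
// Weighted assignment: 90% control, 10% canary (illustrative weights)
const experiment = [
  { model: "anthropic/claude-3-5-sonnet", weight: 0.9 },
  { model: "openai/gpt-4o", weight: 0.1 },
];

function getWeightedModel(userId: string): string {
  const bucket = (hashUser(userId) % 1000) / 1000;  // deterministic value in [0, 1)
  let cumulative = 0;
  for (const arm of experiment) {
    cumulative += arm.weight;
    if (bucket < cumulative) return arm.model;
  }
  return experiment[experiment.length - 1].model;
}
```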

## LiteLLM (Self-Hosted)

### Why LiteLLM

- **Self-hosted**: Full control over data
- **100+ providers**: Same coverage as OpenRouter
- **Load balancing**: Distribute across providers
- **Cost tracking**: Built-in spend management
- **Caching**: Redis or in-memory
- **Rate limiting**: Per-user limits

### Setup

```bash
# Install
pip install 'litellm[proxy]'

# Run proxy
litellm --config config.yaml

# Use as OpenAI-compatible endpoint
export OPENAI_API_BASE=http://localhost:4000
```
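
A quick way to verify the proxy is up is to hit its OpenAI-compatible endpoint directly. A sketch; the master key and model alias come from your config.yaml:

```bash
# Smoke-test the proxy with an OpenAI-compatible request
curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-master-..." \
  -H "Content-Type: application/json" \
  -d '{"model": "claude-sonnet", "messages": [{"role": "user", "content": "Hello!"}]}'
```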

### Configuration

```yaml
# config.yaml
model_list:
  # Claude models
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
      api_key: sk-ant-...

  # OpenAI models
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-...

  # Load balanced: repeat the same model_name across providers and
  # the router distributes requests between the deployments
  - model_name: balanced
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
      api_key: sk-ant-...
  - model_name: balanced
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-...

# General settings
general_settings:
  master_key: sk-master-...
  database_url: postgresql://...

# Routing
router_settings:
  routing_strategy: simple-shuffle  # or latency-based-routing
  num_retries: 3
  timeout: 30

# Budget controls
litellm_settings:
  max_budget: 100  # $100/month
  budget_duration: monthly
```

### Fallbacks in LiteLLM

```yaml
model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
      api_key: sk-ant-...
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-...
  - model_name: gemini-pro
    litellm_params:
      model: gemini/gemini-1.5-pro
      api_key: ...

# Fallbacks map a primary model alias to an ordered list of backups
router_settings:
  fallbacks: [{"claude-sonnet": ["gpt-4o", "gemini-pro"]}]
```

### Usage

```typescript
// Use like OpenAI SDK
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:4000",
  apiKey: "sk-master-...",
});

const response = await client.chat.completions.create({
  model: "claude-sonnet",  // Maps to configured model
  messages: [{ role: "user", content: "Hello!" }],
});
```

## Routing Strategies

### 1. Cost-Based Routing

```typescript
const costTiers = {
  cheap: ["openai/gpt-4o-mini", "google/gemini-flash-1.5"],
  balanced: ["anthropic/claude-3-5-haiku", "openai/gpt-4o"],
  premium: ["anthropic/claude-3-5-sonnet", "openai/o1-preview"],
};

function routeByCost(budget: "cheap" | "balanced" | "premium"): string {
  const models = costTiers[budget];
  return models[Math.floor(Math.random() * models.length)];
}
```

### 2. Latency-Based Routing

```typescript
// Track latency per model
const latencyStats: Record<string, number[]> = {};

function routeByLatency(): string {
  const avgLatencies = Object.entries(latencyStats)
    .map(([model, times]) => ({
      model,
      avg: times.reduce((a, b) => a + b, 0) / times.length,
    }))
    .sort((a, b) => a.avg - b.avg);

  // No samples yet: fall back to a cheap default until stats accumulate
  if (avgLatencies.length === 0) return "openai/gpt-4o-mini";
  return avgLatencies[0].model;
}

// Update after each call
function recordLatency(model: string, latencyMs: number) {
  if (!latencyStats[model]) latencyStats[model] = [];
  latencyStats[model].push(latencyMs);
  // Keep last 100 samples
  if (latencyStats[model].length > 100) {
    latencyStats[model].shift();
  }
}
```
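
To keep the stats fresh, wrap each call so latency is measured and fed back into `recordLatency`. A sketch reusing the `openrouter` provider and `generateText` from the OpenRouter section:

```typescript
// Route by current latency stats, then record this call's latency
async function timedGenerate(prompt: string): Promise<string> {
  const model = routeByLatency();
  const start = Date.now();
  try {
    const { text } = await generateText({ model: openrouter(model), prompt });
    return text;
  } finally {
    recordLatency(model, Date.now() - start);
  }
}
```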

### 3. Task-Based Routing

```typescript
const taskModels = {
  coding: "anthropic/claude-3-5-sonnet",  // Best for code
  reasoning: "openai/o1-preview",          // Best for logic
  creative: "anthropic/claude-3-5-sonnet", // Best for writing
  simple: "openai/gpt-4o-mini",            // Cheap and fast
  multimodal: "google/gemini-pro-1.5",     // Vision + text
};

function routeByTask(task: keyof typeof taskModels): string {
  return taskModels[task];
}
```

### 4. Hybrid Routing

```typescript
interface ModelInfo {
  id: string;
  cost: number;        // $ per 1M input tokens
  avgLatency: number;  // rolling average, in ms
}

interface RoutingConfig {
  task: string;
  maxCost: number;
  maxLatency: number;
}

// `models` is your model catalog; `getTaskScore` is an app-specific quality score
declare const models: ModelInfo[];
declare function getTaskScore(modelId: string, task: string): number;

function hybridRoute(config: RoutingConfig): string {
  // Filter by cost
  const affordable = models.filter(m => m.cost <= config.maxCost);

  // Filter by latency
  const fast = affordable.filter(m => m.avgLatency <= config.maxLatency);

  // Score the remaining candidates for the task and pick the best
  const ranked = fast
    .map(m => ({ model: m.id, score: getTaskScore(m.id, config.task) }))
    .sort((a, b) => b.score - a.score);

  if (ranked.length === 0) throw new Error("No model satisfies the constraints");
  return ranked[0].model;
}
```
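
Usage might look like the following, reusing the AI SDK provider from the OpenRouter section; the cost and latency numbers are illustrative constraints, not benchmarks:

```typescript
// Interactive coding assistant: cap cost at $5/1M tokens and latency at 2s
const model = hybridRoute({ task: "coding", maxCost: 5, maxLatency: 2000 });
const { text } = await generateText({ model: openrouter(model), prompt: "Refactor this function" });
```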

## Best Practices

### 1. Always Have Fallbacks

```typescript
// Bad: Single point of failure (one provider, no fallback)
const response = await openai.chat({ model: "gpt-4o", messages });

// Good: Fallback chain through the gateway
async function chatWithFallback(messages: Message[]) {
  const models = ["gpt-4o", "claude-3-5-sonnet", "gemini-pro"];
  for (const model of models) {
    try {
      return await gateway.chat({ model, messages });
    } catch (e) {
      continue;  // try the next model
    }
  }
  throw new Error("All models failed");
}
```

### 2. Pin Model Versions

```typescript
// Bad: Model can change
const model = "gpt-4";

// Good: Pinned version
const model = "openai/gpt-4-0125-preview";
```

### 3. Track Costs

```typescript
// Log every call
async function trackedCall(model: string, messages: Message[]) {
  const start = Date.now();
  const response = await gateway.chat({ model, messages });
  const latency = Date.now() - start;

  await analytics.track("llm_call", {
    model,
    inputTokens: response.usage.prompt_tokens,
    outputTokens: response.usage.completion_tokens,
    cost: calculateCost(model, response.usage),
    latency,
  });

  return response;
}
```
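
`calculateCost` is app-specific. A minimal sketch using a hard-coded price table; the per-1M-token rates are illustrative (based on the rough figures earlier in this document) and real pricing varies by model and over time:

```typescript
// Illustrative per-1M-token rates, keyed by model ID
const pricing: Record<string, { input: number; output: number }> = {
  "openai/gpt-4o-mini": { input: 0.15, output: 0.6 },
  "anthropic/claude-3-5-sonnet": { input: 3, output: 15 },
};

function calculateCost(
  model: string,
  usage: { prompt_tokens: number; completion_tokens: number },
): number {
  const rates = pricing[model];
  if (!rates) return 0;  // unknown model: skip cost attribution
  return (
    (usage.prompt_tokens / 1_000_000) * rates.input +
    (usage.completion_tokens / 1_000_000) * rates.output
  );
}
```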

### 4. Set Token Limits

```typescript
// Prevent runaway costs
const response = await gateway.chat({
  model,
  messages,
  max_tokens: 500,  // Limit output length
});
```

### 5. Use Caching

```yaml
# LiteLLM caching (config.yaml)
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: localhost
    port: 6379
    ttl: 3600  # 1 hour
```

## References

- `references/openrouter-guide.md` - OpenRouter deep dive
- `references/litellm-guide.md` - LiteLLM self-hosting
- `references/routing-strategies.md` - Advanced routing patterns
- `references/alternatives.md` - Helicone, Portkey, etc.

## Templates

- `templates/openrouter-config.ts` - TypeScript OpenRouter setup
- `templates/litellm-config.yaml` - LiteLLM proxy config
- `templates/fallback-chain.ts` - Fallback implementation

Overview

This skill configures an LLM gateway and model-routing layer using OpenRouter and LiteLLM. It unifies multi-model access and implements fallbacks, cost- or latency-based routing, A/B testing, and optional self-hosted proxying. The goal is reliable, observable, and cost-aware LLM access for production applications.

How this skill works

The skill wires a single gateway API in front of many models (OpenAI, Anthropic, Google, Meta, etc.) and implements routing policies that select a model per request. It supports OpenRouter for fast multi-provider setup and LiteLLM for self-hosted control, including fallback chains, load balancing, caching, rate limits, and analytics hooks. Routing strategies include cost-based, latency-based, task-based, and hybrid decision logic.

When to use it

  • You need unified access to many model providers without vendor lock-in
  • You want automatic fallbacks and higher availability for production LLM calls
  • You are optimizing for cost or latency across different query types
  • You want to run a self-hosted LLM proxy for data control and auditability
  • You plan to A/B test models or route by user/task for experimentation

Best practices

  • Always configure fallback chains to avoid single points of failure
  • Pin model versions to avoid unexpected behavior when providers update models
  • Track call-level metrics (tokens, latency, cost) and ingest into analytics
  • Set token and response limits to prevent runaway costs
  • Use caching and rate limiting in the gateway to reduce spend and improve latency

Example use cases

  • Route short, simple queries to a cheap fast model and complex prompts to a premium model
  • Deploy a self-hosted LiteLLM proxy with Redis caching and per-user rate limits
  • Use OpenRouter to A/B test Claude vs GPT variants and capture model-specific metrics
  • Implement latency-based routing to pick the lowest-latency provider for interactive apps
  • Combine task-based and cost constraints in a hybrid router for production workloads

FAQ

When should I pick OpenRouter vs LiteLLM?

Choose OpenRouter for fastest multi-provider setup and centralized billing. Choose LiteLLM when you need self-hosting, full data control, or custom deployment and routing logic.

How do I ensure calls stay within budget?

Implement cost-based routing, set token limits, track per-call costs in analytics, and enforce monthly budgets or quotas at the gateway level.