This skill reduces LLM costs by caching stable prompt prefixes across repeated requests, using explicit cache breakpoints for Claude and automatic prefix caching for OpenAI.
```
npx playbooks add skill yonatangross/orchestkit --skill prompt-caching
```
---
name: prompt-caching
description: Provider-native prompt caching for Claude and OpenAI. Use when optimizing LLM costs with cache breakpoints, caching system prompts, or reducing token costs for repeated prefixes.
tags: [llm, caching, cost-optimization, anthropic]
context: fork
agent: llm-integrator
version: 1.0.0
author: OrchestKit
user-invocable: false
---
# Prompt Caching
Cache stable LLM prompt prefixes to cut input-token costs by up to 90% on cache hits.
## Supported Models (2026)
| Provider | Models |
|----------|--------|
| Claude | Opus 4.1, Opus 4, Sonnet 4.5, Sonnet 4, Sonnet 3.7, Haiku 4.5, Haiku 3.5, Haiku 3 |
| OpenAI | gpt-4o, gpt-4o-mini, o1, o1-mini (automatic caching) |
## Claude Prompt Caching
```python
def build_cached_messages(
    system_prompt: str,
    few_shot_examples: str | None,
    user_content: str,
    use_extended_cache: bool = False,
) -> list[dict]:
    """Build messages with cache breakpoints.

    Cache structure (processing order: tools → system → messages):
      1. System prompt (cached)
      2. Few-shot examples (cached)
      ─────── CACHE BREAKPOINT ───────
      3. User content (NOT cached)
    """
    # TTL: "5m" (default, 1.25x write cost) or "1h" (extended, 2x write cost)
    ttl = "1h" if use_extended_cache else "5m"
    content_parts = []

    # Breakpoint 1: system prompt
    content_parts.append({
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral", "ttl": ttl},
    })

    # Breakpoint 2: few-shot examples (up to 4 breakpoints allowed per request)
    if few_shot_examples:
        content_parts.append({
            "type": "text",
            "text": few_shot_examples,
            "cache_control": {"type": "ephemeral", "ttl": ttl},
        })

    # Dynamic content (NOT cached)
    content_parts.append({
        "type": "text",
        "text": user_content,
    })

    return [{"role": "user", "content": content_parts}]
```
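A minimal usage sketch, assuming the official `anthropic` Python SDK; the model id, prompt variables, and user text below are illustrative placeholders, not part of the skill:
```python
# Usage sketch (assumption: `anthropic` SDK installed, ANTHROPIC_API_KEY set).
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

SYSTEM_PROMPT = "..."       # long, stable instructions (>= 1,024 tokens to be cacheable)
FEW_SHOT_EXAMPLES = "..."   # stable examples, or None

messages = build_cached_messages(
    system_prompt=SYSTEM_PROMPT,
    few_shot_examples=FEW_SHOT_EXAMPLES,
    user_content="Summarize the attached report.",  # changes per request
)

response = await client.messages.create(
    model="claude-sonnet-4-5",  # any caching-capable model from the table above
    max_tokens=1024,
    messages=messages,
)

# The usage block shows whether the prefix was written to or read from cache.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```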
## Cache Pricing (2026)
```
┌─────────────────────────────────────────────────────────────┐
│ Cache Cost Multipliers (relative to base input price) │
├─────────────────────────────────────────────────────────────┤
│ 5-minute cache write: 1.25x base input price │
│ 1-hour cache write: 2.00x base input price │
│ Cache read: 0.10x base input price (90% off!) │
└─────────────────────────────────────────────────────────────┘
Example: Claude Sonnet 4 @ $3/MTok input
Without Prompt Caching:
System prompt: 2,000 tokens @ $3/MTok = $0.006
Few-shot examples: 5,000 tokens @ $3/MTok = $0.015
User content: 10,000 tokens @ $3/MTok = $0.030
───────────────────────────────────────────────────
Total: 17,000 tokens = $0.051
With 5m Caching (first request = cache write):
Cached prefix: 7,000 tokens @ $3.75/MTok = $0.02625 (1.25x)
User content: 10,000 tokens @ $3/MTok = $0.03000
Total first req: = $0.05625
With 5m Caching (subsequent = cache read):
Cached prefix: 7,000 tokens @ $0.30/MTok = $0.0021 (0.1x)
User content: 10,000 tokens @ $3/MTok = $0.0300
Total cached req: = $0.0321
Savings: 37% per cached request, break-even after 2 requests
```
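The arithmetic above generalizes into a small break-even check. A sketch that reproduces the example numbers, assuming the 1.25x write / 0.1x read multipliers shown in the table:
```python
# Reproduces the pricing example above (multipliers and $3/MTok base are from the table).
def request_cost(prefix_tokens: int, dynamic_tokens: int,
                 base_price_per_mtok: float,
                 prefix_multiplier: float = 1.0) -> float:
    """Cost of one request, applying a cache multiplier to the prefix tokens."""
    prefix = prefix_tokens / 1e6 * base_price_per_mtok * prefix_multiplier
    dynamic = dynamic_tokens / 1e6 * base_price_per_mtok
    return prefix + dynamic

BASE = 3.0  # Claude Sonnet 4 input price, $/MTok

uncached  = request_cost(7_000, 10_000, BASE)                          # $0.05100
first_req = request_cost(7_000, 10_000, BASE, prefix_multiplier=1.25)  # $0.05625 (5m cache write)
cached    = request_cost(7_000, 10_000, BASE, prefix_multiplier=0.10)  # $0.03210 (cache read)

print(f"savings per cached request: {1 - cached / uncached:.0%}")      # ~37%
```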
## Extended Cache (1-hour TTL)
Use 1-hour cache when:
- Prompt reused > 10 times per hour
- System prompts are highly stable
- Token count > 10k (maximize savings)
```python
# Extended cache: 2x write cost, but the entry persists 12x longer (1h vs 5m)
"cache_control": {"type": "ephemeral", "ttl": "1h"}
# Break-even vs. no caching: the write costs an extra 1.0x base (2.0x vs 1.0x)
# and each read saves 0.9x base, so the 1h cache pays for itself after ~2 reads.
```
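For a feel of when the 1h TTL beats the default, a hedged sketch that assumes requests arrive more than 5 minutes apart (so each 5m-cached request pays a fresh 1.25x write); the multipliers are the ones listed above:
```python
# Prefix-only hourly cost: repeated 5m writes vs. one 1h write plus reads.
# Assumption: gaps between requests exceed 5 minutes, so the 5m cache misses each time.
def hourly_prefix_cost_5m(requests_per_hour: int, prefix_cost: float) -> float:
    return requests_per_hour * 1.25 * prefix_cost

def hourly_prefix_cost_1h(requests_per_hour: int, prefix_cost: float) -> float:
    return (2.0 + (requests_per_hour - 1) * 0.10) * prefix_cost

prefix_cost = 7_000 / 1e6 * 3.0  # the 7k-token prefix from the pricing example
for n in (2, 5, 10):
    print(n, round(hourly_prefix_cost_5m(n, prefix_cost), 4),
             round(hourly_prefix_cost_1h(n, prefix_cost), 4))
# With requests spaced closer than 5 minutes the 5m cache typically stays warm,
# and the cheaper 1.25x write is usually the better default.
```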
## OpenAI Automatic Caching
```python
# OpenAI caches long, repeated prefixes automatically (no cache_control markers)
from openai import AsyncOpenAI

client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},  # stable prefix → cached
        {"role": "user", "content": user_content},     # variable → not cached
    ],
)

# Cached prompt tokens are reported in usage.prompt_tokens_details
cache_tokens = response.usage.prompt_tokens_details.cached_tokens
```
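A small follow-up for tracking the cached share of the prompt; the field path reflects the current Chat Completions usage object, so treat it as an assumption if your SDK version differs:
```python
# Fraction of prompt tokens served from OpenAI's automatic prefix cache.
def openai_cached_fraction(response) -> float:
    usage = response.usage
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    return cached / usage.prompt_tokens if usage.prompt_tokens else 0.0

print(f"cached share of prompt: {openai_cached_fraction(response):.1%}")
```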
## Cache Processing Order
```
The cache prefix covers the prompt in this order:
1. Tools (cached first)
2. System messages (cached second)
3. User messages (cached last)
⚠️ Extended thinking changes invalidate message caches
(but NOT system/tools caches)
```
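A hedged sketch of how that order maps onto an Anthropic request: `cache_control` on the last tool and on the system block caches everything up to each marker. The tool schema, model id, and variables are illustrative placeholders:
```python
# Illustrates the tools → system → messages order; placeholders throughout.
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

response = await client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=[
        {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "input_schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
            # Marker on the last tool caches all tool definitions.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,  # stable instructions
            # Marker here caches tools + system as one prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_content}],  # dynamic, not cached
)
```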
## Monitoring Cache Effectiveness
```python
# Track these fields in API response
response = await client.messages.create(...)
cache_created = response.usage.cache_creation_input_tokens # New cache
cache_read = response.usage.cache_read_input_tokens # Cache hit
regular = response.usage.input_tokens # Not cached
# Calculate cache hit rate
if cache_created + cache_read > 0:
hit_rate = cache_read / (cache_created + cache_read)
print(f"Cache hit rate: {hit_rate:.1%}")
```
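To turn those usage fields into dollar figures, a sketch using the multipliers from the pricing section; the base price and the 5m-write assumption are inputs you supply:
```python
# Estimated saving vs. sending the same tokens uncached.
# Assumes 5m cache writes (1.25x); pass 2.0 for 1h writes.
def estimate_cache_savings(usage, base_price_per_mtok: float,
                           write_multiplier: float = 1.25) -> float:
    write = usage.cache_creation_input_tokens or 0
    read = usage.cache_read_input_tokens or 0
    actual = (write * write_multiplier + read * 0.10) / 1e6 * base_price_per_mtok
    uncached = (write + read) / 1e6 * base_price_per_mtok
    return uncached - actual  # negative on a cold write, positive on cache hits

print(f"saved ${estimate_cache_savings(response.usage, 3.0):.4f} on this request")
```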
## Best Practices
```python
# ✅ Good: long, stable prefix first
messages = [
    {"role": "system", "content": LONG_SYSTEM_PROMPT},
    {"role": "user", "content": FEW_SHOT_EXAMPLES},
    {"role": "user", "content": user_input},  # variable content last
]

# ❌ Bad: variable content early (breaks prefix matching)
messages = [
    {"role": "user", "content": user_input},  # breaks the cache!
    {"role": "system", "content": LONG_SYSTEM_PROMPT},
]
```
## Key Decisions
| Decision | Recommendation |
|----------|----------------|
| Min prefix size | 1,024 tokens (Claude Opus/Sonnet; 2,048 for Haiku) |
| Breakpoint count | 2-4 per request |
| Content order | Stable prefix first |
| Default TTL | 5m for most cases |
| Extended TTL | 1h if >10 reads/hour |
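Those defaults can be encoded in a tiny helper; the thresholds come from the table above and are illustrative, not tuned:
```python
# Illustrative TTL picker using the thresholds from the decision table.
def choose_ttl(prefix_tokens: int, expected_reads_per_hour: int) -> str | None:
    if prefix_tokens < 1_024:
        return None   # below the minimum cacheable prefix: skip caching
    if expected_reads_per_hour > 10:
        return "1h"   # extended cache: 2x write cost, 12x lifetime
    return "5m"       # default: 1.25x write cost
```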
## Common Mistakes
- Variable content before cached prefix
- Too many breakpoints (overhead)
- Prefix too short (min 1024 tokens)
- Not checking `cache_read_input_tokens`
- Using 1h TTL for infrequent calls (wastes 2x write)
## Related Skills
- `semantic-caching` - Redis similarity caching
- `cache-cost-tracking` - Cost monitoring
- `llm-streaming` - Streaming with caching
## Capability Details
### anthropic-caching
**Keywords:** anthropic, claude, cache_control, ephemeral
**Solves:**
- Use Anthropic prompt caching
- Set cache breakpoints
- Reduce API costs
### openai-caching
**Keywords:** openai, gpt, cached_tokens, automatic
**Solves:**
- Use OpenAI prompt caching
- Structure prompts for cache hits
- Monitor cache effectiveness
### wrapper-template
**Keywords:** wrapper, template, implementation, python
**Solves:**
- Prompt cache wrapper template
- Python implementation
- Drop-in caching layer
This skill provides provider-native prompt caching for Claude and OpenAI to cut LLM token costs and speed up repeated requests. It implements cache breakpoints for system prompts and few-shot examples, supports short and extended TTLs, and exposes metrics to monitor cache effectiveness. Use it to reduce token spend on stable prefixes and to structure prompts for cache hits.
The skill marks stable prompt parts (system prompt, few‑shot examples, tools) with cache_control so the provider stores and reuses them. For Claude you choose TTLs (5m default or 1h extended) and insert cache breakpoints before dynamic user content. OpenAI’s newer models perform automatic prefix caching and report cached token counts in responses. The skill also surfaces usage fields for cache creation, cache reads, and non‑cached input tokens for hit‑rate calculations.
**How do I know when caching saved money?**
Compare `response.usage.cache_read_input_tokens` and `cache_creation_input_tokens`; compute the hit rate and convert cached tokens into dollar savings using the provider's price multipliers.

**When should I use the 1h TTL instead of 5m?**
Use 1h when a prefix is reused more than ~10 times per hour, the prompt is stable, and the token count is large (e.g., >10k), so the 2x write cost is justified.

**What breaks a cache hit?**
Changing any content within or before a cached prefix invalidates it, and placing variable content ahead of the prefix prevents matches. For Claude, extended thinking changes invalidate cached messages but not cached system prompts or tools.

**Does OpenAI require manual cache markers?**
No. The OpenAI models listed perform automatic prefix caching and report cached token usage in the response; you only need to structure messages with stable prefixes first.