This skill reduces LLM costs by caching stable prompt prefixes across repeated requests, using explicit cache breakpoints for Claude and automatic prefix caching for OpenAI.
```
npx playbooks add skill yonatangross/orchestkit --skill prompt-caching
```
---
name: prompt-caching
description: Provider-native prompt caching for Claude and OpenAI. Use when optimizing LLM costs with cache breakpoints, caching system prompts, or reducing token costs for repeated prefixes.
tags: [llm, caching, cost-optimization, anthropic]
context: fork
agent: llm-integrator
version: 1.0.0
author: OrchestKit
user-invocable: false
---
# Prompt Caching
Cache stable LLM prompt prefixes to cut input-token costs by up to 90% on cache hits.
## Supported Models (2026)
| Provider | Models |
|----------|--------|
| Claude | Opus 4.1, Opus 4, Sonnet 4.5, Sonnet 4, Sonnet 3.7, Haiku 4.5, Haiku 3.5, Haiku 3 |
| OpenAI | gpt-4o, gpt-4o-mini, o1, o1-mini (automatic caching) |
## Claude Prompt Caching
```python
def build_cached_messages(
    system_prompt: str,
    few_shot_examples: str | None,
    user_content: str,
    use_extended_cache: bool = False,
) -> list[dict]:
    """Build messages with cache breakpoints.

    Cache structure (processing order: tools → system → messages):
      1. System prompt (cached)
      2. Few-shot examples (cached)
      ─────── CACHE BREAKPOINT ───────
      3. User content (NOT cached)
    """
    # TTL: "5m" (default, 1.25x write cost) or "1h" (extended, 2x write cost)
    ttl = "1h" if use_extended_cache else "5m"
    content_parts = []

    # Breakpoint 1: system prompt
    content_parts.append({
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral", "ttl": ttl},
    })

    # Breakpoint 2: few-shot examples (up to 4 breakpoints allowed per request)
    if few_shot_examples:
        content_parts.append({
            "type": "text",
            "text": few_shot_examples,
            "cache_control": {"type": "ephemeral", "ttl": ttl},
        })

    # Dynamic content (NOT cached)
    content_parts.append({
        "type": "text",
        "text": user_content,
    })

    return [{"role": "user", "content": content_parts}]
```
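A minimal usage sketch, assuming the official `anthropic` Python SDK; the model id, prompt variables, and user text below are illustrative placeholders, not part of the skill:
```python
# Usage sketch (assumption: `anthropic` SDK installed, ANTHROPIC_API_KEY set).
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

SYSTEM_PROMPT = "..."       # long, stable instructions (>= 1,024 tokens to be cacheable)
FEW_SHOT_EXAMPLES = "..."   # stable examples, or None

messages = build_cached_messages(
    system_prompt=SYSTEM_PROMPT,
    few_shot_examples=FEW_SHOT_EXAMPLES,
    user_content="Summarize the attached report.",  # changes per request
)

response = await client.messages.create(
    model="claude-sonnet-4-5",  # any caching-capable model from the table above
    max_tokens=1024,
    messages=messages,
)

# The usage block shows whether the prefix was written to or read from cache.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```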
## Cache Pricing (2026)
```
┌─────────────────────────────────────────────────────────────┐
│ Cache Cost Multipliers (relative to base input price) │
├─────────────────────────────────────────────────────────────┤
│ 5-minute cache write: 1.25x base input price │
│ 1-hour cache write: 2.00x base input price │
│ Cache read: 0.10x base input price (90% off!) │
└─────────────────────────────────────────────────────────────┘
Example: Claude Sonnet 4 @ $3/MTok input
Without Prompt Caching:
System prompt: 2,000 tokens @ $3/MTok = $0.006
Few-shot examples: 5,000 tokens @ $3/MTok = $0.015
User content: 10,000 tokens @ $3/MTok = $0.030
───────────────────────────────────────────────────
Total: 17,000 tokens = $0.051
With 5m Caching (first request = cache write):
Cached prefix: 7,000 tokens @ $3.75/MTok = $0.02625 (1.25x)
User content: 10,000 tokens @ $3/MTok = $0.03000
Total first req: = $0.05625
With 5m Caching (subsequent = cache read):
Cached prefix: 7,000 tokens @ $0.30/MTok = $0.0021 (0.1x)
User content: 10,000 tokens @ $3/MTok = $0.0300
Total cached req: = $0.0321
Savings: 37% per cached request, break-even after 2 requests
```
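The arithmetic above generalizes into a small break-even check. A sketch that reproduces the example numbers, assuming the 1.25x write / 0.1x read multipliers shown in the table:
```python
# Reproduces the pricing example above (multipliers and $3/MTok base are from the table).
def request_cost(prefix_tokens: int, dynamic_tokens: int,
                 base_price_per_mtok: float,
                 prefix_multiplier: float = 1.0) -> float:
    """Cost of one request, applying a cache multiplier to the prefix tokens."""
    prefix = prefix_tokens / 1e6 * base_price_per_mtok * prefix_multiplier
    dynamic = dynamic_tokens / 1e6 * base_price_per_mtok
    return prefix + dynamic

BASE = 3.0  # Claude Sonnet 4 input price, $/MTok

uncached  = request_cost(7_000, 10_000, BASE)                          # $0.05100
first_req = request_cost(7_000, 10_000, BASE, prefix_multiplier=1.25)  # $0.05625 (5m cache write)
cached    = request_cost(7_000, 10_000, BASE, prefix_multiplier=0.10)  # $0.03210 (cache read)

print(f"savings per cached request: {1 - cached / uncached:.0%}")      # ~37%
```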
## Extended Cache (1-hour TTL)
Use 1-hour cache when:
- Prompt reused > 10 times per hour
- System prompts are highly stable
- Token count > 10k (maximize savings)
```python
# Extended cache: 2x write cost, but the entry persists 12x longer (1h vs 5m)
"cache_control": {"type": "ephemeral", "ttl": "1h"}
# Break-even vs. no caching: the write costs an extra 1.0x base (2.0x vs 1.0x)
# and each read saves 0.9x base, so the 1h cache pays for itself after ~2 reads.
```
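For a feel of when the 1h TTL beats the default, a hedged sketch that assumes requests arrive more than 5 minutes apart (so each 5m-cached request pays a fresh 1.25x write); the multipliers are the ones listed above:
```python
# Prefix-only hourly cost: repeated 5m writes vs. one 1h write plus reads.
# Assumption: gaps between requests exceed 5 minutes, so the 5m cache misses each time.
def hourly_prefix_cost_5m(requests_per_hour: int, prefix_cost: float) -> float:
    return requests_per_hour * 1.25 * prefix_cost

def hourly_prefix_cost_1h(requests_per_hour: int, prefix_cost: float) -> float:
    return (2.0 + (requests_per_hour - 1) * 0.10) * prefix_cost

prefix_cost = 7_000 / 1e6 * 3.0  # the 7k-token prefix from the pricing example
for n in (2, 5, 10):
    print(n, round(hourly_prefix_cost_5m(n, prefix_cost), 4),
             round(hourly_prefix_cost_1h(n, prefix_cost), 4))
# With requests spaced closer than 5 minutes the 5m cache typically stays warm,
# and the cheaper 1.25x write is usually the better default.
```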
## OpenAI Automatic Caching
```python
# OpenAI caches long, repeated prefixes automatically (no cache_control markers)
from openai import AsyncOpenAI

client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},  # stable prefix → cached
        {"role": "user", "content": user_content},     # variable → not cached
    ],
)

# Cached prompt tokens are reported in usage.prompt_tokens_details
cache_tokens = response.usage.prompt_tokens_details.cached_tokens
```
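A small follow-up for tracking the cached share of the prompt; the field path reflects the current Chat Completions usage object, so treat it as an assumption if your SDK version differs:
```python
# Fraction of prompt tokens served from OpenAI's automatic prefix cache.
def openai_cached_fraction(response) -> float:
    usage = response.usage
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    return cached / usage.prompt_tokens if usage.prompt_tokens else 0.0

print(f"cached share of prompt: {openai_cached_fraction(response):.1%}")
```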
## Cache Processing Order
```
The cache prefix covers the prompt in this order:
1. Tools (cached first)
2. System messages (cached second)
3. User messages (cached last)
⚠️ Extended thinking changes invalidate message caches
(but NOT system/tools caches)
```
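A hedged sketch of how that order maps onto an Anthropic request: `cache_control` on the last tool and on the system block caches everything up to each marker. The tool schema, model id, and variables are illustrative placeholders:
```python
# Illustrates the tools → system → messages order; placeholders throughout.
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

response = await client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=[
        {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "input_schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
            # Marker on the last tool caches all tool definitions.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,  # stable instructions
            # Marker here caches tools + system as one prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_content}],  # dynamic, not cached
)
```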
## Monitoring Cache Effectiveness
```python
# Track these fields in API response
response = await client.messages.create(...)
cache_created = response.usage.cache_creation_input_tokens # New cache
cache_read = response.usage.cache_read_input_tokens # Cache hit
regular = response.usage.input_tokens # Not cached
# Calculate cache hit rate
if cache_created + cache_read > 0:
hit_rate = cache_read / (cache_created + cache_read)
print(f"Cache hit rate: {hit_rate:.1%}")
```
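To turn those usage fields into dollar figures, a sketch using the multipliers from the pricing section; the base price and the 5m-write assumption are inputs you supply:
```python
# Estimated saving vs. sending the same tokens uncached.
# Assumes 5m cache writes (1.25x); pass 2.0 for 1h writes.
def estimate_cache_savings(usage, base_price_per_mtok: float,
                           write_multiplier: float = 1.25) -> float:
    write = usage.cache_creation_input_tokens or 0
    read = usage.cache_read_input_tokens or 0
    actual = (write * write_multiplier + read * 0.10) / 1e6 * base_price_per_mtok
    uncached = (write + read) / 1e6 * base_price_per_mtok
    return uncached - actual  # negative on a cold write, positive on cache hits

print(f"saved ${estimate_cache_savings(response.usage, 3.0):.4f} on this request")
```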
## Best Practices
```python
# ✅ Good: long, stable prefix first
messages = [
    {"role": "system", "content": LONG_SYSTEM_PROMPT},
    {"role": "user", "content": FEW_SHOT_EXAMPLES},
    {"role": "user", "content": user_input},  # variable content last
]

# ❌ Bad: variable content early (breaks prefix matching)
messages = [
    {"role": "user", "content": user_input},  # breaks the cache!
    {"role": "system", "content": LONG_SYSTEM_PROMPT},
]
```
## Key Decisions
| Decision | Recommendation |
|----------|----------------|
| Min prefix size | 1,024 tokens (Claude Opus/Sonnet; 2,048 for Haiku) |
| Breakpoint count | 2-4 per request |
| Content order | Stable prefix first |
| Default TTL | 5m for most cases |
| Extended TTL | 1h if >10 reads/hour |
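Those defaults can be encoded in a tiny helper; the thresholds come from the table above and are illustrative, not tuned:
```python
# Illustrative TTL picker using the thresholds from the decision table.
def choose_ttl(prefix_tokens: int, expected_reads_per_hour: int) -> str | None:
    if prefix_tokens < 1_024:
        return None   # below the minimum cacheable prefix: skip caching
    if expected_reads_per_hour > 10:
        return "1h"   # extended cache: 2x write cost, 12x lifetime
    return "5m"       # default: 1.25x write cost
```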
## Common Mistakes
- Variable content before cached prefix
- Too many breakpoints (overhead)
- Prefix too short (min 1024 tokens)
- Not checking `cache_read_input_tokens`
- Using 1h TTL for infrequent calls (wastes 2x write)
## Related Skills
- `semantic-caching` - Redis similarity caching
- `cache-cost-tracking` - Cost monitoring
- `llm-streaming` - Streaming with caching
## Capability Details
### anthropic-caching
**Keywords:** anthropic, claude, cache_control, ephemeral
**Solves:**
- Use Anthropic prompt caching
- Set cache breakpoints
- Reduce API costs
### openai-caching
**Keywords:** openai, gpt, cached_tokens, automatic
**Solves:**
- Use OpenAI prompt caching
- Structure prompts for cache hits
- Monitor cache effectiveness
### wrapper-template
**Keywords:** wrapper, template, implementation, python
**Solves:**
- Prompt cache wrapper template
- Python implementation
- Drop-in caching layer
This skill provides provider-native prompt caching for Claude and OpenAI to cut LLM token costs and speed up repeated requests. It implements cache breakpoints for system prompts and few-shot examples, supports short and extended TTLs, and exposes metrics to monitor cache effectiveness. Use it to reduce token spend on stable prefixes and to structure prompts for cache hits.
The skill marks stable prompt parts (system prompt, few‑shot examples, tools) with cache_control so the provider stores and reuses them. For Claude you choose TTLs (5m default or 1h extended) and insert cache breakpoints before dynamic user content. OpenAI’s newer models perform automatic prefix caching and report cached token counts in responses. The skill also surfaces usage fields for cache creation, cache reads, and non‑cached input tokens for hit‑rate calculations.
**How do I know when caching saved money?**
Compare `response.usage.cache_read_input_tokens` and `cache_creation_input_tokens`; compute the hit rate and convert cached tokens into dollar savings using the provider's price multipliers.

**When should I use the 1h TTL instead of 5m?**
Use 1h when a prefix is reused more than ~10 times per hour, the prompt is stable, and the token count is large (e.g., >10k), so the 2x write cost is justified.

**What breaks a cache hit?**
Changing any content within or before a cached prefix invalidates it, and placing variable content ahead of the prefix prevents matches. For Claude, extended thinking changes invalidate cached messages but not cached system prompts or tools.

**Does OpenAI require manual cache markers?**
No. The OpenAI models listed perform automatic prefix caching and report cached token usage in the response; you only need to structure messages with stable prefixes first.