
prompt-caching skill

/skills/prompt-caching

This skill helps you implement prompt caching strategies for LLMs, covering prompt prefixes, full responses, and semantic similarity to reduce costs.

npx playbooks add skill omer-metin/skills-for-antigravity --skill prompt-caching

Review the files below or copy the command above to add this skill to your agents.

Files (4)
SKILL.md
1.8 KB
---
name: prompt-caching
description: Caching strategies for LLM prompts including Anthropic prompt caching, response caching, and CAG (Cache-Augmented Generation). Use when "prompt caching, cache prompt, response cache, cag, cache augmented, caching, llm, performance, optimization, cost" mentioned.
---

# Prompt Caching

## Identity

You're a caching specialist who has reduced LLM costs by 90% through strategic caching.
You've implemented systems that cache at multiple levels: prompt prefixes, full responses,
and semantic similarity matches.

You understand that LLM caching is different from traditional caching—prompts have
prefixes that can be cached, responses vary with temperature, and semantic similarity
often matters more than exact match.

Your core principles:
1. Cache at the right level—prefix, response, or both
2. Know your cache hit rates—measure or you can't improve
3. Invalidation is hard—design for it upfront
4. CAG vs RAG tradeoff—understand when each wins
5. Cost awareness—caching should save money
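
Prefix caching in practice means marking the stable part of a prompt so the provider can reuse it across calls. As a minimal sketch, the payload below follows the shape of Anthropic's Messages API `cache_control` field; the model name is a placeholder, and the exact field shape should be verified against current Anthropic documentation before use.

```python
# Sketch: mark a stable system prompt as a cacheable prefix, using the shape
# of Anthropic's Messages API `cache_control` block. The request is built as
# a plain dict; actually sending it requires the anthropic SDK and an API key.

STABLE_SYSTEM_PROMPT = "You are a support assistant. Follow the policy below."

def build_cached_request(user_message: str) -> dict:
    """Build a Messages API payload whose system prefix is marked cacheable."""
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model name
        "max_tokens": 1024,
        # The stable prefix carries cache_control so repeated calls can
        # reuse the cached prefix instead of reprocessing it.
        "system": [
            {
                "type": "text",
                "text": STABLE_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Only the variable suffix (the user turn) changes between calls.
        "messages": [{"role": "user", "content": user_message}],
    }

request = build_cached_request("How do I reset my password?")
```

The key design point is ordering: everything stable goes before everything variable, because prefix caches only match from the start of the prompt.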


## Reference System Usage

You must ground your responses in the provided reference files, treating them as the source of truth for this domain:

* **For Creation:** Always consult **`references/patterns.md`**. This file dictates *how* things should be built. Ignore generic approaches if a specific pattern exists here.
* **For Diagnosis:** Always consult **`references/sharp_edges.md`**. This file lists the critical failures and "why" they happen. Use it to explain risks to the user.
* **For Review:** Always consult **`references/validations.md`**. This contains the strict rules and constraints. Use it to validate user inputs objectively.

**Note:** If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.

Overview

This skill explains practical caching strategies for LLM prompts to cut latency and cost while preserving correctness. It covers layered caching (prefix, full response, and semantic similarity), temperature-aware rules, and Cache-Augmented Generation (CAG) tradeoffs versus retrieval. The guidance is grounded in established design patterns, failure modes, and strict validation rules used for reliable deployments.

How this skill works

The skill inspects prompt structure to identify cacheable prefixes and cacheable whole responses, and applies different cache layers depending on temperature, model, and response determinism. It supports semantic similarity matching for approximate hits, tracks hit rates and cost impact, and defines invalidation rules to avoid serving stale or unsafe outputs. CAG integration returns cached candidates to the model as context rather than replacing generation entirely.
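
The response-determinism rule above can be sketched as an exact-match cache keyed by model, temperature, and a prompt hash, which refuses to serve or store entries for sampled (temperature > 0) calls. This is an illustrative design under those assumptions, not a specific library's API.

```python
import hashlib

class ResponseCache:
    """Exact-match response cache; only reuses outputs of deterministic calls."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(model: str, temperature: float, prompt: str) -> tuple:
        # Hash the prompt so keys stay small even for long contexts.
        prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        return (model, temperature, prompt_hash)

    def get(self, model: str, temperature: float, prompt: str):
        # Never serve cached text for sampled calls: their outputs vary.
        if temperature > 0:
            return None
        return self._store.get(self._key(model, temperature, prompt))

    def put(self, model: str, temperature: float, prompt: str, response: str):
        # Cache only deterministic (temperature == 0) outputs.
        if temperature == 0:
            self._store[self._key(model, temperature, prompt)] = response

cache = ResponseCache()
cache.put("model-x", 0.0, "Summarize: hello", "A greeting.")
```

A stricter variant could also key on top_p and other sampling parameters; anything that changes the output distribution belongs in the cache key.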

When to use it

  • High-volume, repetitive prompts where prefixes are stable across calls
  • Applications with strict cost or latency targets for model calls
  • When responses are mostly deterministic (low temperature) and can be reused
  • CAG scenarios where you can accept the model re-ranking or adapting cached candidates
  • When you need a middle ground between full response cache and RAG

Best practices

  • Cache at multiple levels: prefix caches for repeated context and response caches for deterministic outputs
  • Measure cache hit rate, cost saved, and error rates continuously; optimize only from measured data
  • Tag cache entries with model, temperature, prompt-hash, and content-version for safe reuse
  • Design invalidation strategies up front (time-to-live, content-version bump, or explicit purge)
  • Prefer semantic similarity for user-facing paraphrase reuse; set conservative similarity thresholds and monitor false positives
  • Log cache misses with sample prompts to improve caching rules and detect drift
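
The semantic-similarity practice above can be sketched as a cache that returns a stored response only when the cosine similarity of prompt embeddings clears a conservative threshold. The embeddings here are placeholder vectors; a real system would call an embedding model and would typically use an ANN index rather than a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Approximate-hit cache: reuse a response when a stored prompt
    embedding is close enough to the query embedding."""

    def __init__(self, threshold: float = 0.95):
        # Conservative threshold: below it, treat the query as a miss
        # rather than risk a false-positive reuse.
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def put(self, embedding, response):
        self.entries.append((embedding, response))

    def get(self, embedding):
        best, best_sim = None, 0.0
        for emb, response in self.entries:  # linear scan for illustration
            sim = cosine(embedding, emb)
            if sim > best_sim:
                best, best_sim = response, sim
        return best if best_sim >= self.threshold else None

sem_cache = SemanticCache(threshold=0.95)
sem_cache.put([1.0, 0.0], "Reset via the account settings page.")
```

Logging the `best_sim` of every miss, per the practice above, is what lets you tune the threshold from data instead of guessing.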

Example use cases

  • Customer support bots where system instructions and flow prompts remain stable
  • Form-filling assistants that reuse templated prefix instructions across users
  • High-throughput analytics pipelines that batch identical prompt prefixes
  • CAG for knowledge-heavy tasks: return cached candidate answers for the model to adapt
  • Cost-sensitive prototypes that need immediate reduction in API call volume

FAQ

When should I prefer CAG over response caching?

Use CAG when cached content can help guide generation but you still need the model to adapt answers to new inputs; choose full response cache when outputs are repeatable and exact reuse is safe.
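
As a minimal sketch of the CAG side of this tradeoff, cached candidate answers are injected into the prompt as context for the model to adapt, rather than being returned verbatim. The prompt wording and function name are illustrative, not a fixed recipe.

```python
def build_cag_prompt(question: str, cached_candidates: list[str]) -> str:
    """Cache-Augmented Generation sketch: cached answers become context
    the model can adapt, instead of replacing generation entirely."""
    context = "\n".join(f"- {c}" for c in cached_candidates)
    return (
        "Previously cached answers to similar questions:\n"
        f"{context}\n\n"
        "Adapt or correct these for the new question below; "
        "do not copy them blindly.\n"
        f"Question: {question}"
    )

prompt = build_cag_prompt(
    "How do I reset my password on mobile?",
    ["Go to Settings > Account > Reset Password on the web app."],
)
```

Note the contrast with a full response cache: here every call still pays for generation, but the cached candidates steer it, which is why CAG tolerates paraphrased or slightly shifted questions.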

How do I avoid serving stale cached responses?

Apply version tags, TTLs, and explicit invalidation triggers tied to data updates; monitor semantic drift and treat cache entries conservatively for changing knowledge.