
semantic-caching skill

/plugins/ork/skills/semantic-caching

This skill helps reduce LLM costs by caching responses semantically using Redis, enabling multi-level caches and fast lookups.

npx playbooks add skill yonatangross/orchestkit --skill semantic-caching


Files (5)
SKILL.md
---
name: semantic-caching
description: Redis semantic caching for LLM applications. Use when implementing vector similarity caching, optimizing LLM costs through cached responses, or building multi-level cache hierarchies.
tags: [caching, semantic, redis, llm, cost]
context: fork
agent: data-pipeline-engineer
version: 1.0.0
author: OrchestKit
user-invocable: false
---

# Semantic Caching

Cache LLM responses by semantic similarity.

## Cache Hierarchy

```
Request → L1 (Exact) → L2 (Semantic) → L3 (Prompt) → L4 (LLM)
           ~1ms         ~10ms           ~2s          ~3s
         100% save    100% save       90% save    Full cost
```

## Redis Semantic Cache

```python
import json
import time

import numpy as np
from redis import Redis
from redisvl.index import SearchIndex
from redisvl.query import VectorQuery
from redisvl.query.filter import Tag

class SemanticCacheService:
    def __init__(self, redis_url: str, threshold: float = 0.92):
        self.client = Redis.from_url(redis_url)
        self.threshold = threshold
        # Assumes the index schema was created ahead of time
        self.index = SearchIndex.from_existing("semantic_cache", redis_client=self.client)

    async def get(self, content: str, agent_type: str) -> dict | None:
        embedding = await embed_text(content[:2000])  # your embedding helper

        query = VectorQuery(
            vector=embedding,
            vector_field_name="embedding",
            return_fields=["response", "vector_distance"],
            filter_expression=Tag("agent_type") == agent_type,
            num_results=1,
        )

        results = self.index.query(query)

        if results:
            # Cosine distance: 0.0 is identical; accept within (1 - threshold)
            distance = float(results[0].get("vector_distance", 1.0))
            if distance <= (1 - self.threshold):
                return json.loads(results[0]["response"])

        return None

    async def set(self, content: str, response: dict, agent_type: str):
        embedding = await embed_text(content[:2000])
        key = f"cache:{agent_type}:{hash_content(content)}"  # stable content hash

        self.client.hset(key, mapping={
            "agent_type": agent_type,
            # Vector fields must be stored as packed float32 bytes
            "embedding": np.array(embedding, dtype=np.float32).tobytes(),
            "response": json.dumps(response),
            "created_at": time.time(),
        })
        self.client.expire(key, 86400)  # 24h TTL
```

## Similarity Thresholds

| Threshold | Distance | Use Case |
|-----------|----------|----------|
| 0.98-1.00 | 0.00-0.02 | Nearly identical |
| 0.95-0.98 | 0.02-0.05 | Very similar |
| 0.92-0.95 | 0.05-0.08 | Similar (default) |
| 0.85-0.92 | 0.08-0.15 | Moderately similar |
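The table above assumes cosine distance, where distance is simply one minus similarity. A minimal helper for converting a similarity threshold into the distance cutoff used by the `get` method:

```python
def distance_cutoff(similarity_threshold: float) -> float:
    """Convert a cosine-similarity threshold into the maximum
    acceptable cosine distance returned by the vector index."""
    return 1.0 - similarity_threshold

# The default threshold of 0.92 accepts matches within distance 0.08.
```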

## Multi-Level Lookup

```python
# Assumes module-level instances: lru_cache (a bounded dict-like L1 cache),
# semantic_cache (SemanticCacheService), and an llm client.
async def get_llm_response(query: str, agent_type: str) -> dict:
    # L1: Exact match (in-memory LRU)
    cache_key = hash_content(query)
    if cache_key in lru_cache:
        return lru_cache[cache_key]

    # L2: Semantic similarity (Redis)
    similar = await semantic_cache.get(query, agent_type)
    if similar:
        lru_cache[cache_key] = similar  # Promote to L1
        return similar

    # L3/L4: LLM call with prompt caching
    response = await llm.generate(query)

    # Store in both caches for future requests
    await semantic_cache.set(query, response, agent_type)
    lru_cache[cache_key] = response

    return response
```

## Key Decisions

| Decision | Recommendation |
|----------|----------------|
| Threshold | Start at 0.92, tune based on hit rate |
| TTL | 24h for production |
| Embedding | text-embedding-3-small (fast) |
| L1 size | 1000-10000 entries |

## Common Mistakes

- Threshold too low (false positives)
- No cache warming (cold start)
- Missing metadata filters
- Not promoting L2 hits to L1

## Related Skills

- `prompt-caching` - Provider-native caching
- `embeddings` - Vector generation
- `cache-cost-tracking` - Langfuse integration

## Capability Details

### redis-vector-cache
**Keywords:** redis, vector, embedding, similarity, cache
**Solves:**
- Cache LLM responses by semantic similarity
- Reduce API costs with smart caching
- Implement multi-level cache hierarchy

### similarity-threshold
**Keywords:** threshold, similarity, tuning, cosine
**Solves:**
- Set appropriate similarity threshold
- Balance hit rate vs accuracy
- Tune cache performance

### orchestkit-integration
**Keywords:** orchestkit, integration, roi, cost-savings
**Solves:**
- Integrate caching with OrchestKit
- Calculate ROI for caching
- Production implementation guide

### cache-service
**Keywords:** service, implementation, template, production
**Solves:**
- Production cache service template
- Complete implementation example
- Redis integration code

Overview

This skill provides a production-ready Redis semantic caching layer for LLM applications, enabling vector similarity lookups to reuse prior responses. It implements a multi-level cache hierarchy (exact, semantic, prompt, LLM) to reduce API calls and lower inference cost while preserving response relevance. The skill ships Python reference code along with tuning guidance and integration notes.

How this skill works

On each request, the service first checks an L1 exact-match cache (in-memory LRU). On a miss, it computes an embedding for the query and performs a vector similarity search in Redis (L2). If an item falls within the configured distance threshold, the cached response is returned and promoted to L1. Otherwise the request falls through to the LLM (L3/L4), and the response is then stored in Redis and L1 with a TTL for future reuse.

When to use it

  • You need to reduce LLM API cost by reusing semantically similar responses.
  • Building a multi-level cache for high-throughput production agents.
  • Implementing RAG systems where repeated queries should return cached answers.
  • When you want to control hit rates through similarity thresholds and TTLs.
  • Adding caching to Claude Code or other LLM-based services to improve latency and stability.

Best practices

  • Start with a conservative similarity threshold (≈0.92) and tune based on measured hit rate and false positives.
  • Use short prefixes of inputs (e.g., first 2000 chars) when embedding to bound cost and latency.
  • Promote L2 semantic hits to L1 in-memory cache to optimize repeated access patterns.
  • Include metadata filters (agent_type, user, or tenant) in vector queries to avoid cross-context pollution.
  • Set a reasonable TTL (24h recommended) and consider cache warming for crucial prompts.

Example use cases

  • Customer support agent that returns cached answers for similar user questions to cut API spend.
  • Internal knowledge base search that returns previous LLM responses for semantically matching queries.
  • Multi-tenant assistant where agent_type filters ensure per-agent cache isolation.
  • Batch-processing pipeline that avoids redundant LLM calls by reusing recent responses.
  • Cost-tracking integration to compute ROI of caching vs. raw LLM spend.
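The ROI use case above reduces to comparing avoided LLM spend against per-request cache overhead (embedding plus Redis lookup). A back-of-the-envelope helper; the parameter values in the comment are purely illustrative:

```python
def caching_roi(requests: int, hit_rate: float,
                llm_cost_per_call: float, cache_cost_per_call: float) -> float:
    """Net savings: avoided LLM calls minus cache overhead paid on every request."""
    savings = requests * hit_rate * llm_cost_per_call
    overhead = requests * cache_cost_per_call  # embedding + Redis lookup
    return savings - overhead

# e.g. 100k requests, 35% hit rate, $0.01/LLM call, $0.0002/cache lookup -> $330 saved
```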

FAQ

How do I choose a similarity threshold?

Start near 0.92 (balance of recall/precision). Monitor false positives and hit rate, then raise threshold if incorrect matches occur or lower it to increase hits.
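Tuning as described above requires actually measuring the hit rate. A minimal counter sketch (class name is illustrative; in production this would feed your metrics backend):

```python
class CacheMetrics:
    """Track cache hit rate to inform similarity-threshold tuning."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```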

What embedding model should I use?

Prefer a fast, lower-cost model like text-embedding-3-small for real-time caching; swap to larger models only if semantic precision demands it.

How long should cached entries live?

24 hours is a strong default for production. Shorten TTL for volatile domains or lengthen for stable knowledge.