
semantic-caching skill

/plugins/ork/skills/semantic-caching

This skill helps reduce LLM costs by caching responses semantically using Redis, enabling multi-level caches and fast lookups.

npx playbooks add skill yonatangross/orchestkit --skill semantic-caching


Files (5)
SKILL.md
---
name: semantic-caching
description: Redis semantic caching for LLM applications. Use when implementing vector similarity caching, optimizing LLM costs through cached responses, or building multi-level cache hierarchies.
tags: [caching, semantic, redis, llm, cost]
context: fork
agent: data-pipeline-engineer
version: 1.0.0
author: OrchestKit
user-invocable: false
---

# Semantic Caching

Cache LLM responses by semantic similarity.

## Cache Hierarchy

```
Request → L1 (Exact) → L2 (Semantic) → L3 (Prompt) → L4 (LLM)
           ~1ms         ~10ms           ~2s          ~3s
         100% save    100% save       90% save    Full cost
```

## Redis Semantic Cache

```python
import json
import time

import numpy as np
from redis import Redis
from redisvl.index import SearchIndex
from redisvl.query import VectorQuery
from redisvl.query.filter import Tag

class SemanticCacheService:
    def __init__(self, redis_url: str, threshold: float = 0.92):
        self.client = Redis.from_url(redis_url)
        self.threshold = threshold
        # Assumes the index schema was created ahead of time
        self.index = SearchIndex.from_existing("semantic_cache", redis_client=self.client)

    async def get(self, content: str, agent_type: str) -> dict | None:
        embedding = await embed_text(content[:2000])  # your embedding helper

        query = VectorQuery(
            vector=embedding,
            vector_field_name="embedding",
            return_fields=["response", "vector_distance"],
            filter_expression=Tag("agent_type") == agent_type,
            num_results=1,
        )

        results = self.index.query(query)

        if results:
            # Cosine distance: 0.0 is identical; accept within (1 - threshold)
            distance = float(results[0].get("vector_distance", 1.0))
            if distance <= (1 - self.threshold):
                return json.loads(results[0]["response"])

        return None

    async def set(self, content: str, response: dict, agent_type: str):
        embedding = await embed_text(content[:2000])
        key = f"cache:{agent_type}:{hash_content(content)}"  # stable content hash

        self.client.hset(key, mapping={
            "agent_type": agent_type,
            # Vector fields must be stored as packed float32 bytes
            "embedding": np.array(embedding, dtype=np.float32).tobytes(),
            "response": json.dumps(response),
            "created_at": time.time(),
        })
        self.client.expire(key, 86400)  # 24h TTL
```

## Similarity Thresholds

| Threshold | Distance | Use Case |
|-----------|----------|----------|
| 0.98-1.00 | 0.00-0.02 | Nearly identical |
| 0.95-0.98 | 0.02-0.05 | Very similar |
| 0.92-0.95 | 0.05-0.08 | Similar (default) |
| 0.85-0.92 | 0.08-0.15 | Moderately similar |
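The table above assumes cosine distance, where distance is simply one minus similarity. A minimal helper for converting a similarity threshold into the distance cutoff used by the `get` method:

```python
def distance_cutoff(similarity_threshold: float) -> float:
    """Convert a cosine-similarity threshold into the maximum
    acceptable cosine distance returned by the vector index."""
    return 1.0 - similarity_threshold

# The default threshold of 0.92 accepts matches within distance 0.08.
```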

## Multi-Level Lookup

```python
# Assumes module-level instances: lru_cache (a bounded dict-like L1 cache),
# semantic_cache (SemanticCacheService), and an llm client.
async def get_llm_response(query: str, agent_type: str) -> dict:
    # L1: Exact match (in-memory LRU)
    cache_key = hash_content(query)
    if cache_key in lru_cache:
        return lru_cache[cache_key]

    # L2: Semantic similarity (Redis)
    similar = await semantic_cache.get(query, agent_type)
    if similar:
        lru_cache[cache_key] = similar  # Promote to L1
        return similar

    # L3/L4: LLM call with prompt caching
    response = await llm.generate(query)

    # Store in both caches for future requests
    await semantic_cache.set(query, response, agent_type)
    lru_cache[cache_key] = response

    return response
```

## Key Decisions

| Decision | Recommendation |
|----------|----------------|
| Threshold | Start at 0.92, tune based on hit rate |
| TTL | 24h for production |
| Embedding | text-embedding-3-small (fast) |
| L1 size | 1000-10000 entries |

## Common Mistakes

- Threshold too low (false positives)
- No cache warming (cold start)
- Missing metadata filters
- Not promoting L2 hits to L1

## Related Skills

- `prompt-caching` - Provider-native caching
- `embeddings` - Vector generation
- `cache-cost-tracking` - Langfuse integration

## Capability Details

### redis-vector-cache
**Keywords:** redis, vector, embedding, similarity, cache
**Solves:**
- Cache LLM responses by semantic similarity
- Reduce API costs with smart caching
- Implement multi-level cache hierarchy

### similarity-threshold
**Keywords:** threshold, similarity, tuning, cosine
**Solves:**
- Set appropriate similarity threshold
- Balance hit rate vs accuracy
- Tune cache performance

### orchestkit-integration
**Keywords:** orchestkit, integration, roi, cost-savings
**Solves:**
- Integrate caching with OrchestKit
- Calculate ROI for caching
- Production implementation guide

### cache-service
**Keywords:** service, implementation, template, production
**Solves:**
- Production cache service template
- Complete implementation example
- Redis integration code

Overview

This skill provides a production-ready Redis semantic caching layer for LLM applications, enabling vector similarity lookups to reuse prior responses. It implements a multi-level cache hierarchy (exact, semantic, prompt, LLM) to reduce API calls and lower inference cost while preserving response relevance. The skill ships Python reference code along with tuning guidance and integration notes.

How this skill works

On each request, the service first checks an L1 exact-match cache (in-memory LRU). On a miss, it computes an embedding for the query and performs a vector similarity search in Redis (L2). If an item falls within the configured distance threshold, the cached response is returned and promoted to L1. Otherwise the request falls through to the LLM (L3/L4), and the response is then stored in Redis and L1 with a TTL for future reuse.

When to use it

  • You need to reduce LLM API cost by reusing semantically similar responses.
  • Building a multi-level cache for high-throughput production agents.
  • Implementing RAG systems where repeated queries should return cached answers.
  • When you want to control hit rates through similarity thresholds and TTLs.
  • Adding caching to Claude Code or other LLM-based services to improve latency and stability.

Best practices

  • Start with a conservative similarity threshold (≈0.92) and tune based on measured hit rate and false positives.
  • Use short prefixes of inputs (e.g., first 2000 chars) when embedding to bound cost and latency.
  • Promote L2 semantic hits to L1 in-memory cache to optimize repeated access patterns.
  • Include metadata filters (agent_type, user, or tenant) in vector queries to avoid cross-context pollution.
  • Set a reasonable TTL (24h recommended) and consider cache warming for crucial prompts.

Example use cases

  • Customer support agent that returns cached answers for similar user questions to cut API spend.
  • Internal knowledge base search that returns previous LLM responses for semantically matching queries.
  • Multi-tenant assistant where agent_type filters ensure per-agent cache isolation.
  • Batch-processing pipeline that avoids redundant LLM calls by reusing recent responses.
  • Cost-tracking integration to compute ROI of caching vs. raw LLM spend.
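The ROI use case above reduces to comparing avoided LLM spend against per-request cache overhead (embedding plus Redis lookup). A back-of-the-envelope helper; the parameter values in the comment are purely illustrative:

```python
def caching_roi(requests: int, hit_rate: float,
                llm_cost_per_call: float, cache_cost_per_call: float) -> float:
    """Net savings: avoided LLM calls minus cache overhead paid on every request."""
    savings = requests * hit_rate * llm_cost_per_call
    overhead = requests * cache_cost_per_call  # embedding + Redis lookup
    return savings - overhead

# e.g. 100k requests, 35% hit rate, $0.01/LLM call, $0.0002/cache lookup -> $330 saved
```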

FAQ

How do I choose a similarity threshold?

Start near 0.92 (balance of recall/precision). Monitor false positives and hit rate, then raise threshold if incorrect matches occur or lower it to increase hits.
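Tuning as described above requires actually measuring the hit rate. A minimal counter sketch (class name is illustrative; in production this would feed your metrics backend):

```python
class CacheMetrics:
    """Track cache hit rate to inform similarity-threshold tuning."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```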

What embedding model should I use?

Prefer a fast, lower-cost model like text-embedding-3-small for real-time caching; swap to larger models only if semantic precision demands it.

How long should cached entries live?

24 hours is a strong default for production. Shorten TTL for volatile domains or lengthen for stable knowledge.