
embedding-optimization skill


This skill helps optimize embedding generation for cost, performance, and quality in RAG and semantic search systems.

npx playbooks add skill ancoleman/ai-design-components --skill embedding-optimization

Review the files below or copy the command above to add this skill to your agents.

Files (11)
SKILL.md
8.1 KB
---
name: embedding-optimization
description: Optimizing vector embeddings for RAG systems through model selection, chunking strategies, caching, and performance tuning. Use when building semantic search, RAG pipelines, or document retrieval systems that require cost-effective, high-quality embeddings.
---

# Embedding Optimization

Optimize embedding generation for cost, performance, and quality in RAG and semantic search systems.

## When to Use This Skill

Trigger this skill when:
- Building RAG (Retrieval Augmented Generation) systems
- Implementing semantic search or similarity detection
- Reducing embedding API costs (often by 70-90% through caching and model selection)
- Improving document retrieval quality through better chunking
- Processing large document corpora (thousands to millions of documents)
- Selecting between API-based and local embedding models

## Model Selection Framework

Choose the optimal embedding model based on requirements:

**Quick Recommendations:**
- **Startup/MVP:** `all-MiniLM-L6-v2` (local, 384 dims, zero API costs)
- **Production:** `text-embedding-3-small` (API, 1,536 dims, balanced quality/cost)
- **High Quality:** `text-embedding-3-large` (API, 3,072 dims, premium)
- **Multilingual:** `multilingual-e5-base` (local, 768 dims) or Cohere `embed-multilingual-v3.0`

For detailed decision frameworks including cost comparisons, quality benchmarks, and data privacy considerations, see `references/model-selection-guide.md`.

**Model Comparison Summary:**

| Model | Type | Dimensions | Cost per 1M tokens | Best For |
|-------|------|-----------|-------------------|----------|
| all-MiniLM-L6-v2 | Local | 384 | $0 (compute only) | High volume, tight budgets |
| BGE-base-en-v1.5 | Local | 768 | $0 (compute only) | Quality + cost balance |
| text-embedding-3-small | API | 1,536 | $0.02 | General purpose production |
| text-embedding-3-large | API | 3,072 | $0.13 | Premium quality requirements |
| embed-multilingual-v3.0 | API | 1,024 | $0.10 | 100+ language support |
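
As a minimal sketch of the local vs. API trade-off, the two functions below wrap `all-MiniLM-L6-v2` (via `sentence-transformers`) and `text-embedding-3-small` (via the `openai` v1 client) behind the same lists-in, vectors-out shape. The library versions, model handles, and lack of error handling are assumptions for illustration, not requirements of this skill; later sketches in this file reuse `embed_api` as the API-backed embedder.

```python
# Minimal sketch: interchangeable local and API embedders
# (assumes sentence-transformers and openai>=1.0 are installed, OPENAI_API_KEY is set).
from openai import OpenAI
from sentence_transformers import SentenceTransformer

_local_model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim, no API cost
_client = OpenAI()                                      # reads OPENAI_API_KEY

def embed_local(texts: list[str]) -> list[list[float]]:
    # Normalized vectors make cosine similarity a plain dot product.
    return _local_model.encode(texts, normalize_embeddings=True).tolist()

def embed_api(texts: list[str]) -> list[list[float]]:
    # text-embedding-3-small: 1,536-dim, $0.02 per 1M tokens.
    response = _client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in response.data]
```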

## Chunking Strategies

Select chunking strategy based on content type and use case:

**Content Type → Strategy Mapping:**
- **Documentation:** Recursive (heading-aware), 800 chars, 100 overlap
- **Code:** Recursive (function-level), 1,000 chars, 100 overlap
- **Q&A/FAQ:** Fixed-size, 500 chars, 50 overlap (precise retrieval)
- **Legal/Technical:** Semantic (large), 1,500 chars, 200 overlap (context preservation)
- **Blog Posts:** Semantic (paragraph), 1,000 chars, 100 overlap
- **Academic Papers:** Recursive (section-aware), 1,200 chars, 150 overlap

For detailed chunking patterns, decision trees, and implementation guidance, see `references/chunking-strategies.md`.

**Quick Start with CLI:**
```bash
python scripts/chunk_document.py \
  --input document.txt \
  --content-type markdown \
  --chunk-size 800 \
  --overlap 100 \
  --output chunks.jsonl
```
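
If you prefer to chunk in-process rather than via the CLI, the sketch below shows an equivalent heading-aware recursive split using LangChain's `RecursiveCharacterTextSplitter`. The dependency choice and separator list are assumptions that mirror the documentation preset above, not necessarily what `scripts/chunk_document.py` uses internally.

```python
# Minimal sketch: heading-aware recursive chunking for markdown documentation
# (assumes the langchain-text-splitters package; sizes mirror the preset above).
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,      # characters, per the documentation preset
    chunk_overlap=100,
    separators=["\n## ", "\n### ", "\n\n", "\n", " ", ""],  # try headings first
)

with open("document.txt", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())

print(f"{len(chunks)} chunks, avg {sum(map(len, chunks)) // len(chunks)} chars")
```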

## Caching Implementation

Achieve 80-90% cost reduction through content-addressable caching.

**Caching Architecture by Query Volume:**
- **<10K queries/month:** In-memory cache (Python `functools.lru_cache`; see the sketch below)
- **10K-100K queries/month:** Redis (fast, TTL-based expiration)
- **100K-1M queries/month:** Redis (hot) + PostgreSQL (warm)
- **>1M queries/month:** Multi-tier (Redis + PostgreSQL + S3)
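
For the lowest-volume tier, a plain `functools.lru_cache` around the embedding call is often enough. This sketch assumes the `embed_api` helper from the model-selection sketch; the cache size is illustrative.

```python
# Minimal sketch: in-memory caching for <10K queries/month
# (assumes the embed_api helper defined earlier; maxsize is illustrative).
from functools import lru_cache

@lru_cache(maxsize=10_000)
def embed_cached(text: str) -> tuple[float, ...]:
    # lru_cache keys on the exact string, so identical inputs skip the API call.
    return tuple(embed_api([text])[0])  # tuple keeps the cached vector immutable
```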

**Production Caching with Redis:**
```bash
# Embed documents with caching enabled
python scripts/cached_embedder.py \
  --model text-embedding-3-small \
  --input documents.jsonl \
  --output embeddings.npy \
  --cache-backend redis \
  --cache-ttl 2592000  # 30 days
```
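
Under the hood, content-addressable caching amounts to keying the cache on a hash of the chunk text. The sketch below is one way to do that with `redis-py`, again assuming the `embed_api` helper; it is an illustration, not the internals of `scripts/cached_embedder.py`.

```python
# Minimal sketch: content-addressable embedding cache in Redis
# (assumes redis-py and the embed_api helper; key prefix and TTL are illustrative).
import hashlib
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)
TTL_SECONDS = 30 * 24 * 3600  # 30 days, matching --cache-ttl 2592000 above

def embed_with_cache(text: str) -> list[float]:
    key = "emb:" + hashlib.sha256(text.encode("utf-8")).hexdigest()
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)      # cache hit: no API call
    vector = embed_api([text])[0]      # cache miss: one paid embedding
    r.setex(key, TTL_SECONDS, json.dumps(vector))
    return vector
```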

**Caching ROI Example:**
- 50,000 document chunks (~500 tokens each with `text-embedding-3-small`)
- 20% duplicate content, plus repeat embeddings of unchanged chunks across ingestion runs
- Without caching: $0.50 API cost per run
- With caching (60% hit rate): $0.20 API cost
- **Savings: 60% ($0.30)**

## Dimensionality Trade-offs

Balance storage, search speed, and quality:

| Dimensions | Storage (1M vectors) | Search Speed (p95) | Quality | Use Case |
|-----------|---------------------|-------------------|---------|----------|
| 384 | 1.5 GB | 10ms | Good | Large-scale search |
| 768 | 3 GB | 15ms | High | General purpose RAG |
| 1,536 | 6 GB | 25ms | Very High | High-quality retrieval |
| 3,072 | 12 GB | 40ms | Highest | Premium applications |

**Key Insight:** For most RAG applications, 768 dimensions (BGE-base-en-v1.5 local or equivalent) provides the best quality/cost/speed balance.
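
The storage column is consistent with uncompressed float32 vectors (4 bytes per dimension). A quick estimate for your own corpus, ignoring index overhead and quantization:

```python
# Quick estimate of raw vector storage (float32, no index overhead or quantization).
def storage_gb(dimensions: int, num_vectors: int, bytes_per_value: int = 4) -> float:
    return dimensions * bytes_per_value * num_vectors / 1e9

print(storage_gb(768, 1_000_000))  # 3.072 -> ~3 GB, matching the table row
```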

## Batch Processing Optimization

Maximize throughput for large-scale ingestion:

**OpenAI API:**
- Batch up to 2,048 inputs per request
- Implement rate limiting (tier-dependent: 500-5,000 RPM)
- Use parallel requests with backoff on rate limits

**Local Models (sentence-transformers):**
- GPU acceleration (CUDA, MPS for Apple Silicon)
- Batch size tuning (32-128 based on GPU memory)
- Multi-GPU support for maximum throughput

**Expected Throughput:**
- OpenAI API: 1,000-5,000 texts/minute (rate limit dependent)
- Local GPU (RTX 3090): 5,000-10,000 texts/minute
- Local CPU: 100-500 texts/minute
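
A minimal sketch of API-side batching with backoff, assuming the `openai` v1 client; the batch cap comes from the 2,048-input limit above, and the retry count and sleep schedule are illustrative.

```python
# Minimal sketch: batched OpenAI embedding with exponential backoff on rate limits
# (assumes openai>=1.0; batch size and retry limits are illustrative).
import time

from openai import OpenAI, RateLimitError

client = OpenAI()

def embed_in_batches(texts, model="text-embedding-3-small", batch_size=2048):
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(5):
            try:
                response = client.embeddings.create(model=model, input=batch)
                vectors.extend(item.embedding for item in response.data)
                break
            except RateLimitError:
                time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, 8s, 16s
        else:
            raise RuntimeError(f"Batch starting at {start} failed after retries")
    return vectors
```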

## Performance Monitoring

Track key metrics for optimization:

**Critical Metrics:**
- **Latency:** Embedding generation time (p50, p95, p99)
- **Throughput:** Embeddings per second/minute
- **Cost:** API usage tracking (USD per 1K/1M tokens)
- **Cache Efficiency:** Hit rate percentage

For detailed monitoring setup, metric collection patterns, and dashboarding, see `references/performance-monitoring.md`.

**Monitor with Wrapper:**
```python
from scripts.performance_monitor import MonitoredEmbedder

monitored = MonitoredEmbedder(
    embedder=your_embedder,
    cost_per_1k_tokens=0.00002  # text-embedding-3-small: $0.02 per 1M tokens
)

embeddings = monitored.embed_batch(texts)
metrics = monitored.get_metrics()
print(f"Cache hit rate: {metrics['cache_hit_rate_pct']}%")
print(f"Total cost: ${metrics['total_cost_usd']}")
```

## Working Examples

See `examples/` directory for complete implementations:

**Python Examples:**
- `examples/openai_cached.py` - OpenAI embeddings with Redis caching
- `examples/local_embedder.py` - sentence-transformers local embedding
- `examples/smart_chunker.py` - Content-aware recursive chunking
- `examples/performance_monitor.py` - Pipeline performance tracking
- `examples/batch_processor.py` - Large-scale document processing

All examples include:
- Complete, runnable code
- Dependency installation instructions
- Error handling and retry logic
- Configuration options

## Integration Points

**Upstream (This skill provides to):**
- **Vector Databases:** Embeddings flow to Pinecone, Weaviate, Qdrant, pgvector
- **RAG Systems:** Optimized embeddings for retrieval pipelines
- **Semantic Search:** Query and document embeddings for similarity search

**Downstream (This skill uses from):**
- **Document Processing:** Chunk documents before embedding
- **Data Ingestion:** Process documents from various sources

**Related Skills:**
- For RAG architecture, see `building-ai-chat` skill
- For vector database operations, see `databases-vector` skill
- For data ingestion pipelines, see `ingesting-data` skill

## Common Patterns

**Pattern 1: RAG Pipeline**
```
Document → Chunk → Embed → Store (vector DB) → Retrieve
```

**Pattern 2: Semantic Search**
```
Query → Embed → Search (vector DB) → Rank → Display
```

**Pattern 3: Multi-Stage Retrieval (Cost Optimization)**
```
Query → Cheap Embedding (384d) → Initial Search →
Expensive Embedding (1,536d) → Rerank Top-K → Return
```
**Cost Savings:** roughly 70% reduction vs. a single stage that uses the expensive embedding for everything
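
A sketch of what Pattern 3 can look like in code, assuming the `embed_api` helper from earlier and a placeholder `vector_db.search()` call for whatever vector store you use (the index is presumed to hold 384-dim vectors from the cheap model):

```python
# Minimal sketch of multi-stage retrieval: cheap local embedding for recall,
# expensive API embedding only for reranking the shortlist.
# `vector_db.search` and `embed_api` are assumed helpers, not part of this skill.
import numpy as np
from sentence_transformers import SentenceTransformer

cheap_model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim, runs locally

def multi_stage_search(query: str, vector_db, top_k: int = 5, candidate_k: int = 100):
    # Stage 1: broad candidate search against the cheap 384-dim index.
    cheap_vec = cheap_model.encode(query, normalize_embeddings=True)
    candidates = vector_db.search(cheap_vec, k=candidate_k)  # -> list of candidate texts

    # Stage 2: rerank only the shortlist with the higher-quality embedding.
    query_vec = np.asarray(embed_api([query])[0])
    cand_vecs = np.asarray(embed_api(candidates))
    scores = cand_vecs @ query_vec / (
        np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return [candidates[i] for i in np.argsort(scores)[::-1][:top_k]]
```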

## Quick Reference Checklist

**Model Selection:**
- [ ] Identified data privacy requirements (local vs. API)
- [ ] Calculated expected query volume
- [ ] Determined quality requirements (good/high/highest)
- [ ] Checked multilingual support needs

**Chunking:**
- [ ] Analyzed content type (code, docs, legal, etc.)
- [ ] Selected appropriate chunk size (500-1,500 chars)
- [ ] Set overlap to prevent context loss (50-200 chars)
- [ ] Validated chunks preserve semantic boundaries

**Caching:**
- [ ] Implemented content-addressable hashing
- [ ] Selected cache backend (Redis, PostgreSQL)
- [ ] Set TTL based on content volatility
- [ ] Monitoring cache hit rate (target: >60%)

**Performance:**
- [ ] Tracking latency (embedding generation time)
- [ ] Measuring throughput (embeddings/sec)
- [ ] Monitoring costs (USD spent on API calls)
- [ ] Optimizing batch sizes for maximum efficiency

Overview

This skill optimizes vector embedding pipelines for RAG and semantic search systems to balance cost, speed, and retrieval quality. It provides concrete model selection guidance, chunking strategies, caching architectures, and production tuning patterns for large-scale document ingestion. Use it to cut embedding API costs and improve retrieval relevance while scaling throughput and monitoring performance.

How this skill works

The skill inspects content type and volume to recommend a model and chunking approach, then applies caching and batching strategies to reduce repeated API calls and maximize GPU/CPU throughput. It includes dimensionality trade-offs, multi-tier cache architectures, and monitoring wrappers to track latency, throughput, cost, and cache efficiency. Practical scripts and examples demonstrate local vs API embedding flows, smart chunkers, and production-ready caching with Redis/Postgres/S3.

When to use it

  • Building RAG pipelines or semantic search over large corpora
  • Reducing embedding API costs while preserving retrieval quality
  • Selecting between local and API embedding models based on privacy and budget
  • Processing thousands to millions of documents for ingestion
  • Tuning throughput for GPU/CPU batch embedding or API rate limits

Best practices

  • Pick model dimensionality based on quality vs storage/speed trade-offs (768 often best balance)
  • Use content-aware chunking (recursive/semantic) with overlap to preserve context
  • Implement content-addressable caching and choose cache backend by query volume
  • Batch requests and tune batch sizes for GPU memory or API limits; add backoff/retries
  • Monitor latency, throughput, cost, and cache hit rate; target >60% cache hits

Example use cases

  • High-volume semantic search using local 384/768-dim models to minimize API spend
  • Production RAG with 1,536-dim API embeddings for balanced quality and cost
  • Multi-stage retrieval: cheap embedding for candidate search, expensive embedding for rerank
  • Large-scale ingestion pipeline with Redis hot cache + Postgres warm store + S3 cold store
  • Academic or legal document retrieval using section-aware chunking and larger contexts

FAQ

Which model should I choose for a budget-constrained MVP?

Use a local all-MiniLM-L6-v2 or BGE-base variant (384–768 dims) to avoid API costs while getting reasonable quality.

How do I size cache layers for 100k–1M queries/month?

Use Redis as the hot layer, PostgreSQL as warm storage, and S3 for cold archival; tune TTLs to balance hit rate and storage.