
using-vector-databases skill


This skill helps you design and implement vector-database-powered RAG and semantic search workflows in Python and TypeScript.

npx playbooks add skill ancoleman/ai-design-components --skill using-vector-databases

Review the files below or copy the command above to add this skill to your agents.

Files (22)
SKILL.md
---
name: using-vector-databases
description: Vector database implementation for AI/ML applications, semantic search, and RAG systems. Use when building chatbots, search engines, recommendation systems, or similarity-based retrieval. Covers Qdrant (primary), Pinecone, Milvus, pgvector, Chroma, embedding generation (OpenAI, Voyage, Cohere), chunking strategies, and hybrid search patterns.
---

# Vector Databases for AI Applications

## When to Use This Skill

Use this skill when implementing:
- **RAG (Retrieval-Augmented Generation)** systems for AI chatbots
- **Semantic search** capabilities (meaning-based, not just keyword)
- **Recommendation systems** based on similarity
- **Multi-modal AI** (unified search across text, images, audio)
- **Document similarity** and deduplication
- **Question answering** over private knowledge bases

## Quick Decision Framework

### 1. Vector Database Selection

```
START: Choosing a Vector Database

EXISTING INFRASTRUCTURE?
├─ Using PostgreSQL already?
│  └─ pgvector (<10M vectors, tight budget)
│      See: references/pgvector.md
│
└─ No existing vector database?
   │
   ├─ OPERATIONAL PREFERENCE?
   │  │
   │  ├─ Zero-ops managed only
   │  │  └─ Pinecone (fully managed, excellent DX)
   │  │      See: references/pinecone.md
   │  │
   │  └─ Flexible (self-hosted or managed)
   │     │
   │     ├─ SCALE: <100M vectors + complex filtering ⭐
   │     │  └─ Qdrant (RECOMMENDED)
   │     │      • Best metadata filtering
   │     │      • Built-in hybrid search (BM25 + Vector)
   │     │      • Self-host: Docker/K8s
   │     │      • Managed: Qdrant Cloud
   │     │      See: references/qdrant.md
   │     │
   │     ├─ SCALE: >100M vectors + GPU acceleration
   │     │  └─ Milvus / Zilliz Cloud
   │     │      See: references/milvus.md
   │     │
   │     ├─ Embedded / No server
   │     │  └─ LanceDB (serverless, edge deployment)
   │     │
   │     └─ Local prototyping
   │        └─ Chroma (simple API, in-memory)
```

### 2. Embedding Model Selection

```
REQUIREMENTS?

├─ Best quality (cost no object)
│  └─ Voyage AI voyage-3 (1024d)
│      • 9.74% better than OpenAI on MTEB
│      • ~$0.12/1M tokens
│      See: references/embedding-strategies.md
│
├─ Enterprise reliability
│  └─ OpenAI text-embedding-3-large (3072d)
│      • Industry standard
│      • ~$0.13/1M tokens
│      • Matryoshka shortening: reduce to 256/512/1024d
│
├─ Cost-optimized
│  └─ OpenAI text-embedding-3-small (1536d)
│      • ~$0.02/1M tokens (6x cheaper)
│      • 90-95% of large model performance
│
├─ Multilingual (100+ languages)
│  └─ Cohere embed-v3 (1024d)
│      • ~$0.10/1M tokens
│
└─ Self-hosted / Privacy-critical
   ├─ English: nomic-embed-text-v1.5 (768d, Apache 2.0)
   ├─ Multilingual: BAAI/bge-m3 (1024d, MIT)
   └─ Long docs: jina-embeddings-v2 (768d, 8K context)
```

## Core Concepts

### Document Chunking Strategy

**Recommended defaults for most RAG systems:**
- **Chunk size:** 512 tokens (not characters)
- **Overlap:** 50 tokens (10% overlap)

**Why these numbers?**
- 512 tokens balances context vs. precision
  - Too small (128-256): Fragments concepts, loses context
  - Too large (1024-2048): Dilutes relevance, wastes LLM tokens
- 50 token overlap ensures sentences aren't split mid-context

See `references/chunking-patterns.md` for advanced strategies by content type.
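As a concrete reference point, here is a minimal token-based chunker. This is a sketch only: it assumes the `tiktoken` tokenizer, and the function name and defaults are illustrative rather than part of this skill's scripts.

```python
# Minimal token-based chunking sketch (assumes the `tiktoken` package).
import tiktoken

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks measured in tokens, not characters."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Semantic and recursive splitters (see the reference above) improve on this by breaking at sentence or heading boundaries instead of fixed token offsets.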

### Hybrid Search (Vector + Keyword)

**Hybrid Search = Vector Similarity + BM25 Keyword Matching**

```
User Query: "OAuth refresh token implementation"
           │
    ┌──────┴──────┐
    │             │
Vector Search   Keyword Search
(Semantic)      (BM25)
    │             │
Top 20 docs   Top 20 docs
    │             │
    └──────┬──────┘
           │
   Reciprocal Rank Fusion
   (Merge + Re-rank)
           │
    Final Top 5 Results
```

**Why hybrid matters:**
- Vector captures semantic meaning ("OAuth refresh" ≈ "token renewal")
- Keyword ensures exact matches ("refresh_token" literal)
- Combined provides best retrieval quality

See `references/hybrid-search.md` for implementation details.
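Qdrant can fuse the two result lists server-side, but the fusion step itself is simple. A minimal reciprocal rank fusion sketch, assuming each input is a list of document IDs ordered by rank (the `k=60` constant is the commonly used default, not a requirement):

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
    """Merge ranked ID lists (e.g. vector results + BM25 results) into one ranking."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Usage (illustrative): fused = reciprocal_rank_fusion([vector_ids, bm25_ids], top_n=5)
```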

## Getting Started

### Python + Qdrant Example

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue,
)

# 1. Initialize client
client = QdrantClient("localhost", port=6333)

# 2. Create collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE)
)

# 3. Insert documents with embeddings
points = [
    PointStruct(
        id=idx,
        vector=embedding,  # From OpenAI/Voyage/etc
        payload={
            "text": chunk_text,
            "source": "docs/api.md",
            "section": "Authentication"
        }
    )
    for idx, (embedding, chunk_text) in enumerate(chunks)
]
client.upsert(collection_name="documents", points=points)

# 4. Search with metadata filtering
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
    query_filter=Filter(
        must=[FieldCondition(key="section", match=MatchValue(value="Authentication"))]
    )
)
```

For complete examples, see `examples/qdrant-python/`.
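The `embedding` values in the snippet above come from whichever provider you chose. A minimal sketch using the OpenAI Python SDK, assuming `OPENAI_API_KEY` is set and `chunk_texts` is an illustrative list of chunk strings; `dimensions=1024` uses Matryoshka shortening to match the 1024-dimension collection:

```python
# Sketch: generating the embeddings used above with the OpenAI Python SDK.
from openai import OpenAI

openai_client = OpenAI()

def embed_batch(texts: list[str]) -> list[list[float]]:
    response = openai_client.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
        dimensions=1024,  # shorten from 3072d to match the collection's size=1024
    )
    return [item.embedding for item in response.data]

# Build the (embedding, chunk_text) pairs iterated in the upsert above.
chunks = list(zip(embed_batch(chunk_texts), chunk_texts))
```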

### TypeScript + Qdrant Example

```typescript
import { QdrantClient } from '@qdrant/js-client-rest';

const client = new QdrantClient({ url: 'http://localhost:6333' });

// Create collection
await client.createCollection('documents', {
  vectors: { size: 1024, distance: 'Cosine' }
});

// Insert documents
await client.upsert('documents', {
  points: chunks.map((chunk, idx) => ({
    id: idx,
    vector: chunk.embedding,
    payload: {
      text: chunk.text,
      source: chunk.source
    }
  }))
});

// Search
const results = await client.search('documents', {
  vector: queryEmbedding,
  limit: 5,
  filter: {
    must: [
      { key: 'source', match: { value: 'docs/api.md' } }
    ]
  }
});
```

For complete examples, see `examples/typescript-rag/`.

## RAG Pipeline Architecture

### Complete Pipeline Components

```
1. INGESTION
   ├─ Document Loading (PDF, web, code, Office)
   ├─ Text Extraction & Cleaning
   ├─ Chunking (semantic, recursive, code-aware)
   └─ Embedding Generation (batch, rate-limited)

2. INDEXING
   ├─ Vector Store Insertion (batch upsert)
   ├─ Index Configuration (HNSW, distance metric)
   └─ Keyword Index (BM25 for hybrid search)

3. RETRIEVAL (Query Time)
   ├─ Query Processing (expansion, embedding)
   ├─ Hybrid Search (vector + keyword)
   ├─ Filtering & Post-Processing (metadata, MMR)
   └─ Re-Ranking (cross-encoder, LLM-based)

4. GENERATION
   ├─ Context Construction (format chunks, citations)
   ├─ Prompt Engineering (system + context + query)
   ├─ LLM Inference (streaming, temperature tuning)
   └─ Response Post-Processing (citations, validation)

5. EVALUATION (Production Critical)
   ├─ Retrieval Metrics (precision, recall, relevancy)
   ├─ Generation Metrics (faithfulness, correctness)
   └─ System Metrics (latency, cost, satisfaction)
```
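A minimal query-time sketch tying stages 3 and 4 together, reusing the Qdrant `client` and the `embed_batch` helper from the earlier snippets. The chat model and prompt wording are illustrative assumptions, not a fixed choice:

```python
# Sketch: retrieval + generation, reusing `client` and `embed_batch` from above.
def answer_question(question: str, top_k: int = 5) -> str:
    query_vector = embed_batch([question])[0]              # RETRIEVAL: embed the query
    hits = client.search(
        collection_name="documents",
        query_vector=query_vector,
        limit=top_k,
    )
    context = "\n\n".join(hit.payload["text"] for hit in hits)
    prompt = (                                              # GENERATION: ground the LLM in context
        "Answer using only the context below and cite the source sections.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",                                # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```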

## Essential Metadata for Production RAG

**Critical for filtering and relevance:**

```python
metadata = {
    # SOURCE TRACKING
    "source": "docs/api-reference.md",
    "source_type": "documentation",  # code, docs, logs, chat
    "last_updated": "2025-12-01T12:00:00Z",

    # HIERARCHICAL CONTEXT
    "section": "Authentication",
    "subsection": "OAuth 2.1",
    "heading_hierarchy": ["API Reference", "Authentication", "OAuth 2.1"],

    # CONTENT CLASSIFICATION
    "content_type": "code_example",  # prose, code, table, list
    "programming_language": "python",

    # FILTERING DIMENSIONS
    "product_version": "v2.0",
    "audience": "enterprise",  # free, pro, enterprise

    # RETRIEVAL HINTS
    "chunk_index": 3,
    "total_chunks": 12,
    "has_code": True
}
```

**Why metadata matters:**
- Enables filtering BEFORE vector search (reduces search space)
- Improves relevance through targeted retrieval
- Supports multi-tenant systems (filter by user/org; see the filter sketch below)
- Enables versioned documentation (filter by product version)

## Evaluation with RAGAS

**Use `scripts/evaluate_rag.py` for automated evaluation:**

```python
from datasets import Dataset  # ragas evaluates a Hugging Face Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # Answer grounded in context
    answer_relevancy,   # Answer addresses query
    context_recall,     # Retrieved docs cover ground truth
    context_precision   # Retrieved docs are relevant
)

# Test dataset (wrapped in a Hugging Face Dataset for ragas)
test_data = Dataset.from_dict({
    "question": ["How do I refresh OAuth tokens?"],
    "answer": ["Use /token with refresh_token grant..."],
    "contexts": [["OAuth refresh documentation..."]],
    "ground_truth": ["POST to /token with grant_type=refresh_token"]
})

# Evaluate
results = evaluate(test_data, metrics=[
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision
])

# Production targets:
# faithfulness: >0.90 (minimal hallucination)
# answer_relevancy: >0.85 (addresses user query)
# context_recall: >0.80 (sufficient context retrieved)
# context_precision: >0.75 (minimal noise)
```

## Performance Optimization

### Embedding Generation
- **Batch processing:** 100-500 chunks per batch
- **Caching:** Cache embeddings by content hash (see the sketch below)
- **Rate limiting:** Respect API provider limits (exponential backoff)
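A sketch of the batching-plus-caching pattern. The cache is an in-memory dict here (a production system would likely use Redis or a database), and `embed_batch` is the illustrative helper from the Getting Started section:

```python
# Sketch: content-hash caching wrapped around batched embedding calls.
import hashlib

_embedding_cache: dict[str, list[float]] = {}  # swap for Redis/DB in production

def embed_with_cache(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    results: list[list[float] | None] = [None] * len(texts)
    missing: list[tuple[int, str]] = []
    for i, text in enumerate(texts):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in _embedding_cache:
            results[i] = _embedding_cache[key]
        else:
            missing.append((i, text))
    for start in range(0, len(missing), batch_size):
        batch = missing[start:start + batch_size]
        vectors = embed_batch([text for _, text in batch])
        for (i, text), vector in zip(batch, vectors):
            _embedding_cache[hashlib.sha256(text.encode("utf-8")).hexdigest()] = vector
            results[i] = vector
    return results
```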

### Vector Search
- **Index type:** HNSW (Hierarchical Navigable Small World) for most cases
- **Distance metric:** Cosine for normalized embeddings
- **Pre-filtering:** Apply metadata filters before vector search
- **Result diversity:** Use MMR (Maximal Marginal Relevance) to reduce redundancy (sketch below)
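A minimal MMR sketch over candidate vectors (numpy). It assumes embeddings are normalized so the dot product equals cosine similarity; `lambda_mult` trades relevance against diversity and its default is illustrative:

```python
# Sketch: Maximal Marginal Relevance re-ranking over normalized candidate vectors.
import numpy as np

def mmr(query_vec: np.ndarray, candidate_vecs: np.ndarray,
        k: int = 5, lambda_mult: float = 0.7) -> list[int]:
    """Return indices of k candidates balancing relevance to the query and diversity."""
    relevance = candidate_vecs @ query_vec  # cosine similarity for normalized vectors
    selected: list[int] = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        if not selected:
            best_pos = int(np.argmax(relevance[remaining]))  # first pick: most relevant
        else:
            selected_vecs = candidate_vecs[selected]
            scores = [
                lambda_mult * relevance[idx]
                - (1 - lambda_mult) * float(np.max(selected_vecs @ candidate_vecs[idx]))
                for idx in remaining
            ]
            best_pos = int(np.argmax(scores))
        selected.append(remaining.pop(best_pos))
    return selected
```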

### Cost Optimization
- **Embedding model:** Consider text-embedding-3-small for budget constraints
- **Dimension reduction:** Use Matryoshka shortening (3072d → 1024d)
- **Caching:** Implement semantic caching for repeated queries (see the sketch below)
- **Batch operations:** Group insertions/updates for efficiency
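A minimal semantic-cache sketch: reuse a previous answer when a new query embedding is close enough to a cached one. The in-memory storage and the similarity threshold are illustrative; a production system would persist entries and evict stale ones:

```python
# Sketch: in-memory semantic cache keyed by query-embedding similarity.
import numpy as np

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []  # (normalized query vector, answer)

    def get(self, query_vec: np.ndarray) -> str | None:
        q = query_vec / np.linalg.norm(query_vec)
        for cached_vec, answer in self.entries:
            if float(q @ cached_vec) >= self.threshold:
                return answer  # cache hit: skip retrieval + generation
        return None

    def put(self, query_vec: np.ndarray, answer: str) -> None:
        q = query_vec / np.linalg.norm(query_vec)
        self.entries.append((q, answer))
```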

## Common Workflows

### 1. Building a RAG Chatbot
- Vector database: Qdrant (self-hosted or cloud)
- Embeddings: OpenAI text-embedding-3-large
- Chunking: 512 tokens, 50 overlap, semantic splitter
- Search: Hybrid (vector + BM25)
- Integration: Frontend with ai-chat skill

See `examples/qdrant-python/` for complete implementation.

### 2. Semantic Search Engine
- Vector database: Qdrant or Pinecone
- Embeddings: Voyage AI voyage-3 (best quality)
- Chunking: Content-type specific (see chunking-patterns.md)
- Search: Hybrid with re-ranking
- Filtering: Pre-filter by metadata (date, category, etc.)

### 3. Code Search
- Vector database: Qdrant
- Embeddings: OpenAI text-embedding-3-large
- Chunking: AST-based (function/class boundaries)
- Metadata: Language, file path, imports
- Search: Hybrid with language filtering

See `examples/qdrant-python/` for code-specific implementation.
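A minimal sketch of AST-based chunking for Python source, splitting at top-level function and class boundaries with the standard-library `ast` module. A tree-sitter based chunker would generalize this to other languages and nested symbols; the metadata keys reuse the schema shown earlier:

```python
# Sketch: chunk Python source at top-level function/class boundaries.
import ast

def chunk_python_source(source: str, path: str) -> list[dict]:
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "text": ast.get_source_segment(source, node),
                "metadata": {
                    "source": path,
                    "programming_language": "python",
                    "symbol": node.name,
                    "content_type": "code_example",
                },
            })
    return chunks
```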

## Integration with Other Skills

### Frontend Skills
- **ai-chat**: Vector DB powers RAG pipeline behind chat interface
- **search-filter**: Replace keyword search with semantic search
- **data-viz**: Visualize embedding spaces, similarity scores

### Backend Skills
- **databases-relational**: Hybrid approach using pgvector extension
- **api-patterns**: Expose semantic search via REST/GraphQL
- **observability**: Monitor embedding quality and retrieval metrics

## Multi-Language Support

### Python (Primary)
- Client: `qdrant-client`
- Framework: LangChain, LlamaIndex
- See: `examples/qdrant-python/`

### Rust
- Client: `qdrant-client` (1,549 code snippets in Context7)
- Framework: Raw Rust for performance-critical systems
- See: `examples/rust-axum-vector/`

### TypeScript
- Client: `@qdrant/js-client-rest`
- Framework: LangChain.js, integration with Next.js
- See: `examples/typescript-rag/`

### Go
- Client: `github.com/qdrant/go-client`
- Use case: High-performance microservices

## Troubleshooting

### Poor Retrieval Quality
1. Check chunking strategy (too large/small?)
2. Verify metadata filtering (too restrictive?)
3. Try hybrid search instead of vector-only
4. Implement re-ranking stage
5. Evaluate with RAGAS metrics

### Slow Performance
1. Use HNSW index (not Flat)
2. Pre-filter with metadata before vector search
3. Reduce vector dimensions (Matryoshka shortening)
4. Batch operations (insertions, searches)
5. Consider GPU acceleration (Milvus)

### High Costs
1. Switch to text-embedding-3-small
2. Implement semantic caching
3. Reduce chunk overlap
4. Use self-hosted embeddings (nomic, bge-m3)
5. Batch embedding generation

## Qdrant Context7 Documentation

**Primary resource:** `/llmstxt/qdrant_tech_llms-full_txt`
- **Trust score:** High
- **Code snippets:** 10,154
- **Quality score:** 83.1

Access via Context7:
```
resolve-library-id({ libraryName: "Qdrant" })
get-library-docs({
  context7CompatibleLibraryID: "/llmstxt/qdrant_tech_llms-full_txt",
  topic: "hybrid search collections python",
  mode: "code"
})
```

## Additional Resources

### Reference Documentation
- `references/qdrant.md` - Comprehensive Qdrant guide
- `references/pgvector.md` - PostgreSQL pgvector extension
- `references/milvus.md` - Milvus/Zilliz for billion-scale
- `references/embedding-strategies.md` - Embedding model comparison
- `references/chunking-patterns.md` - Advanced chunking techniques

### Code Examples
- `examples/qdrant-python/` - FastAPI + Qdrant RAG pipeline
- `examples/pgvector-prisma/` - PostgreSQL + Prisma integration
- `examples/typescript-rag/` - TypeScript RAG with Hono

### Automation Scripts
- `scripts/generate_embeddings.py` - Batch embedding generation
- `scripts/benchmark_similarity.py` - Performance benchmarking
- `scripts/evaluate_rag.py` - RAGAS-based evaluation

---

**Next Steps:**
1. Choose vector database based on scale and infrastructure
2. Select embedding model based on quality vs. cost trade-off
3. Implement chunking strategy for the content type
4. Set up hybrid search for production quality
5. Evaluate with RAGAS metrics
6. Optimize for performance and cost

Overview

This skill provides a complete, practical guide to implementing vector databases for AI/ML applications such as semantic search, RAG chatbots, and recommendation systems. It emphasizes Qdrant as the primary option while covering Pinecone, Milvus, pgvector, Chroma, and embedding providers. The content focuses on chunking, hybrid search patterns, metadata design, and production evaluation. It is written Python-first, with TypeScript examples for integration.

How this skill works

The skill walks through selecting a vector store based on scale and operational preference, choosing embedding models by quality and cost trade-offs, and designing a full RAG pipeline from ingestion to generation and evaluation. It includes concrete code snippets for Qdrant (Python and TypeScript), chunking defaults, hybrid vector+BM25 search, metadata schemas for filtering, and performance optimizations like batching and caching. It also provides evaluation guidance using RAGAS metrics and troubleshooting steps.

When to use it

  • Building RAG-enabled chatbots that must ground responses in private docs
  • Adding semantic search to a documentation site or internal knowledge base
  • Creating recommendation or similarity systems across text, images, or multimodal content
  • Implementing code search or deduplication with metadata-aware filters
  • Scaling to millions of vectors where filtering, re-ranking, and performance tuning matter

Best practices

  • Choose the database by infrastructure and scale: pgvector for smaller PostgreSQL setups, Qdrant for flexible mid-scale deployments, Milvus for >100M vectors with GPU acceleration
  • Use 512-token chunks with ~50-token overlap as a default; adapt by content type (code, long documents, multimodal)
  • Apply metadata pre-filtering to reduce the search space and support multi-tenant and versioned docs
  • Implement hybrid search (vector + BM25) with reciprocal rank fusion for the best recall and precision
  • Batch embedding generation, cache embeddings by content hash, use MMR for result diversity, and add re-ranking for precision

Example use cases

  • RAG chatbot: Qdrant + OpenAI/Voyage embeddings, 512-token chunks, hybrid retrieval, LLM re-ranking
  • Semantic search engine: Voyage embeddings for highest quality, Qdrant or Pinecone, hybrid search and date/category filters
  • Code search: AST-aware chunking, language/file metadata, vector+keyword hybrid with language filter
  • Multi-tenant knowledge base: metadata-driven filtering (org/user, product_version) and evaluation with RAGAS metrics

FAQ

Which vector DB should I pick for prototyping?

Use Chroma or an embedded option like LanceDB for fast local prototypes; migrate to Qdrant or Pinecone for production.

What embedding model balances cost and quality?

text-embedding-3-small (OpenAI) offers strong performance at low cost; use text-embedding-3-large or Voyage for top quality when budget allows.

How do I reduce hallucinations in RAG?

Improve chunking, ensure relevant context retrieval with hybrid search, apply re-ranking (cross-encoder or LLM-based), and measure faithfulness with RAGAS.