home / skills / eddiebe147 / claude-settings / rag-pipeline-builder

rag-pipeline-builder skill

/skills/rag-pipeline-builder

This skill guides you to design, implement, and optimize retrieval-augmented generation pipelines grounded in your data.

npx playbooks add skill eddiebe147/claude-settings --skill rag-pipeline-builder

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
7.0 KB
---
name: RAG Pipeline Builder
slug: rag-pipeline-builder
description: Build retrieval-augmented generation systems that ground LLM responses in your data
category: ai-ml
complexity: advanced
version: "1.0.0"
author: "ID8Labs"
triggers:
  - "build RAG"
  - "retrieval augmented generation"
  - "RAG pipeline"
  - "ground LLM in data"
  - "knowledge base AI"
tags:
  - RAG
  - retrieval
  - LLM
  - embeddings
  - vector-search
---

# RAG Pipeline Builder

The RAG Pipeline Builder skill guides you through designing and implementing Retrieval-Augmented Generation systems that enhance LLM responses with relevant context from your own data. RAG combines the power of large language models with the precision of information retrieval, reducing hallucinations and enabling AI to work with private, current, or domain-specific knowledge.

This skill covers the complete RAG stack: document ingestion, chunking strategies, embedding generation, vector storage, retrieval optimization, context injection, and response generation. It helps you make informed decisions at each stage based on your specific requirements for accuracy, latency, cost, and scale.

Whether you are building a documentation Q&A bot, a customer support system, or an enterprise knowledge assistant, this skill ensures your RAG implementation follows production best practices.

## Core Workflows

### Workflow 1: Design RAG Architecture
1. **Define** requirements:
   - Data sources and formats
   - Query types and patterns
   - Accuracy requirements
   - Latency budget
   - Scale expectations
2. **Choose** components:
   - Document loaders
   - Chunking strategy
   - Embedding model
   - Vector database
   - LLM for generation
   - Reranking layer (optional)
3. **Design** data flow:
   ```
   Documents → Loader → Chunker → Embedder → Vector DB
                                                  ↓
   Query → Embedder → Vector Search → Reranker → Context
                                                  ↓
   Context + Query → LLM → Response
   ```
4. **Document** architecture decisions

### Workflow 2: Implement Ingestion Pipeline
1. **Set up** document loaders:
   - PDF, Markdown, HTML parsers
   - API connectors for live sources
   - Incremental update handling
2. **Implement** chunking:
   ```python
   def smart_chunk(doc, chunk_size=500, overlap=50):
       # Respect document structure
       sections = extract_sections(doc)
       chunks = []
       for section in sections:
           if len(section) > chunk_size:
               chunks.extend(sliding_window(section, chunk_size, overlap))
           else:
               chunks.append(section)
       return add_metadata(chunks, doc)
   ```
3. **Generate** embeddings with batching
4. **Store** in vector database with metadata
5. **Verify** ingestion quality

### Workflow 3: Optimize Retrieval Quality
1. **Measure** baseline retrieval performance:
   - Recall@k for known queries
   - Mean Reciprocal Rank (MRR)
   - Relevance scoring
2. **Apply** optimization techniques:
   - Query expansion/rewriting
   - Hybrid search (semantic + keyword)
   - Reranking with cross-encoders
   - Metadata filtering
3. **Tune** retrieval parameters:
   - Number of chunks to retrieve (k)
   - Similarity threshold
   - Diversity/MMR settings
4. **Validate** improvements with test set

## Quick Reference

| Action | Command/Trigger |
|--------|-----------------|
| Design RAG system | "Help me design a RAG pipeline for [use case]" |
| Choose vector DB | "Which vector database for RAG" |
| Optimize chunking | "Best chunking strategy for [content type]" |
| Improve retrieval | "My RAG has poor retrieval quality" |
| Reduce hallucinations | "RAG still hallucinating, help fix" |
| Scale pipeline | "Scale RAG to [X] documents" |

## Best Practices

- **Chunk at Semantic Boundaries**: Preserve meaning in chunks
  - Good: Split at paragraphs, sections, or topic boundaries
  - Bad: Fixed-size splits that cut sentences mid-thought
  - Include section headers as context in chunks

- **Include Rich Metadata**: Enable filtering and context
  - Source document, section, page number
  - Timestamps for temporal relevance
  - Categories, tags, or topics
  - Use metadata filters before semantic search

- **Use Hybrid Search**: Combine semantic and keyword search
  - Semantic: Captures meaning and synonyms
  - Keyword (BM25): Catches exact terms, names, codes
  - Weight combination based on query type

- **Rerank for Quality**: Two-stage retrieval improves precision
  - Stage 1: Fast vector search (retrieve 20-50)
  - Stage 2: Cross-encoder reranking (keep top 5-10)
  - Reranking is slower but much more accurate

- **Show Your Work**: Include citations and sources
  - Return source chunks with responses
  - Enable users to verify and explore
  - Build trust through transparency

- **Handle Edge Cases**: What happens when retrieval fails?
  - No relevant results found
  - Conflicting information in sources
  - Query outside knowledge base scope
  - Implement graceful fallbacks

## Advanced Techniques

### Multi-Index Strategy
Use different indexes for different content types:
```
Index 1: FAQs (short, self-contained)
Index 2: Documentation (long-form, structured)
Index 3: Conversations (temporal, contextual)

Route queries to appropriate index based on intent
```

### Query Transformation Pipeline
Improve retrieval with query processing:
```python
def transform_query(query):
    # Step 1: Classify query type
    query_type = classify_query(query)

    # Step 2: Extract entities
    entities = extract_entities(query)

    # Step 3: Generate search queries
    if query_type == "factual":
        return generate_keyword_queries(query, entities)
    elif query_type == "conceptual":
        return generate_semantic_queries(query)
    else:
        return [query]  # Use as-is
```

### Contextual Compression
Reduce noise in retrieved context:
```
Retrieved chunks (verbose) → LLM compressor → Relevant excerpts only
```

### Agentic RAG
Let the LLM control retrieval:
```python
def agentic_rag(query):
    # LLM decides what to search for
    search_plan = llm.plan_searches(query)

    # Execute searches
    results = []
    for search in search_plan:
        results.extend(retriever.search(search.query, filters=search.filters))

    # LLM synthesizes answer
    return llm.synthesize(query, results)
```

### Evaluation Framework
Continuously measure RAG quality:
```
Metrics:
- Retrieval: Precision@k, Recall@k, MRR
- Generation: Faithfulness, Answer Relevance, Context Utilization
- End-to-end: Task Success Rate, User Satisfaction

Tools: Ragas, TruLens, LangSmith
```

## Common Pitfalls to Avoid

- Chunking too large (loses specificity) or too small (loses context)
- Not preserving document structure and hierarchy in chunks
- Ignoring keyword search when exact matches matter
- Retrieving too few chunks (missing information) or too many (context dilution)
- Not handling conflicting information across sources
- Assuming LLM will always use retrieved context correctly
- Skipping evaluation and monitoring in production
- Not updating embeddings when source documents change

Overview

This skill teaches you how to design and implement Retrieval-Augmented Generation (RAG) pipelines that ground LLM responses in your data. It covers the full stack from ingestion and chunking to embeddings, vector storage, retrieval optimization, reranking, and context injection for reliable, verifiable answers. Use it to reduce hallucinations and make LLMs work with private, current, or domain-specific knowledge.

How this skill works

The skill walks through building a data flow: load documents, chunk at semantic boundaries, generate embeddings, and store vectors with rich metadata. At query time the pipeline embeds the query, performs vector (and optional keyword) search, reranks results with a cross-encoder, and injects the most relevant context into the LLM prompt. It also includes techniques for evaluation, query transformation, context compression, multi-index routing, and agentic retrieval where the LLM plans searches.

When to use it

  • You need LLM answers grounded in internal or proprietary documents
  • You want to reduce hallucinations and provide citations
  • You must support domain-specific or frequently changing knowledge
  • You need to scale search across many document types and formats
  • You want higher-precision responses via reranking or hybrid search

Best practices

  • Chunk at semantic boundaries; preserve sections and headers as context
  • Include rich metadata (source, section, timestamps, tags) and use filters before search
  • Use hybrid search (semantic + keyword) when exact terms matter
  • Apply two-stage retrieval: fast vector search then cross-encoder rerank
  • Measure retrieval metrics (Recall@k, MRR) and generation faithfulness continuously
  • Handle edge cases: no results, conflicting sources, and OOV queries with graceful fallbacks

Example use cases

  • Documentation Q&A bot that returns source citations and page links
  • Customer support assistant that searches manuals, tickets, and KB articles
  • Enterprise knowledge assistant that combines policy, contracts, and product data
  • Compliance and audit assistant that finds provenance for statements
  • Conversational agent that routes queries to multiple indexes (FAQs, docs, conversations)

FAQ

How many chunks should I retrieve (k)?

Start with 20–50 for initial retrieval then rerank to top 5–10; tune based on latency, MRR, and prompt token budget.

When should I use reranking?

Use reranking when you need higher precision or when initial vector search returns noisy or closely scored results; accept higher CPU/latency cost for better accuracy.

How do I keep embeddings up to date?

Detect document changes and re-embed updated chunks; use incremental ingestion and version timestamps in metadata to refresh affected vectors.