
rag-pipeline-builder skill

/.claude/skills/rag-pipeline-builder

This skill scaffolds a production-ready RAG pipeline without LangChain, enabling fast document ingestion, semantic search, and streaming API endpoints.

npx playbooks add skill mrowaisabdullah/ai-humanoid-robotics --skill rag-pipeline-builder

---
name: rag-pipeline-builder
description: Complete RAG (Retrieval-Augmented Generation) pipeline implementation with document ingestion, vector storage, semantic search, and response generation. Supports FastAPI backends with OpenAI and Qdrant. LangChain-free architecture.
category: backend
version: 2.0.0
---

# RAG Pipeline Builder Skill

## Purpose

Quickly scaffold and implement production-ready RAG systems with a **pure, lightweight stack** (no LangChain):
- Intelligent document chunking (Recursive + Markdown aware)
- Vector embeddings generation (OpenAI SDK)
- Vector storage and retrieval (Qdrant Client)
- Context-aware response generation
- Streaming API endpoints (FastAPI)

## When to Use This Skill

Use this skill when:
- Building high-performance RAG systems without framework overhead.
- Needing full control over the ingestion and retrieval logic.
- Implementing semantic search for technical documentation.

## Core Capabilities

### 1. Lightweight Document Chunking

Uses a custom `RecursiveTextSplitter` implementation that mimics LangChain's logic but without the dependency bloat.

**Strategy:**
1.  **Protect Code Blocks:** Regex replacement ensures code blocks aren't split in the middle.
2.  **Recursive Splitting:** Splits by paragraphs (`\n\n`), then lines (`\n`), then sentences (`. `) to respect document structure.
3.  **Token Counting:** Uses `tiktoken` for accurate sizing compatible with OpenAI models.

**Implementation Template:**
```python
# See scripts/chunking_example.py for the complete implementation

class IntelligentChunker:
    """
    Markdown-aware chunking that preserves structure (LangChain-free)
    """
    def __init__(self, chunk_size: int = 1000, overlap: int = 200):
        self.chunk_size = chunk_size
        self.overlap = overlap
        # Splitting itself delegates to the standalone RecursiveTextSplitter
```
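The recursive strategy can also be sketched in isolation. The minimal, dependency-free version below sizes chunks by character count as a stand-in for the `tiktoken` token counting the real splitter uses, and omits the code-block protection step:

```python
def split_recursive(text, chunk_size=1000, separators=("\n\n", "\n", ". ")):
    """Split text by coarse separators first, recursing into finer ones.
    A sketch only: sizes by character count, where the real splitter
    counts tokens with tiktoken and protects fenced code blocks first."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) == 1:
            continue  # separator not present; try a finer one
        chunks, current = [], ""
        for i, part in enumerate(parts):
            piece = part if i == len(parts) - 1 else part + sep
            if current and len(current) + len(piece) > chunk_size:
                chunks.append(current.strip())
                current = ""
            current += piece
        if current.strip():
            chunks.append(current.strip())
        # Recurse into any chunk that is still over budget
        return [c for chunk in chunks
                for c in split_recursive(chunk, chunk_size, separators)]
    # No separator matched at all: fall back to a hard cut
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Every chunk ends up within the budget because an oversized piece falls through to a finer separator, and finally to a hard cut.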

### 2. Embedding Generation (OpenAI SDK)

Direct usage of `AsyncOpenAI` client for maximum control and performance.

```python
from openai import AsyncOpenAI

class EmbeddingGenerator:
    def __init__(self, api_key: str):
        self.client = AsyncOpenAI(api_key=api_key)

    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        # One batched API call; callers split large inputs into batches
        response = await self.client.embeddings.create(
            model="text-embedding-3-small",
            input=texts,
        )
        return [item.embedding for item in response.data]
```

### 3. Qdrant Integration (Native Client)

Direct integration with `qdrant-client` for vector operations.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

class QdrantManager:
    def __init__(self, url: str, collection_name: str):
        self.client = QdrantClient(url=url)
        self.collection_name = collection_name

    def upsert_documents(self, documents: list[dict]):
        # Each document dict carries an id, an embedding, and payload metadata
        points = [
            PointStruct(id=d["id"], vector=d["vector"], payload=d["payload"])
            for d in documents
        ]
        self.client.upsert(
            collection_name=self.collection_name,
            points=points,
        )
```
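The document dicts passed to `upsert_documents` can be assembled with a small helper like the one below (an illustrative sketch, not part of the skill's scripts). Deterministic IDs derived from the source path and chunk index mean that re-ingesting the same file overwrites its old points in place:

```python
import uuid

def build_points(chunks: list[str], embeddings: list[list[float]],
                 source: str) -> list[dict]:
    """Assemble Qdrant-style points with traceability metadata."""
    points = []
    for idx, (text, vector) in enumerate(zip(chunks, embeddings)):
        # uuid5 is deterministic: same source + index -> same point ID
        point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}#{idx}"))
        points.append({
            "id": point_id,
            "vector": vector,
            "payload": {"text": text, "source": source, "chunk_index": idx},
        })
    return points
```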

### 4. FastAPI Streaming Endpoints

Native FastAPI streaming response handling.

```python
from fastapi.responses import StreamingResponse

@app.post("/api/v1/chat")
async def chat_endpoint(request: ChatRequest):
    # ... retrieval logic ...

    # `generate` is an async generator (defined by the retrieval logic
    # above) that yields completion tokens as they are produced
    return StreamingResponse(generate(), media_type="text/plain")
```

## Usage Instructions

### 1. Install Lightweight Dependencies

```bash
pip install -r templates/requirements.txt
```

*(Note: `langchain` is NOT required)*

### 2. Ingest Documents

```bash
# Ingest markdown files using the pure-python ingestor
python scripts/ingest_documents.py docs/ --openai-key $OPENAI_API_KEY
```

### 3. Start API Server

```bash
uvicorn templates.fastapi-endpoint-template:app --reload
```

## Performance Benefits

Removing LangChain provides:
- **Faster Startup:** Reduced import overhead.
- **Smaller Docker Image:** Significantly fewer dependencies.
- **Easier Debugging:** No complex abstraction layers or "Chains" to trace through.
- **Stable API:** You own the logic, immune to framework breaking changes.

## Output Format

When this skill is invoked, provide:
1.  **Complete Pipeline Code** (LangChain-free)
2.  **Configuration File** (.env.example)
3.  **Ingestion Script** (scripts/ingest_documents.py)
4.  **FastAPI Endpoints** (api/routes/chat.py)
5.  **Testing Script** (scripts/test_rag.py)

## Time Savings

**With this skill:** ~45 minutes to generate a highly optimized, custom RAG pipeline without framework lock-in.

## Overview

This skill provides a complete, production-ready RAG (Retrieval-Augmented Generation) pipeline implementation that avoids LangChain and other heavy frameworks. It includes markdown-aware document chunking, OpenAI embedding generation, Qdrant vector storage, and FastAPI streaming endpoints for real-time responses. The design emphasizes minimal dependencies, control over ingestion and retrieval logic, and easy deployment.

## How this skill works

Documents are preprocessed with a RecursiveTextSplitter that protects code blocks, then split by paragraphs, lines, and sentences to create semantically coherent chunks sized by token counts via tiktoken. Chunks are embedded using OpenAI's AsyncOpenAI client in batched calls and stored in Qdrant using the native qdrant-client. At query time, the system performs semantic similarity search in Qdrant, assembles context windows, and streams generated answers via FastAPI endpoints using a lightweight response generator.
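The query-time half of that flow can be sketched as glue code, with the embedding, search, and generation stages injected as callables (`embed`, `search`, and `generate_stream` are illustrative placeholders standing in for the classes shown above):

```python
async def answer(query, embed, search, generate_stream, top_k=5):
    """Embed the query, run similarity search, assemble a context
    window, then stream the generated answer token by token."""
    query_vector = (await embed([query]))[0]
    hits = search(query_vector, limit=top_k)   # Qdrant search results
    context = "\n\n".join(hit["payload"]["text"] for hit in hits)
    async for token in generate_stream(query, context):
        yield token                            # stream to the client
```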

## When to use it

- Building high-performance RAG systems where you need full control over ingestion and retrieval logic.
- Implementing semantic search for technical documentation, code bases, or markdown-heavy content.
- Deploying a small-footprint service with faster startup and smaller Docker images.
- Creating streaming API endpoints for chat-like user experiences with low latency.
- Preferring explicit, debuggable pipeline steps rather than abstracted frameworks.

## Best practices

- Protect code blocks and other special markup during chunking to preserve meaning and avoid breaking examples.
- Tune `chunk_size` and `overlap` based on your model's context window; use tiktoken to count tokens, not characters.
- Batch embedding calls to OpenAI to reduce latency and cost; handle rate limits with retries.
- Attach metadata to upserted points (source file, position, headings) so answers can be traced back to their originals.
- Limit retrieved context by relevance score and token budget before sending it to the generator to avoid context drift.
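The batching-with-retries practice can be sketched as a wrapper around any single-batch embedding call. Here `embed_batch` is an assumed async callable (e.g. the `EmbeddingGenerator` method shown earlier), not a function shipped with the skill:

```python
import asyncio
import random

async def embed_all(texts, embed_batch, batch_size=100, retries=3):
    """Embed texts in fixed-size batches, retrying each batch with
    jittered exponential backoff on failure (sketch under assumptions)."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(retries):
            try:
                vectors.extend(await embed_batch(batch))
                break
            except Exception:
                if attempt == retries - 1:
                    raise  # out of retries; surface the error
                # Back off before retrying, with jitter for rate limits
                await asyncio.sleep(2 ** attempt + random.random())
    return vectors
```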

## Example use cases

- Enterprise docs search that returns exact code snippets and contextual answers from internal Markdown manuals.
- Customer support assistant that retrieves relevant KB entries and streams concise answers to users.
- Developer tooling that searches across code samples, READMEs, and design docs to support pair-programming flows.
- Small-scale, private deployments where reducing the third-party dependency surface is a priority.
- Prototyping a custom RAG stack quickly and then incrementally optimizing ingestion and retrieval logic.

## FAQ

**Do I need LangChain or other frameworks?**

No. The pipeline is LangChain-free and implements chunking, embedding, storage, and serving directly.

**Which embedding model is used?**

The templates use OpenAI's text-embedding-3-small by default, with batched AsyncOpenAI calls; you can switch models in configuration.

**How do I ensure responses are traceable to source documents?**

Store rich metadata (file path, chunk index, headings) with each vector in Qdrant and return these references with answers.
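A minimal way to surface those references alongside an answer, assuming each Qdrant payload carries the `source` and `chunk_index` fields shown earlier (illustrative sketch):

```python
def cite_sources(hits: list[dict]) -> list[str]:
    """Turn search hits into deduplicated, user-facing citations."""
    seen, citations = set(), []
    for hit in hits:
        ref = (hit["payload"]["source"], hit["payload"]["chunk_index"])
        if ref not in seen:  # keep first occurrence, preserve rank order
            seen.add(ref)
            citations.append(f"{ref[0]}#chunk-{ref[1]}")
    return citations
```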