
This skill helps you configure Haystack pipelines for document processing and question answering, covering document stores, retrievers, readers, and preprocessing, with best practices built in.

npx playbooks add skill a5c-ai/babysitter --skill haystack-pipeline

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
1.2 KB
---
name: haystack-pipeline
description: Haystack NLP pipeline configuration for document processing and QA
allowed-tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
---

# Haystack Pipeline Skill

## Capabilities

- Configure Haystack pipeline components
- Set up document stores and retrievers
- Implement reader/generator models
- Design custom pipeline graphs
- Configure preprocessing pipelines
- Implement evaluation pipelines

## Target Processes

- rag-pipeline-implementation
- intent-classification-system

## Implementation Details

### Core Components

1. **DocumentStores**: Elasticsearch, Weaviate, FAISS, etc.
2. **Retrievers**: BM25, Dense, Hybrid
3. **Readers/Generators**: Extractive and generative QA
4. **Preprocessors**: Document cleaning and splitting

### Pipeline Types

- Retrieval pipelines
- RAG pipelines
- Evaluation pipelines
- Indexing pipelines

### Configuration Options

- Component selection
- Pipeline graph design
- Document store backend
- Model selection
- Preprocessing settings

### Best Practices

- Modular pipeline design
- Proper preprocessing
- Evaluation integration
- Component versioning

### Dependencies

- haystack-ai
- farm-haystack (legacy)

Overview

This skill configures Haystack NLP pipelines for document processing and question-answering workflows. It helps assemble document stores, retrievers, and reader/generator models into modular, reproducible pipelines tailored for RAG, retrieval, and evaluation tasks. The skill emphasizes practical configuration, preprocessing, and evaluation integration for production-grade NLP stacks.

How this skill works

You define and connect core Haystack components: a document store backend, retriever(s), reader or generator, and optional preprocessors. The skill builds pipeline graphs for retrieval, RAG, indexing, or evaluation, and exposes configuration knobs for component selection, model choice, and preprocessing rules. It supports modular, versioned pipelines that can be evaluated and iterated deterministically.

When to use it

  • Building a retrieval-augmented generation (RAG) service for QA over internal documents
  • Indexing and searching large document collections with a chosen backend (FAISS, Weaviate, Elasticsearch)
  • Prototyping or deploying hybrid retrieval (BM25 + dense) systems
  • Creating evaluation pipelines to compare reader/generator model performance
  • Integrating robust preprocessing (cleaning, splitting) before indexing or retrieval

Best practices

  • Design modular pipeline graphs so components can be swapped without changing the whole workflow
  • Normalize and split documents during preprocessing to improve retrieval relevance
  • Use hybrid retrievers (BM25 + dense) when semantic and lexical signals matter
  • Version component configurations (models, retrievers, stores) to reproduce experiments
  • Integrate evaluation early using held-out queries and metrics to prevent regressions
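One common way to combine lexical and dense rankings is reciprocal rank fusion, shown here as a plain-Python sketch rather than any particular Haystack joiner API; the constant `k=60` is a conventional default:

```python
# Reciprocal rank fusion (RRF): merge ranked lists from BM25 and a dense
# retriever without comparing their incompatible raw scores.
def rrf_merge(rankings, k=60):
    """rankings: list of ranked doc-id lists, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]
dense_hits = ["doc1", "doc5", "doc3"]
merged = rrf_merge([bm25_hits, dense_hits])
print(merged[0])  # a doc ranked high in both lists wins
```

Because RRF uses only ranks, it sidesteps the score-normalization problem that arises when fusing BM25 scores with cosine similarities directly.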

Example use cases

  • Implement a RAG pipeline that retrieves passages from Weaviate and generates answers with a generative model
  • Set up an indexing pipeline that preprocesses PDFs, extracts text, and stores embeddings in FAISS
  • Create an intent-classification augmentation where retrieved context improves classification accuracy
  • Run evaluation pipelines to compare extractive readers versus generative answerers on a QA benchmark
  • Deploy a production retrieval service with Elasticsearch backend and BM25 fallback
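Evaluation comparisons like the last two items can start from simple answer metrics. Below is a self-contained sketch of exact match and token-level F1 using standard SQuAD-style normalization, independent of any Haystack evaluator API:

```python
# SQuAD-style answer metrics for comparing QA systems: exact match after
# light normalization, and token-overlap F1.
import re
import string
from collections import Counter

def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop English articles
    return " ".join(text.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def f1(pred, gold):
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(round(f1("Paris, France", "Paris"), 2))           # 0.67
```

Averaging these over a held-out query set gives a stable baseline for comparing extractive readers against generative answerers.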

FAQ

Which document stores are supported?

Common backends include Elasticsearch, Weaviate, FAISS, and other Haystack-compatible stores; choose based on scale and retrieval latency needs.

Should I use an extractive reader or a generative model?

Use extractive readers for precise span-based answers and generative models when fluency and synthesis across passages are required; evaluation should guide the final choice.