
dspy-bootstrap-fewshot skill


This skill automatically generates and selects effective few-shot demonstrations for DSPy programs using a teacher model with limited data.

npx playbooks add skill omidzamani/dspy-skills --skill dspy-bootstrap-fewshot

---
name: dspy-bootstrap-fewshot
version: "1.0.0"
dspy-compatibility: "3.1.2"
description: This skill should be used when the user asks to "bootstrap few-shot examples", "generate demonstrations", "use BootstrapFewShot", "optimize with limited data", "create training demos automatically", mentions "teacher model for few-shot", "10-50 training examples", or wants automatic demonstration generation for a DSPy program without extensive compute.
allowed-tools:
  - Read
  - Write
  - Glob
  - Grep
---

# DSPy Bootstrap Few-Shot Optimizer

## Goal

Automatically generate and select optimal few-shot demonstrations for your DSPy program using a teacher model.

## When to Use

- You have **10-50 labeled examples**
- Manual example selection is tedious or suboptimal
- You want demonstrations with reasoning traces
- Quick optimization without extensive compute

## Related Skills

- For more data (200+ examples): [dspy-miprov2-optimizer](../dspy-miprov2-optimizer/SKILL.md)
- For agentic systems: [dspy-gepa-reflective](../dspy-gepa-reflective/SKILL.md)
- Measure improvements: [dspy-evaluation-suite](../dspy-evaluation-suite/SKILL.md)

## Inputs

| Input | Type | Description |
|-------|------|-------------|
| `program` | `dspy.Module` | Your DSPy program to optimize |
| `trainset` | `list[dspy.Example]` | Training examples |
| `metric` | `callable` | Evaluation function |
| `metric_threshold` | `float` | Numerical threshold for accepting demos (optional) |
| `max_bootstrapped_demos` | `int` | Max teacher-generated demos (default: 4) |
| `max_labeled_demos` | `int` | Max direct labeled demos (default: 16) |
| `max_rounds` | `int` | Max bootstrapping attempts per example (default: 1) |
| `teacher_settings` | `dict` | Configuration for teacher model (optional) |

## Outputs

| Output | Type | Description |
|--------|------|-------------|
| `compiled_program` | `dspy.Module` | Optimized program with demos |

## Workflow

### Phase 1: Setup

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Configure the default (student) LM
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
```

### Phase 2: Define Program and Metric

```python
class QA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("question -> answer")
    
    def forward(self, question):
        return self.generate(question=question)

def validate_answer(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()
```

### Phase 3: Compile

```python
optimizer = BootstrapFewShot(
    metric=validate_answer,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
    teacher_settings={'lm': dspy.LM("openai/gpt-4o")}
)

compiled_qa = optimizer.compile(QA(), trainset=trainset)
```

### Phase 4: Use and Save

```python
# Use optimized program
result = compiled_qa(question="What is photosynthesis?")

# Save for production (state-only, recommended)
compiled_qa.save("qa_optimized.json", save_program=False)
```

## Production Example

```python
import dspy
from dspy.teleprompt import BootstrapFewShot
from dspy.evaluate import Evaluate
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ProductionQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.cot = dspy.ChainOfThought("question -> answer")
    
    def forward(self, question: str):
        try:
            return self.cot(question=question)
        except Exception as e:
            logger.error(f"Generation failed: {e}")
            return dspy.Prediction(answer="Unable to answer")

def robust_metric(example, pred, trace=None):
    if not pred.answer or pred.answer == "Unable to answer":
        return 0.0
    return float(example.answer.lower() in pred.answer.lower())

def optimize_with_bootstrap(trainset, devset):
    """Full optimization pipeline with validation."""
    
    # Baseline
    baseline = ProductionQA()
    evaluator = Evaluate(devset=devset, metric=robust_metric, num_threads=4)
    baseline_score = evaluator(baseline).score  # EvaluationResult.score is on a 0-100 scale
    logger.info(f"Baseline: {baseline_score:.2f}%")
    
    # Optimize
    optimizer = BootstrapFewShot(
        metric=robust_metric,
        max_bootstrapped_demos=4,
        max_labeled_demos=4
    )
    
    compiled = optimizer.compile(baseline, trainset=trainset)
    optimized_score = evaluator(compiled).score
    logger.info(f"Optimized: {optimized_score:.2f}%")
    
    if optimized_score > baseline_score:
        compiled.save("production_qa.json", save_program=False)
        return compiled
    
    logger.warning("Optimization didn't improve; keeping baseline")
    return baseline
```
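
Since `robust_metric` is plain Python, it can be sanity-checked without any LM calls. Here `SimpleNamespace` stands in for `dspy.Example` and `dspy.Prediction` (an illustration, not dspy API), with the metric reproduced inline so the block is self-contained:

```python
from types import SimpleNamespace

def robust_metric(example, pred, trace=None):
    if not pred.answer or pred.answer == "Unable to answer":
        return 0.0
    return float(example.answer.lower() in pred.answer.lower())

gold = SimpleNamespace(answer="Paris")

# Case-insensitive substring match counts as correct -> 1.0
assert robust_metric(gold, SimpleNamespace(answer="The capital is Paris.")) == 1.0
# The fallback answer is scored as a failure -> 0.0
assert robust_metric(gold, SimpleNamespace(answer="Unable to answer")) == 0.0
```

Unit-testing the metric this way before compiling helps catch scoring bugs that would otherwise silently corrupt demo selection.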

## Best Practices

1. **Quality over quantity** - 10 excellent examples beat 100 noisy ones
2. **Use stronger teacher** - GPT-4 as teacher for GPT-3.5 student
3. **Validate with held-out set** - Always test on unseen data
4. **Start with 4 demos** - More isn't always better

## Limitations

- Requires labeled training data
- Teacher model costs can add up
- May not generalize to very different inputs
- Limited exploration compared to MIPROv2

## Official Documentation

- **DSPy Documentation**: https://dspy.ai/
- **DSPy GitHub**: https://github.com/stanfordnlp/dspy
- **BootstrapFewShot API**: https://dspy.ai/api/optimizers/BootstrapFewShot/
- **Optimization Guide**: https://dspy.ai/learn/optimization/optimizers/

## Overview

This skill automates the generation and selection of few-shot demonstrations for a DSPy program using a teacher model. It produces high-quality demos and reasoning traces from a limited labeled set (typically 10–50 examples) to improve program performance quickly without heavy compute.

## How this skill works

The optimizer uses a stronger teacher model to generate candidate demonstrations and evaluates each one against the provided metric. It then compiles the best mix of teacher-generated and directly labeled demos into the student program, respecting the configured limits, and improvements should be validated on held-out data.
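
The selection loop can be sketched in plain Python (a simplified illustration, not the actual `BootstrapFewShot` implementation; `teacher_predict` is a hypothetical stand-in for a teacher-model call):

```python
def bootstrap_demos(trainset, teacher_predict, metric, max_bootstrapped_demos=4):
    """Keep teacher outputs that the metric accepts, up to a cap."""
    demos = []
    for example in trainset:
        if len(demos) >= max_bootstrapped_demos:
            break
        pred = teacher_predict(example)   # teacher generates a candidate demo
        if metric(example, pred):         # the metric acts as the acceptance filter
            demos.append((example, pred)) # keep the accepted input/output pair
    return demos

# Toy run with a perfect "teacher" that echoes the label.
trainset = [{"question": "2+2?", "answer": "4"}, {"question": "3+3?", "answer": "6"}]
demos = bootstrap_demos(
    trainset,
    teacher_predict=lambda ex: {"answer": ex["answer"]},
    metric=lambda ex, pred: ex["answer"] == pred["answer"],
)
```

The real optimizer additionally records the full execution trace of each accepted call, which is how reasoning steps end up inside the demos.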

## When to use it

- You have 10–50 labeled examples and want to boost performance
- Manual selection of demonstrations is tedious or inconsistent
- You need demonstrations that include chain-of-thought reasoning traces
- You want a quick optimization pass without large-scale compute
- You plan to validate improvements with a held-out dev set

## Best practices

- Prefer a small set of high-quality labeled examples over a larger noisy one
- Use a stronger teacher model (e.g., the GPT-4 family) for generation
- Start with 4 demonstrations and evaluate before increasing
- Provide a robust metric and a held-out dev set for validation
- Limit teacher-generated demos to control cost and overfitting

## Example use cases

- Bootstrapping few-shot prompts for a QA DSPy program with 20 labeled examples
- Generating chain-of-thought demonstrations for a reasoning task
- Iteratively improving a program by validating compiled versions on a dev set
- Automating demo selection when manual curation is time-consuming
- Quick optimization for production workflows with a limited compute budget

## FAQ

**How many labeled examples do I need?**

This skill is designed for small labeled sets, typically 10–50 examples; quality matters more than quantity.

**Can I control teacher model cost and output count?**

Yes. Configure `teacher_settings` and limits such as `max_bootstrapped_demos` and `max_rounds` to balance cost and coverage.