
dspy-evaluation-suite skill

/skills/dspy-evaluation-suite

This skill helps you systematically evaluate DSPy programs with built-in and custom metrics, supporting baseline creation, program comparisons, and production-readiness checks.

npx playbooks add skill omidzamani/dspy-skills --skill dspy-evaluation-suite


SKILL.md
---
name: dspy-evaluation-suite
version: "1.0.0"
dspy-compatibility: "3.1.2"
description: This skill should be used when the user asks to "evaluate a DSPy program", "test my DSPy module", "measure performance", "create evaluation metrics", "use answer_exact_match or SemanticF1", mentions "Evaluate class", "comparing programs", "establishing baselines", or needs to systematically test and measure DSPy program quality with custom or built-in metrics.
allowed-tools:
  - Read
  - Write
  - Glob
  - Grep
---

# DSPy Evaluation Suite

## Goal

Systematically evaluate DSPy programs using built-in and custom metrics with parallel execution.

## When to Use

- Measuring program performance before/after optimization
- Comparing different program variants
- Establishing baselines
- Validating production readiness

## Related Skills

- Use with any optimizer: [dspy-bootstrap-fewshot](../dspy-bootstrap-fewshot/SKILL.md), [dspy-miprov2-optimizer](../dspy-miprov2-optimizer/SKILL.md), [dspy-gepa-reflective](../dspy-gepa-reflective/SKILL.md)
- Evaluate RAG pipelines: [dspy-rag-pipeline](../dspy-rag-pipeline/SKILL.md)

## Inputs

| Input | Type | Description |
|-------|------|-------------|
| `program` | `dspy.Module` | Program to evaluate |
| `devset` | `list[dspy.Example]` | Evaluation examples |
| `metric` | `callable` | Scoring function |
| `num_threads` | `int` | Parallel threads |

## Outputs

| Output | Type | Description |
|--------|------|-------------|
| `score` | `float` | Average metric score |
| `results` | `list` | Per-example results |

## Workflow

### Phase 1: Setup Evaluator

```python
from dspy.evaluate import Evaluate

evaluator = Evaluate(
    devset=devset,
    metric=my_metric,
    num_threads=8,
    display_progress=True
)
```

### Phase 2: Run Evaluation

```python
result = evaluator(my_program)
print(f"Score: {result.score:.2f}%")
# Access individual results: (example, prediction, score) tuples
for example, pred, score in result.results[:3]:
    print(f"Example: {example.question[:50]}... Score: {score}")
```

## Built-in Metrics

### answer_exact_match

```python
import dspy

# Normalized, case-insensitive comparison
metric = dspy.evaluate.answer_exact_match
```
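Under the hood, `answer_exact_match` normalizes both strings before comparing. As a rough sketch only (not DSPy's exact implementation), the normalization behaves along these lines:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace
    (SQuAD-style answer normalization; a sketch, not DSPy's exact code)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(gold: str, pred: str) -> bool:
    # Two answers match if their normalized forms are identical
    return normalize(gold) == normalize(pred)

print(exact_match("The Eiffel Tower!", "eiffel tower"))  # True
```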

### SemanticF1

LLM-based semantic evaluation (runs an LM judge, so each call costs model tokens):

```python
from dspy.evaluate import SemanticF1

semantic = SemanticF1()
score = semantic(example, prediction)
```

## Custom Metrics

### Basic Metric

```python
def exact_match(example, pred, trace=None):
    """Returns bool, int, or float."""
    return example.answer.lower().strip() == pred.answer.lower().strip()
```
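Because metrics only need attribute access on the example and prediction, they are easy to sanity-check without any LM calls; here `SimpleNamespace` stands in for `dspy.Example` and `dspy.Prediction`:

```python
from types import SimpleNamespace

def exact_match(example, pred, trace=None):
    """Returns bool: normalized string equality of the answer fields."""
    return example.answer.lower().strip() == pred.answer.lower().strip()

# Stand-in objects with the same attribute interface as DSPy's
example = SimpleNamespace(answer="Paris")
pred = SimpleNamespace(answer="  paris ")
print(exact_match(example, pred))  # True
```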

### Multi-Factor Metric

```python
def quality_metric(example, pred, trace=None):
    """Score based on multiple factors."""
    score = 0.0
    
    # Correctness (50%)
    if example.answer.lower() in pred.answer.lower():
        score += 0.5
    
    # Conciseness (25%)
    if len(pred.answer.split()) <= 20:
        score += 0.25
    
    # Has reasoning (25%)
    if hasattr(pred, 'reasoning') and pred.reasoning:
        score += 0.25
    
    return score
```
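The weights above sum to 1.0, so a correct, concise, reasoned answer scores exactly 1.0. A quick sanity check with stand-in objects (`SimpleNamespace` in place of real DSPy examples and predictions):

```python
from types import SimpleNamespace

def quality_metric(example, pred, trace=None):
    """Weighted score: correctness 50%, conciseness 25%, reasoning 25%."""
    score = 0.0
    if example.answer.lower() in pred.answer.lower():
        score += 0.5
    if len(pred.answer.split()) <= 20:
        score += 0.25
    if hasattr(pred, "reasoning") and pred.reasoning:
        score += 0.25
    return score

example = SimpleNamespace(answer="Paris")
good = SimpleNamespace(answer="The capital is Paris.", reasoning="France's capital.")
partial = SimpleNamespace(answer="It is London.")  # wrong answer, no reasoning

print(quality_metric(example, good))     # 1.0
print(quality_metric(example, partial))  # 0.25 (concise only)
```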

### GEPA-Compatible Metric

```python
def feedback_metric(example, pred, trace=None):
    """Returns (score, feedback) for GEPA optimizer."""
    correct = example.answer.lower() in pred.answer.lower()
    
    if correct:
        return 1.0, "Correct answer provided."
    else:
        return 0.0, f"Expected '{example.answer}', got '{pred.answer}'"
```

## Production Example

```python
import dspy
from dspy.evaluate import Evaluate, SemanticF1
import json
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass
class EvaluationResult:
    score: float
    num_examples: int
    correct: int
    incorrect: int
    errors: int

def comprehensive_metric(example, pred, trace=None) -> float:
    """Multi-dimensional evaluation metric."""
    scores = []
    
    # 1. Correctness
    if hasattr(example, 'answer') and hasattr(pred, 'answer'):
        correct = example.answer.lower().strip() in pred.answer.lower().strip()
        scores.append(1.0 if correct else 0.0)
    
    # 2. Completeness (answer not empty or error)
    if hasattr(pred, 'answer'):
        complete = len(pred.answer.strip()) > 0 and "error" not in pred.answer.lower()
        scores.append(1.0 if complete else 0.0)
    
    # 3. Reasoning quality (if available)
    if hasattr(pred, 'reasoning'):
        has_reasoning = len(str(pred.reasoning)) > 20
        scores.append(1.0 if has_reasoning else 0.5)
    
    return sum(scores) / len(scores) if scores else 0.0

class EvaluationSuite:
    def __init__(self, devset, num_threads=8):
        self.devset = devset
        self.num_threads = num_threads
    
    def evaluate(self, program, metric=None) -> EvaluationResult:
        """Run full evaluation with detailed results."""
        metric = metric or comprehensive_metric

        evaluator = Evaluate(
            devset=self.devset,
            metric=metric,
            num_threads=self.num_threads,
            display_progress=True
        )

        eval_result = evaluator(program)

        # Per-example scores; zero scores are counted as outright failures
        scores = [score for example, pred, score in eval_result.results]
        correct = sum(1 for s in scores if s >= 0.5)
        errors = sum(1 for s in scores if s == 0)

        return EvaluationResult(
            score=eval_result.score,
            num_examples=len(self.devset),
            correct=correct,
            incorrect=len(self.devset) - correct - errors,
            errors=errors
        )
    
    def compare(self, programs: dict, metric=None) -> dict:
        """Compare multiple programs."""
        results = {}
        
        for name, program in programs.items():
            logger.info(f"Evaluating: {name}")
            results[name] = self.evaluate(program, metric)
        
        # Rank by score
        ranked = sorted(results.items(), key=lambda x: x[1].score, reverse=True)
        
        print("\n=== Comparison Results ===")
        for rank, (name, result) in enumerate(ranked, 1):
            print(f"{rank}. {name}: {result.score:.2f}%")
        
        return results
    
    def export_report(self, program, output_path: str, metric=None):
        """Export detailed evaluation report."""
        result = self.evaluate(program, metric)
        
        report = {
            "summary": {
                "score": result.score,
                "total": result.num_examples,
                "correct": result.correct,
                "accuracy": result.correct / result.num_examples
            },
            "config": {
                "num_threads": self.num_threads,
                "num_examples": len(self.devset)
            }
        }
        
        with open(output_path, 'w') as f:
            json.dump(report, f, indent=2)
        
        logger.info(f"Report saved to {output_path}")
        return report

# Usage
suite = EvaluationSuite(devset, num_threads=8)

# Single evaluation
result = suite.evaluate(my_program)
print(f"Score: {result.score:.2f}%")

# Compare variants
results = suite.compare({
    "baseline": baseline_program,
    "optimized": optimized_program,
    "finetuned": finetuned_program
})
```

## Best Practices

1. **Hold out test data** - Never optimize on evaluation set
2. **Multiple metrics** - Combine correctness, quality, efficiency
3. **Statistical significance** - Use enough examples (100+)
4. **Track over time** - Version control evaluation results
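Practice 1 implies splitting your data once, up front, with a fixed seed so the held-out set never leaks into optimization. A minimal sketch (the helper name and split fractions are illustrative, not part of DSPy):

```python
import random

def split_dataset(examples, train_frac=0.6, dev_frac=0.2, seed=0):
    """Shuffle once with a fixed seed, then split into train/dev/test.
    Optimize on train, tune on dev, and report only on the held-out test set."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train = int(n * train_frac)
    n_dev = int(n * dev_frac)
    return data[:n_train], data[n_train:n_train + n_dev], data[n_train + n_dev:]

train, dev, test = split_dataset(range(100))
print(len(train), len(dev), len(test))  # 60 20 20
```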

## Limitations

- Metrics are task-specific; no universal measure
- SemanticF1 requires LLM calls (cost)
- Parallel evaluation can hit rate limits
- Edge cases may not be captured

## Official Documentation

- **DSPy Documentation**: https://dspy.ai/
- **DSPy GitHub**: https://github.com/stanfordnlp/dspy
- **Evaluation API**: https://dspy.ai/api/evaluation/
- **Metrics Guide**: https://dspy.ai/learn/evaluation/metrics/

Overview

This skill provides a systematic evaluation suite for DSPy programs, letting you run parallelized tests with built-in or custom metrics and produce per-example results and aggregate scores. It is designed to measure correctness, quality, and other dimensions of program outputs so you can compare variants, establish baselines, and validate readiness for production. The suite supports exact-match, LLM-based SemanticF1, and user-defined multi-factor metrics.

How this skill works

You create an Evaluate instance with a devset, a metric function, and thread settings, then call it with a DSPy Module to run the evaluation. The evaluator runs each example (in parallel when configured), computes a score per example using the provided metric, and returns an overall average score plus per-example tuples (example, prediction, score). Built-in metrics include answer_exact_match and SemanticF1; you can also pass metrics that return a scalar or (score, feedback) pairs for optimizer integrations.

When to use it

  • Measuring performance before and after prompt or model optimization
  • Comparing multiple DSPy program variants to select a best candidate
  • Establishing baseline scores for tracking regressions
  • Validating production readiness with a held-out test set
  • Running automated evaluations as part of an optimization loop

Best practices

  • Keep a held-out test/dev set and never tune on evaluation data
  • Combine multiple metrics (correctness, conciseness, reasoning quality)
  • Use sufficient sample sizes (100+ examples) for statistical confidence
  • Monitor cost when using LLM-based metrics like SemanticF1
  • Run evaluations in parallel but watch for API rate limits

Example use cases

  • Run answer_exact_match across a 1k-example devset to get a baseline accuracy
  • Compare baseline, optimized, and fine-tuned program variants and rank by score
  • Create a multi-factor metric that weights correctness, brevity, and reasoning
  • Use GEPA-compatible metrics to feed back score+feedback into an optimizer
  • Export JSON reports summarizing score, counts, and evaluation config for audits

FAQ

What inputs are required to run an evaluation?

Provide a dspy.Module (program), a devset list of dspy.Example, and a metric callable. Optionally set num_threads and display options.

Can I use custom metrics?

Yes. Metrics can return a boolean/int/float or (score, feedback) tuples for GEPA-style optimizers; they receive (example, prediction, trace).