This skill helps you systematically evaluate DSPy programs with built-in and custom metrics, enabling baseline creation, comparisons, and production readiness.
Install with `npx playbooks add skill omidzamani/dspy-skills --skill dspy-evaluation-suite`.
---
name: dspy-evaluation-suite
version: "1.0.0"
dspy-compatibility: "3.1.2"
description: This skill should be used when the user asks to "evaluate a DSPy program", "test my DSPy module", "measure performance", "create evaluation metrics", "use answer_exact_match or SemanticF1", mentions "Evaluate class", "comparing programs", "establishing baselines", or needs to systematically test and measure DSPy program quality with custom or built-in metrics.
allowed-tools:
- Read
- Write
- Glob
- Grep
---
# DSPy Evaluation Suite
## Goal
Systematically evaluate DSPy programs using built-in and custom metrics with parallel execution.
## When to Use
- Measuring program performance before/after optimization
- Comparing different program variants
- Establishing baselines
- Validating production readiness
## Related Skills
- Use with any optimizer: [dspy-bootstrap-fewshot](../dspy-bootstrap-fewshot/SKILL.md), [dspy-miprov2-optimizer](../dspy-miprov2-optimizer/SKILL.md), [dspy-gepa-reflective](../dspy-gepa-reflective/SKILL.md)
- Evaluate RAG pipelines: [dspy-rag-pipeline](../dspy-rag-pipeline/SKILL.md)
## Inputs
| Input | Type | Description |
|-------|------|-------------|
| `program` | `dspy.Module` | Program to evaluate |
| `devset` | `list[dspy.Example]` | Evaluation examples |
| `metric` | `callable` | Scoring function |
| `num_threads` | `int` | Parallel threads |
## Outputs
| Output | Type | Description |
|--------|------|-------------|
| `score` | `float` | Average metric score, reported as a percentage (0-100) |
| `results` | `list` | Per-example `(example, prediction, score)` tuples |
## Workflow
### Phase 1: Setup Evaluator
```python
from dspy.evaluate import Evaluate

evaluator = Evaluate(
    devset=devset,
    metric=my_metric,
    num_threads=8,
    display_progress=True,
)
```
### Phase 2: Run Evaluation
```python
result = evaluator(my_program)
print(f"Score: {result.score:.2f}%")

# Access individual results: (example, prediction, score) tuples
for example, pred, score in result.results[:3]:
    print(f"Example: {example.question[:50]}... Score: {score}")
```
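Once the per-example tuples are available, sorting them by score is a quick way to surface failures. The sketch below assumes only the `(example, prediction, score)` tuple shape described above. `lowest_scoring` is a hypothetical helper, and `SimpleNamespace` objects stand in for real `dspy.Example` and prediction objects.

```python
from types import SimpleNamespace


def lowest_scoring(results, k=3):
    """Return the k (example, prediction, score) tuples with the lowest scores."""
    # float() lets bool scores (True/False) sort alongside numeric ones.
    return sorted(results, key=lambda item: float(item[2]))[:k]


# Stand-in data; in practice this would be `result.results` from the evaluator.
results = [
    (SimpleNamespace(question="Capital of France?"), SimpleNamespace(answer="Paris"), 1.0),
    (SimpleNamespace(question="2 + 2?"), SimpleNamespace(answer="5"), 0.0),
    (SimpleNamespace(question="Largest planet?"), SimpleNamespace(answer="Jupiter, probably"), 0.5),
]

for example, pred, score in lowest_scoring(results, k=2):
    print(f"{score:.2f}  {example.question}  ->  {pred.answer}")
```

Reading the worst cases first usually shows whether the program or the metric is at fault.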
## Built-in Metrics
### answer_exact_match
```python
import dspy
# Normalized, case-insensitive comparison
metric = dspy.evaluate.answer_exact_match
```
### SemanticF1
LLM-based semantic evaluation:
```python
from dspy.evaluate import SemanticF1
semantic = SemanticF1()
score = semantic(example, prediction)
```
## Custom Metrics
### Basic Metric
```python
def exact_match(example, pred, trace=None):
    """Returns bool, int, or float."""
    return example.answer.lower().strip() == pred.answer.lower().strip()
```
### Multi-Factor Metric
```python
def quality_metric(example, pred, trace=None):
    """Score based on multiple factors."""
    score = 0.0

    # Correctness (50%)
    if example.answer.lower() in pred.answer.lower():
        score += 0.5

    # Conciseness (25%)
    if len(pred.answer.split()) <= 20:
        score += 0.25

    # Has reasoning (25%)
    if hasattr(pred, 'reasoning') and pred.reasoning:
        score += 0.25

    return score
```
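DSPy's metrics guide describes a convention where a metric returns a graded score during evaluation (`trace is None`) and a strict pass/fail boolean during optimization, when `trace` is set. A minimal sketch of that pattern follows. The 0.75 threshold and the `trace_aware_metric` name are illustrative assumptions, and `SimpleNamespace` objects stand in for DSPy examples and predictions.

```python
from types import SimpleNamespace


def trace_aware_metric(example, pred, trace=None, threshold=0.75):
    """Graded score for evaluation; strict boolean when an optimizer passes a trace."""
    score = 0.0
    if example.answer.lower() in pred.answer.lower():
        score += 0.5  # correctness
    if len(pred.answer.split()) <= 20:
        score += 0.25  # conciseness
    if getattr(pred, "reasoning", None):
        score += 0.25  # has reasoning
    if trace is not None:
        # During optimization, accept a demonstration only if it clears the bar.
        return score >= threshold
    return score


example = SimpleNamespace(answer="Paris")
pred = SimpleNamespace(answer="The capital is Paris.", reasoning="France's capital city is Paris.")
print(trace_aware_metric(example, pred))            # graded score
print(trace_aware_metric(example, pred, trace=[]))  # pass/fail
```

The threshold is a design choice: a higher bar gives optimizers fewer but cleaner demonstrations.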
### GEPA-Compatible Metric
```python
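import dspy


def feedback_metric(example, pred, trace=None, pred_name=None, pred_trace=None):
    """Returns a score with textual feedback for the GEPA optimizer."""
    # GEPA calls metrics with five arguments and reads `score` and `feedback`
    # from the returned prediction.
    correct = example.answer.lower() in pred.answer.lower()
    if correct:
        return dspy.Prediction(score=1.0, feedback="Correct answer provided.")
    return dspy.Prediction(
        score=0.0,
        feedback=f"Expected '{example.answer}', got '{pred.answer}'",
    )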
```
## Production Example
```python
import json
import logging
from dataclasses import dataclass

from dspy.evaluate import Evaluate

logger = logging.getLogger(__name__)


@dataclass
class EvaluationResult:
    score: float       # aggregate score reported by Evaluate (a percentage)
    num_examples: int
    correct: int       # per-example score >= 0.5
    incorrect: int     # per-example score strictly between 0 and 0.5
    errors: int        # per-example score of exactly 0


def comprehensive_metric(example, pred, trace=None) -> float:
    """Multi-dimensional evaluation metric."""
    scores = []

    # 1. Correctness
    if hasattr(example, 'answer') and hasattr(pred, 'answer'):
        correct = example.answer.lower().strip() in pred.answer.lower().strip()
        scores.append(1.0 if correct else 0.0)

    # 2. Completeness (answer not empty or error)
    if hasattr(pred, 'answer'):
        complete = len(pred.answer.strip()) > 0 and "error" not in pred.answer.lower()
        scores.append(1.0 if complete else 0.0)

    # 3. Reasoning quality (if available)
    if hasattr(pred, 'reasoning'):
        has_reasoning = len(str(pred.reasoning)) > 20
        scores.append(1.0 if has_reasoning else 0.5)

    return sum(scores) / len(scores) if scores else 0.0


class EvaluationSuite:
    def __init__(self, devset, num_threads=8):
        self.devset = devset
        self.num_threads = num_threads

    def evaluate(self, program, metric=None) -> EvaluationResult:
        """Run full evaluation with detailed results."""
        metric = metric or comprehensive_metric
        evaluator = Evaluate(
            devset=self.devset,
            metric=metric,
            num_threads=self.num_threads,
            display_progress=True,
        )
        eval_result = evaluator(program)

        # Extract individual scores from results
        scores = [score for example, pred, score in eval_result.results]
        correct = sum(1 for s in scores if s >= 0.5)
        errors = sum(1 for s in scores if s == 0)

        return EvaluationResult(
            score=eval_result.score,
            num_examples=len(self.devset),
            correct=correct,
            incorrect=len(self.devset) - correct - errors,
            errors=errors,
        )

    def compare(self, programs: dict, metric=None) -> dict:
        """Compare multiple programs."""
        results = {}
        for name, program in programs.items():
            logger.info(f"Evaluating: {name}")
            results[name] = self.evaluate(program, metric)

        # Rank by score
        ranked = sorted(results.items(), key=lambda x: x[1].score, reverse=True)
        print("\n=== Comparison Results ===")
        for rank, (name, result) in enumerate(ranked, 1):
            # The score is already a percentage, so do not use the `%` format spec.
            print(f"{rank}. {name}: {result.score:.2f}%")
        return results

    def export_report(self, program, output_path: str, metric=None):
        """Export detailed evaluation report."""
        result = self.evaluate(program, metric)
        report = {
            "summary": {
                "score": result.score,
                "total": result.num_examples,
                "correct": result.correct,
                # Guard against an empty devset
                "accuracy": result.correct / result.num_examples if result.num_examples else 0.0,
            },
            "config": {
                "num_threads": self.num_threads,
                "num_examples": len(self.devset),
            },
        }
        with open(output_path, 'w') as f:
            json.dump(report, f, indent=2)
        logger.info(f"Report saved to {output_path}")
        return report


# Usage
suite = EvaluationSuite(devset, num_threads=8)

# Single evaluation
result = suite.evaluate(my_program)
print(f"Score: {result.score:.2f}%")

# Compare variants
results = suite.compare({
    "baseline": baseline_program,
    "optimized": optimized_program,
    "finetuned": finetuned_program,
})
```
## Best Practices
1. **Hold out test data** - Never optimize on the evaluation set
2. **Multiple metrics** - Combine correctness, quality, efficiency
3. **Statistical significance** - Use enough examples (100+)
4. **Track over time** - Version control evaluation results
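To judge whether a devset is large enough, it can help to put an interval around the average score. The following is a minimal sketch of a percentile bootstrap over per-example scores. `bootstrap_ci` is a hypothetical helper, not part of DSPy, and the scores shown are stand-in data.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-example scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Stand-in per-example scores: 70 correct out of 100.
scores = [1.0] * 70 + [0.0] * 30
low, high = bootstrap_ci(scores)
print(f"mean={statistics.fmean(scores):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

If the intervals of two program variants overlap heavily, the measured difference may be noise, and a larger devset is worth considering.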
## Limitations
- Metrics are task-specific; no universal measure
- SemanticF1 requires LLM calls (cost)
- Parallel evaluation can hit rate limits
- Edge cases may not be captured
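When parallel evaluation runs into rate limits, lowering `num_threads` is the simplest remedy. For LLM-backed metrics, another option is to retry with exponential backoff. The sketch below is a generic wrapper under stated assumptions: `with_backoff` and `RateLimitError` are hypothetical names, and the real exception type depends on the LM client in use.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for whatever exception the LM client raises."""


def with_backoff(metric, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Wrap a metric so rate-limit failures are retried with exponential backoff."""
    def wrapped(example, pred, trace=None):
        for attempt in range(max_retries + 1):
            try:
                return metric(example, pred, trace)
            except RateLimitError:
                if attempt == max_retries:
                    raise  # out of retries; let the caller see the failure
                sleep(base_delay * (2 ** attempt))  # base, 2x base, 4x base, ...
    return wrapped


# Demo with a metric that fails twice before succeeding.
calls = {"n": 0}
delays = []  # records requested delays instead of actually sleeping


def flaky_metric(example, pred, trace=None):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("simulated rate limit")
    return 1.0


safe_metric = with_backoff(flaky_metric, sleep=delays.append)
print(safe_metric(None, None), delays)
```

Injecting `sleep` keeps the wrapper testable without real waiting; in production the default `time.sleep` applies.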
## Official Documentation
- **DSPy Documentation**: https://dspy.ai/
- **DSPy GitHub**: https://github.com/stanfordnlp/dspy
- **Evaluation API**: https://dspy.ai/api/evaluation/
- **Metrics Guide**: https://dspy.ai/learn/evaluation/metrics/
This skill provides a systematic evaluation suite for DSPy programs, letting you run parallelized tests with built-in or custom metrics and produce per-example results and aggregate scores. It is designed to measure correctness, quality, and other dimensions of program outputs so you can compare variants, establish baselines, and validate readiness for production. The suite supports exact-match, LLM-based SemanticF1, and user-defined multi-factor metrics.
You create an Evaluate instance with a devset, a metric function, and thread settings, then call it with a DSPy Module to run the evaluation. The evaluator runs each example (in parallel when configured), computes a score per example using the provided metric, and returns an overall average score plus per-example tuples (example, prediction, score). Built-in metrics include answer_exact_match and SemanticF1; you can also pass metrics that return a scalar or (score, feedback) pairs for optimizer integrations.
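The flow described above can be sketched in plain Python. This is a conceptual illustration only, not DSPy's implementation: `evaluate_sketch` is a hypothetical function, the toy program and metric are stand-ins, and the rounding of the percentage is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from types import SimpleNamespace


def evaluate_sketch(program, devset, metric, num_threads=4):
    """Run program on every example in parallel, score each output, and average."""
    def run_one(example):
        pred = program(example)
        return example, pred, metric(example, pred)

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(run_one, devset))  # map preserves devset order

    total = sum(float(score) for _, _, score in results)
    score = round(100 * total / len(results), 2) if results else 0.0
    return SimpleNamespace(score=score, results=results)


# Toy stand-ins: the "program" upper-cases the question.
devset = [
    SimpleNamespace(question="a", answer="A"),
    SimpleNamespace(question="b", answer="x"),
]
program = lambda ex: SimpleNamespace(answer=ex.question.upper())
metric = lambda ex, pred, trace=None: ex.answer == pred.answer

out = evaluate_sketch(program, devset, metric)
print(out.score)
```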
**What inputs are required to run an evaluation?**
Provide a `dspy.Module` (the program), a devset (a list of `dspy.Example`), and a metric callable. Optionally set `num_threads` and display options.

**Can I use custom metrics?**
Yes. A metric receives `(example, prediction, trace)` and can return a boolean, int, or float. For GEPA-style optimizers it can also return a score together with textual feedback.