
dspy-gepa-reflective skill

/skills/dspy-gepa-reflective

This skill optimizes complex agentic systems with reflective feedback and Pareto-based search to improve multi-step agents using textual traces.

npx playbooks add skill omidzamani/dspy-skills --skill dspy-gepa-reflective

Review the files below or copy the command above to add this skill to your agents.

---
name: dspy-gepa-reflective
version: "1.0.0"
dspy-compatibility: "3.1.2"
description: This skill should be used when the user asks to "optimize an agent with GEPA", "use reflective optimization", "optimize ReAct agents", "provide feedback metrics", mentions "GEPA optimizer", "LLM reflection", "execution trajectories", "agentic systems optimization", or needs to optimize complex multi-step agents using textual feedback on execution traces.
allowed-tools:
  - Read
  - Write
  - Glob
  - Grep
---

# DSPy GEPA Optimizer

## Goal

Optimize complex agentic systems using LLM reflection on full execution traces with Pareto-based evolutionary search.

## When to Use

- **Agentic systems** with tool use
- When you have **rich textual feedback** on failures
- Complex multi-step workflows
- Instruction-only optimization needed

## Related Skills

- For non-agentic programs: [dspy-miprov2-optimizer](../dspy-miprov2-optimizer/SKILL.md), [dspy-bootstrap-fewshot](../dspy-bootstrap-fewshot/SKILL.md)
- Measure improvements: [dspy-evaluation-suite](../dspy-evaluation-suite/SKILL.md)

## Inputs

| Input | Type | Description |
|-------|------|-------------|
| `program` | `dspy.Module` | Agent or complex program |
| `trainset` | `list[dspy.Example]` | Training examples |
| `metric` | `callable` | Must return `(score, feedback)` tuple |
| `reflection_lm` | `dspy.LM` | Strong LM for reflection (e.g., GPT-4-class) |
| `auto` | `str` | Budget preset: `"light"`, `"medium"`, or `"heavy"` |

## Outputs

| Output | Type | Description |
|--------|------|-------------|
| `compiled_program` | `dspy.Module` | Reflectively optimized program |

## Workflow

### Phase 1: Define Feedback Metric

GEPA requires metrics that return *textual feedback*:

```python
def gepa_metric(example, pred, trace=None):
    """Must return a (score, feedback) tuple."""
    is_correct = example.answer.lower() in pred.answer.lower()
    score = 1.0 if is_correct else 0.0

    if is_correct:
        feedback = "Correct. The answer accurately addresses the question."
    else:
        feedback = (
            f"Incorrect. Expected '{example.answer}' but got '{pred.answer}'. "
            "The model may have misunderstood the question or retrieved irrelevant information."
        )

    return score, feedback
```
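Before launching a costly optimization run, the tuple contract is worth sanity-checking offline. A minimal sketch, using `SimpleNamespace` stand-ins for the dspy example and prediction objects (only the `.answer` attribute matters here; these mocks are illustrative, not dspy types):

```python
from types import SimpleNamespace

def gepa_metric(example, pred, trace=None):
    """Must return a (score, feedback) tuple."""
    is_correct = example.answer.lower() in pred.answer.lower()
    score = 1.0 if is_correct else 0.0
    feedback = (
        "Correct. The answer accurately addresses the question."
        if is_correct
        else f"Incorrect. Expected '{example.answer}' but got '{pred.answer}'."
    )
    return score, feedback

# Lightweight stand-ins for dspy objects; only .answer is read by the metric.
example = SimpleNamespace(answer="Paris")
good = SimpleNamespace(answer="The capital is Paris.")
bad = SimpleNamespace(answer="Lyon")

score, fb = gepa_metric(example, good)
assert score == 1.0 and fb.startswith("Correct")

score, fb = gepa_metric(example, bad)
assert score == 0.0 and "Expected 'Paris'" in fb
```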

### Phase 2: Setup Agent

```python
import dspy

def search(query: str) -> list[str]:
    """Search knowledge base for relevant information."""
    rm = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
    results = rm(query, k=3)
    return results if isinstance(results, list) else [results]

def calculate(expression: str) -> float:
    """Safely evaluate mathematical expressions."""
    with dspy.PythonInterpreter() as interp:
        return interp.execute(expression)

agent = dspy.ReAct("question -> answer", tools=[search, calculate])
```

### Phase 3: Optimize with GEPA

```python
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

optimizer = dspy.GEPA(
    metric=gepa_metric,
    reflection_lm=dspy.LM("openai/gpt-4o"),  # Strong model for reflection
    auto="medium"
)

compiled_agent = optimizer.compile(agent, trainset=trainset)
```

## Production Example

```python
import dspy
from dspy.evaluate import Evaluate
import logging

logger = logging.getLogger(__name__)

class ResearchAgent(dspy.Module):
    def __init__(self):
        super().__init__()
        self.react = dspy.ReAct(
            "question -> answer",
            tools=[self.search, self.summarize]
        )
    
    def search(self, query: str) -> list[str]:
        """Search for relevant documents."""
        rm = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
        results = rm(query, k=5)
        return results if isinstance(results, list) else [results]
    
    def summarize(self, text: str) -> str:
        """Summarize long text into key points."""
        summarizer = dspy.Predict("text -> summary")
        return summarizer(text=text).summary
    
    def forward(self, question):
        return self.react(question=question)

def detailed_feedback_metric(example, pred, trace=None):
    """Rich feedback for GEPA reflection."""
    expected = example.answer.lower().strip()
    actual = pred.answer.lower().strip() if pred.answer else ""
    
    # Exact match
    if expected == actual:
        return 1.0, "Perfect match. Answer is correct and concise."
    
    # Partial match
    if expected in actual or actual in expected:
        return 0.7, f"Partial match. Expected '{example.answer}', got '{pred.answer}'. Answer contains correct info but may be verbose or incomplete."
    
    # Check for key terms
    expected_terms = set(expected.split())
    actual_terms = set(actual.split())
    overlap = len(expected_terms & actual_terms) / max(len(expected_terms), 1)
    
    if overlap > 0.5:
        return 0.5, f"Some overlap. Expected '{example.answer}', got '{pred.answer}'. Key terms present but answer structure differs."
    
    return 0.0, f"Incorrect. Expected '{example.answer}', got '{pred.answer}'. The agent may need better search queries or reasoning."

def optimize_research_agent(trainset, devset):
    """Full GEPA optimization pipeline."""
    
    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
    
    agent = ResearchAgent()
    
    # Convert metric for evaluation (just score)
    def eval_metric(example, pred, trace=None):
        score, _ = detailed_feedback_metric(example, pred, trace)
        return score
    
    evaluator = Evaluate(devset=devset, num_threads=8, metric=eval_metric)
    baseline = evaluator(agent)
    logger.info(f"Baseline: {baseline.score:.1f}%")  # EvaluationResult.score is 0-100
    
    # GEPA optimization
    optimizer = dspy.GEPA(
        metric=detailed_feedback_metric,
        reflection_lm=dspy.LM("openai/gpt-4o"),
        auto="medium",
        enable_tool_optimization=True  # Also optimize tool descriptions
    )
    
    compiled = optimizer.compile(agent, trainset=trainset)
    optimized = evaluator(compiled)
    logger.info(f"Optimized: {optimized.score:.1f}%")
    
    compiled.save("research_agent_gepa.json")
    return compiled
```

## Tool Optimization

GEPA can jointly optimize predictor instructions AND tool descriptions:

```python
optimizer = dspy.GEPA(
    metric=gepa_metric,
    reflection_lm=dspy.LM("openai/gpt-4o"),
    auto="medium",
    enable_tool_optimization=True  # Optimize tool docstrings too
)
```

## Best Practices

1. **Rich feedback** - More detailed feedback = better reflection
2. **Strong reflection LM** - Use GPT-4 or Claude for reflection
3. **Agentic focus** - Best for ReAct and multi-tool systems
4. **Trace analysis** - GEPA analyzes full execution trajectories
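The trace argument is where trajectory analysis pays off: folding step summaries into the feedback string gives the reflection LM concrete failure context. A hypothetical sketch, assuming the trace is available as a list of `(step_name, output)` pairs — the real dspy trace structure differs, so this only illustrates the feedback-building pattern:

```python
from types import SimpleNamespace

def trace_aware_metric(example, pred, trace=None):
    """Append a step-by-step trajectory summary to the textual feedback."""
    correct = example.answer.lower() in (pred.answer or "").lower()
    score = 1.0 if correct else 0.0
    lines = [
        "Correct." if correct
        else f"Incorrect. Expected '{example.answer}', got '{pred.answer}'."
    ]
    for i, (step_name, output) in enumerate(trace or []):
        # Truncate long tool outputs so the reflection prompt stays readable.
        lines.append(f"Step {i}: {step_name} -> {str(output)[:80]}")
    return score, "\n".join(lines)

score, fb = trace_aware_metric(
    SimpleNamespace(answer="Paris"),
    SimpleNamespace(answer="Lyon"),
    trace=[("search", "Lyon is a city in France..."), ("answer", "Lyon")],
)
assert score == 0.0 and "Step 0: search" in fb
```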

## Limitations

- Requires custom feedback metrics (not just scores)
- Expensive: uses strong LM for reflection
- Newer optimizer, less battle-tested than MIPROv2
- Best for instruction optimization, less for demos

## Official Documentation

- **DSPy Documentation**: [https://dspy.ai/](https://dspy.ai/)
- **DSPy GitHub**: [https://github.com/stanfordnlp/dspy](https://github.com/stanfordnlp/dspy)
- **GEPA Optimizer**: [https://dspy.ai/api/optimizers/GEPA/](https://dspy.ai/api/optimizers/GEPA/)
- **Agents Guide**: [https://dspy.ai/tutorials/agents/](https://dspy.ai/tutorials/agents/)

Overview

This skill optimizes complex agentic systems by applying GEPA: a reflective, Pareto-based evolutionary optimizer that uses LLM reflection over full execution traces. It targets ReAct and multi-tool agents, improving instructions and tool descriptions using textual feedback from custom metrics. Use it when you need instruction-only or joint tool+instruction optimization backed by rich, trace-level feedback.

How this skill works

The optimizer requires a metric that returns (score, feedback) pairs so a reflection LM can analyze failures in execution trajectories. GEPA runs evolutionary search guided by LLM reflections (a strong model like GPT-4 or Claude) to propose and evaluate instruction and tool-description edits. It compiles a reflectively optimized module that preserves program structure while improving behavior across a training set.
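The Pareto-based search can be pictured as keeping every candidate that is best on at least one training example, rather than a single global winner: dominated candidates are discarded, while specialists survive to be mutated further. A toy illustration of that selection rule (not GEPA's actual implementation), with hypothetical per-example scores:

```python
def pareto_frontier(candidates):
    """Keep candidates that achieve the best score on at least one example.

    candidates: dict mapping candidate name -> list of per-example scores.
    """
    n_examples = len(next(iter(candidates.values())))
    best_per_example = [
        max(scores[i] for scores in candidates.values()) for i in range(n_examples)
    ]
    return {
        name
        for name, scores in candidates.items()
        if any(s == best_per_example[i] for i, s in enumerate(scores))
    }

scores = {
    "prompt_a": [1.0, 0.0, 1.0],  # best on examples 0 and 2
    "prompt_b": [0.0, 1.0, 0.0],  # best on example 1
    "prompt_c": [0.5, 0.5, 0.5],  # dominated: best on no example
}
assert pareto_frontier(scores) == {"prompt_a", "prompt_b"}
```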

When to use it

  • Optimizing ReAct or multi-tool agents with complex workflows
  • When you have rich textual feedback or can craft feedback-returning metrics
  • Improving instruction prompts and tool docstrings jointly
  • Optimizing systems that produce execution traces for analysis
  • When you need targeted, instruction-level improvements rather than model fine-tuning

Best practices

  • Provide detailed, actionable textual feedback in your metric — the richer the feedback the better GEPA reflects
  • Use a strong reflection LM (GPT-4 class or high-quality Claude) to get reliable reflections
  • Include execution traces in metric callbacks so the optimizer can analyze step-by-step failures
  • Start with auto="light" or "medium" for faster iteration, then escalate to "heavy" for final tuning
  • Enable tool optimization if tool descriptions influence agent behavior

Example use cases

  • Improve a research ReAct agent that searches and summarizes sources, reducing hallucinations and retrieval errors
  • Optimize an instruction-following pipeline where failures are diagnosed via step traces and corrected via prompt edits
  • Jointly refine tool docstrings and reasoning prompts for a multi-tool customer-support agent
  • Use GEPA to run A/B style evolutionary search across prompt variants using textual feedback metrics
  • Run GEPA as part of a development loop: baseline evaluation → GEPA compile → devset evaluation → iterate

FAQ

What kind of metric does GEPA need?

GEPA needs a metric that returns a (score, feedback) tuple. Feedback must be textual and explain failures or partial successes so the reflection LM can reason about corrections.

Which LMs should I use for reflection?

Use a strong, high-capacity model (GPT-4-class or a high-quality Claude model) for reflection. A lighter LM can run the program itself, but reflection benefits from a stronger model.

Can GEPA optimize tool behavior as well?

Yes — enabling tool optimization lets GEPA edit tool docstrings and descriptions alongside predictor instructions to improve joint agent behavior.