dspy-simba-optimizer skill

This skill optimizes DSPy programs with SIMBA (Stochastic Introspective Mini-Batch Ascent), which combines mini-batch sampling with rich feedback signals to balance efficiency and performance.

npx playbooks add skill omidzamani/dspy-skills --skill dspy-simba-optimizer

Review the files below or copy the command above to add this skill to your agents.

---
name: dspy-simba-optimizer
version: "1.0.0"
dspy-compatibility: "3.1.2"
description: This skill should be used when the user asks to "optimize with SIMBA", "use Bayesian optimization", "optimize agents with custom feedback", mentions "SIMBA optimizer", "mini-batch optimization", "statistical optimization", "lightweight optimizer", or needs an alternative to MIPROv2/GEPA for programs with rich feedback signals.
allowed-tools:
  - Read
  - Write
  - Glob
  - Grep
---

# DSPy SIMBA Optimizer

## Goal

Optimize DSPy programs using SIMBA's stochastic mini-batch ascent with statistical analysis of feedback signals.

## When to Use

- Need lighter-weight alternative to GEPA
- Have custom feedback metrics (not just accuracy)
- Agentic tasks with rich failure signals
- Budget-conscious optimization (fewer eval calls)
- Programs where few-shot examples aren't critical

## Related Skills

- Alternative optimizers: [dspy-miprov2-optimizer](../dspy-miprov2-optimizer/SKILL.md), [dspy-gepa-reflective](../dspy-gepa-reflective/SKILL.md)
- Agent optimization: [dspy-react-agent-builder](../dspy-react-agent-builder/SKILL.md)
- Evaluation: [dspy-evaluation-suite](../dspy-evaluation-suite/SKILL.md)

## Inputs

| Input | Type | Description |
|-------|------|-------------|
| `program` | `dspy.Module` | Program to optimize |
| `trainset` | `list[dspy.Example]` | Training examples |
| `metric` | `callable` | Returns float or `dspy.Prediction(score=..., feedback=...)` |
| `max_steps` | `int` | Number of optimization steps |
| `bsize` | `int` | Mini-batch size |

## Outputs

| Output | Type | Description |
|--------|------|-------------|
| `optimized_program` | `dspy.Module` | SIMBA-optimized program |
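
A mini-batch cannot be larger than the set it is drawn from, so `bsize` should not exceed `len(trainset)`. The helper below is a minimal sketch of that check; `clamp_bsize` is a hypothetical convenience function, not part of DSPy.

```python
def clamp_bsize(requested_bsize: int, trainset_size: int) -> int:
    """Return a mini-batch size that does not exceed the training set size."""
    if trainset_size < 1:
        raise ValueError("trainset must contain at least one example")
    if requested_bsize < 1:
        raise ValueError("bsize must be a positive integer")
    return min(requested_bsize, trainset_size)


print(clamp_bsize(32, 20))   # 20: the default-sized batch is capped by a small trainset
print(clamp_bsize(8, 200))   # 8: the requested size already fits
```

The result can then be passed as `bsize` when constructing the optimizer.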

## Workflow

### Phase 1: Understand SIMBA

**SIMBA** (Stochastic Introspective Mini-Batch Ascent):
- Iterative prompt optimization with mini-batch sampling
- Identifies challenging examples with high output variability
- Generates self-reflective rules or adds successful demonstrations
- Lighter than GEPA (no reflection LM)
- More flexible than Bootstrap (uses feedback)

**Comparison:**
- **MIPROv2**: Best accuracy, lots of data
- **GEPA**: Agentic systems, expensive
- **SIMBA**: Custom feedback, budget-friendly
- **Bootstrap**: Simplest, demo-based
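
To make "high output variability" concrete, the sketch below ranks the examples in one mini-batch by how much their scores differ across several candidate programs. It illustrates the idea only and is not SIMBA's actual implementation; `rank_by_variability` and the sample scores are hypothetical.

```python
from statistics import mean


def rank_by_variability(scores_by_example: dict[str, list[float]]) -> list[tuple[str, float, float]]:
    """Rank examples by score spread (max - min), widest first.

    Each value holds the scores that several candidate programs
    obtained on one example.
    """
    ranked = []
    for example_id, scores in scores_by_example.items():
        spread = max(scores) - min(scores)
        ranked.append((example_id, spread, mean(scores)))
    # Widest spread first; ties are broken by lower mean score
    ranked.sort(key=lambda row: (-row[1], row[2]))
    return ranked


# Hypothetical scores for three examples under three candidate programs
batch_scores = {
    "q1": [1.0, 1.0, 1.0],  # always solved: little to learn
    "q2": [0.0, 1.0, 0.5],  # sometimes solved: informative
    "q3": [0.0, 0.0, 0.3],  # mostly failed
}
for example_id, spread, avg in rank_by_variability(batch_scores):
    print(f"{example_id}: spread={spread:.2f}, mean={avg:.2f}")
```

Examples such as `q2`, where some candidates succeed and others fail, are the ones most likely to show what separates a good trajectory from a bad one.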

### Phase 2: Basic SIMBA Optimization

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# Program to optimize
class QAPipeline(dspy.Module):
    def __init__(self):
        self.generate = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.generate(question=question)

# Metric: returns a float here; it may instead return
# dspy.Prediction(score=..., feedback=...), as in Phase 3
def qa_metric(example, pred, trace=None):
    correct = example.answer.lower() in pred.answer.lower()
    return 1.0 if correct else 0.0

# SIMBA optimizer
optimizer = dspy.SIMBA(
    metric=qa_metric,
    max_steps=10,  # Optimization iterations
    bsize=5  # Mini-batch size
)

# trainset: a list[dspy.Example] that you supply, each marked with .with_inputs("question")
program = QAPipeline()
compiled = optimizer.compile(program, trainset=trainset)
compiled.save("qa_simba.json")
```

### Phase 3: SIMBA with Feedback Signals

SIMBA works best with rich feedback:

```python
import dspy

def detailed_metric(example, pred, trace=None):
    """Metric with feedback signal."""
    expected = example.answer.lower()
    actual = pred.answer.lower()

    if expected == actual:
        return dspy.Prediction(score=1.0, feedback="Perfect match")
    elif expected in actual:
        return dspy.Prediction(score=0.7, feedback=f"Contains answer but verbose: '{actual}'")
    else:
        overlap = len(set(expected.split()) & set(actual.split()))
        if overlap > 0:
            return dspy.Prediction(score=0.3, feedback=f"Partial overlap: {overlap} words")
        return dspy.Prediction(score=0.0, feedback=f"No match. Expected '{expected}'")

optimizer = dspy.SIMBA(
    metric=detailed_metric,
    max_steps=20,  # Optimization iterations
    bsize=8  # Mini-batch size
)

# `program` and `trainset` are defined as in Phase 2
compiled = optimizer.compile(program, trainset=trainset)
```
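
A graded metric only adds information if its intermediate tiers actually fire. One way to check this before spending optimization budget is to look at the score distribution on a handful of examples. The sketch below assumes you have already collected the scores; `score_distribution` and the sample values are hypothetical.

```python
from collections import Counter


def score_distribution(scores: list[float]) -> dict[float, float]:
    """Map each distinct score to the fraction of examples that received it."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = Counter(scores)
    return {score: counts[score] / len(scores) for score in sorted(counts)}


# Hypothetical scores from running a graded metric over 10 dev examples
dev_scores = [1.0, 0.7, 0.0, 0.3, 1.0, 0.0, 0.7, 0.7, 0.0, 1.0]
for score, share in score_distribution(dev_scores).items():
    print(f"score {score:.1f}: {share:.0%} of examples")
```

If nearly all of the mass sits on 0.0 and 1.0, the metric is effectively binary and the extra tiers add little.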

### Phase 4: Production Agent Optimization

```python
import logging

import dspy

logger = logging.getLogger(__name__)

# Define tools as functions
def search(query: str) -> str:
    """Search knowledge base for relevant information."""
    retriever = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
    results = retriever(query, k=3)
    return "\n".join([r['text'] for r in results])

def calculate(expr: str) -> str:
    """Evaluate Python expressions safely."""
    try:
        with dspy.PythonInterpreter() as interp:
            return str(interp.execute(expr))
    except Exception as e:
        return f"Error: {e}"

class ResearchAgent(dspy.Module):
    def __init__(self):
        self.agent = dspy.ReAct(
            "question -> answer",
            tools=[search, calculate]
        )

    def forward(self, question):
        return self.agent(question=question)

def agent_metric(example, pred, trace=None):
    """Rich metric for agent optimization."""
    expected = example.answer.lower().strip()
    actual = pred.answer.lower().strip() if pred.answer else ""

    # Exact match
    if expected == actual:
        return dspy.Prediction(score=1.0, feedback="Correct answer")

    # Partial match
    if expected in actual:
        return dspy.Prediction(score=0.7, feedback="Answer contains expected result")

    # Check key terms
    expected_terms = set(expected.split())
    actual_terms = set(actual.split())
    overlap = len(expected_terms & actual_terms)

    # Guard against an empty expected answer, which would otherwise always pass
    if expected_terms and overlap >= len(expected_terms) * 0.5:
        return dspy.Prediction(score=0.5, feedback="50%+ term overlap")

    return dspy.Prediction(score=0.0, feedback=f"Incorrect: expected '{example.answer}'")

def optimize_agent(trainset, devset):
    """Full SIMBA optimization pipeline."""
    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

    agent = ResearchAgent()

    # Baseline evaluation. Evaluate calls metric(example, pred), so trace must be optional.
    eval_metric = lambda ex, pred, trace=None: agent_metric(ex, pred, trace).score
    evaluator = dspy.Evaluate(devset=devset, metric=eval_metric, num_threads=4)
    baseline = evaluator(agent)
    # In DSPy 3.x, Evaluate returns a result object whose .score is a percentage (0-100)
    logger.info(f"Baseline: {baseline.score:.2f}%")

    # SIMBA optimization
    optimizer = dspy.SIMBA(
        metric=agent_metric,
        max_steps=25,  # Optimization iterations
        bsize=6  # Mini-batch size
    )

    compiled = optimizer.compile(agent, trainset=trainset)

    # Evaluate optimized
    optimized = evaluator(compiled)
    logger.info(f"SIMBA optimized: {optimized.score:.2f}%")

    compiled.save("research_agent_simba.json")
    return compiled
```
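
The pipeline above reuses one feedback metric for both SIMBA and the evaluator by extracting `.score`. A reusable version of that adapter might look like the sketch below. `score_only` is a hypothetical helper, and `SimpleNamespace` stands in for `dspy.Example` and `dspy.Prediction` so the snippet runs without DSPy installed.

```python
from types import SimpleNamespace


def score_only(metric):
    """Wrap a metric so that it always returns a plain float.

    Works whether the wrapped metric returns a number or an
    object that carries a `.score` attribute.
    """
    def wrapped(example, pred, trace=None):
        result = metric(example, pred, trace)
        return float(getattr(result, "score", result))
    return wrapped


# Stand-in metric for illustration: returns an object on success, a float on failure
def toy_metric(example, pred, trace=None):
    if example.answer == pred.answer:
        return SimpleNamespace(score=1.0, feedback="Correct answer")
    return 0.0


float_metric = score_only(toy_metric)
example = SimpleNamespace(answer="paris")
print(float_metric(example, SimpleNamespace(answer="paris")))  # 1.0
print(float_metric(example, SimpleNamespace(answer="rome")))   # 0.0
```

Because `trace` defaults to `None`, the wrapped metric can be called with two or three arguments.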

## Configuration

```python
optimizer = dspy.SIMBA(
    metric=metric_fn,
    max_steps=20,                          # Optimization iterations (default: 8)
    bsize=32,                              # Mini-batch size (default: 32)
    num_candidates=6,                      # Candidates per iteration (default: 6)
    max_demos=4,                           # Max demos per predictor (default: 4)
    temperature_for_sampling=0.2,          # Sampling temperature (default: 0.2)
    temperature_for_candidates=0.2         # Candidate selection temperature (default: 0.2)
)
```
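
For budgeting, a back-of-the-envelope estimate of program calls can help when choosing these values. The sketch below assumes each step runs every candidate on every mini-batch example `passes_per_step` times. That is a simplifying assumption, not a statement about SIMBA's internals, and real call counts depend on the DSPy version. `estimate_program_calls` and `passes_per_step` are hypothetical names.

```python
def estimate_program_calls(
    max_steps: int,
    bsize: int,
    num_candidates: int,
    passes_per_step: int = 1,
) -> int:
    """Rough count of program executions under the stated assumption."""
    for name, value in (
        ("max_steps", max_steps),
        ("bsize", bsize),
        ("num_candidates", num_candidates),
        ("passes_per_step", passes_per_step),
    ):
        if value < 1:
            raise ValueError(f"{name} must be a positive integer")
    return max_steps * bsize * num_candidates * passes_per_step


# The configuration shown above: 20 steps, batch of 32, 6 candidates
print(estimate_program_calls(max_steps=20, bsize=32, num_candidates=6))  # 3840
# The documented defaults: 8 steps, batch of 32, 6 candidates
print(estimate_program_calls(max_steps=8, bsize=32, num_candidates=6))   # 1536
```

Treat the result as an order-of-magnitude guide and multiply by the number of LM calls per program execution, which is larger for multi-step agents.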

## Best Practices

1. **Use feedback signals** - SIMBA benefits from `dspy.Prediction(score=..., feedback=...)` objects
2. **Balance parameters** - Adjust `bsize` (default 32) and `max_steps` (default 8) based on dataset size
3. **Patience** - SIMBA is slower than Bootstrap, faster than GEPA
4. **Custom metrics** - Best for scenarios with nuanced scoring (not binary)
5. **Tune temperatures** - Lower temperatures (0.1-0.3) for exploitation, higher (0.5-1.0) for exploration
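
The effect of temperature can be seen with a small softmax example. It assumes candidates are sampled with probabilities proportional to `exp(score / temperature)`; whether SIMBA matches this scheme exactly is an assumption. `softmax_probs` and the sample scores are hypothetical.

```python
import math


def softmax_probs(scores: list[float], temperature: float) -> list[float]:
    """Convert scores into sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


candidate_scores = [0.9, 0.7, 0.4]  # hypothetical average scores of three candidates
for temperature in (0.2, 1.0):
    probs = softmax_probs(candidate_scores, temperature)
    print(f"T={temperature}: {[round(p, 3) for p in probs]}")
```

At the lower temperature most of the probability goes to the best-scoring candidate (exploitation); at the higher temperature the probabilities flatten and weaker candidates are sampled more often (exploration).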

## Limitations

- Newer optimizer, less battle-tested than MIPROv2
- Requires thoughtful metric design (garbage in, garbage out)
- Not as thorough as GEPA for agent optimization
- Mini-batch sampling adds variance to results
- No automatic prompt reflection like GEPA

## Official Documentation

- **DSPy Documentation**: https://dspy.ai/
- **DSPy GitHub**: https://github.com/stanfordnlp/dspy
- **SIMBA Optimizer**: https://dspy.ai/api/optimizers/SIMBA/
- **Optimizers Guide**: https://dspy.ai/learn/optimization/optimizers/

Overview

This skill provides SIMBA (Stochastic Introspective Mini-Batch Ascent), a lightweight mini-batch optimizer for DSPy programs. It optimizes prompts and agent pipelines using custom feedback signals rather than accuracy alone. SIMBA is designed as a budget-friendly alternative to heavier methods like GEPA while supporting nuanced metric design.

How this skill works

SIMBA runs iterative mini-batch optimization: it samples batches, evaluates candidates with a user-provided metric, and proposes prompt edits or demonstrations that improve scores. It uses statistical analysis of feedback and output variability to target hard examples, then generates self-reflective rules or adds demonstrations for them. The optimizer supports both simple scalar metrics and rich `dspy.Prediction(score=..., feedback=...)` signals for finer-grained guidance.

When to use it

  • You need a lighter-weight alternative to GEPA for agent/program optimization
  • Your evaluation uses custom or nuanced feedback signals (not just binary accuracy)
  • Budget constraints require fewer evaluation calls and smaller iteration costs
  • Agentic systems with rich failure traces where targeted fixes help
  • Programs where few-shot example counts are flexible and not critical

Best practices

  • Provide a metric that returns dspy.Prediction(score=..., feedback=...) to unlock richer adaptations
  • Tune bsize and max_steps to balance variance and compute (bsize default 32, max_steps default 8)
  • Use lower sampling temperatures (0.1–0.3) to exploit strong candidates and higher ones (0.5–1.0) to explore
  • Start with smaller num_candidates and max_demos, then scale up after observing behavior
  • Validate optimized programs on a held-out dev set to detect overfitting to noisy feedback
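
A held-out dev set can be carved out with a seeded shuffle so that the split is reproducible. This is a minimal sketch; `split_train_dev` is a hypothetical helper and the string examples stand in for `dspy.Example` objects.

```python
import random


def split_train_dev(examples: list, dev_fraction: float = 0.2, seed: int = 0) -> tuple[list, list]:
    """Shuffle a copy of the examples and split it into (trainset, devset)."""
    if len(examples) < 2:
        raise ValueError("need at least two examples to split")
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be strictly between 0 and 1")
    shuffled = list(examples)  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)
    dev_size = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[dev_size:], shuffled[:dev_size]


examples = [f"example-{i}" for i in range(10)]
trainset, devset = split_train_dev(examples)
print(len(trainset), len(devset))  # 8 2
```

Optimize on `trainset` only and report scores on `devset`, so that gains caused by noisy feedback on training examples do not go unnoticed.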

Example use cases

  • Optimizing a QA pipeline where partial matches or verbosity matter and you need nuanced scoring
  • Tuning a ReAct research agent with tool calls and multi-step traces using custom feedback
  • Improving prompt templates for budget-limited production systems where fewer evals are allowed
  • Iteratively adding demonstrations to handle high-variance examples identified by mini-batches
  • Using non-accuracy metrics (e.g., coverage, safety flags, or term overlap) to guide optimization

FAQ

How does SIMBA differ from MIPROv2 and GEPA?

SIMBA is lighter and more budget-friendly than GEPA. It relies on stochastic mini-batch sampling and feedback-driven edits, is less data-hungry than MIPROv2, and supports richer feedback than simple bootstrap approaches.

What kind of metric should I implement?

Prefer returning dspy.Prediction(score=..., feedback=...), where the feedback explains the failure mode; this gives SIMBA better guidance than a scalar alone when it targets hard, high-variance examples.