
agent-prompt-evolution skill

/agent-prompt-evolution

This skill tracks agent specialization and meta-agent evolution across experiments, analyzes the ROI of creating specialized agents, and assesses which agents are universal, domain-specific, or task-specific for reuse.

npx playbooks add skill zpankz/mcp-skillset --skill agent-prompt-evolution

Review the files below or copy the command above to add this skill to your agents.

Files (6)
SKILL.md
---
name: Agent Prompt Evolution
description: Track and optimize agent specialization during methodology development. Use when agent specialization emerges (generic agents show >5x performance gap), multi-experiment comparison needed, or methodology transferability analysis required. Captures agent set evolution (Aβ‚™ tracking), meta-agent evolution (Mβ‚™ tracking), specialization decisions (when/why to create specialized agents), and reusability assessment (universal vs domain-specific vs task-specific). Enables systematic cross-experiment learning and optimized Mβ‚€ evolution. 2-3 hours overhead per experiment.
allowed-tools: Read, Grep, Glob, Edit, Write
---

# Agent Prompt Evolution

**Systematically track how agents specialize during methodology development.**

> Specialized agents emerge from need, not prediction. Track their evolution to understand when specialization adds value.

---

## When to Use This Skill

Use this skill when:
- πŸ”„ **Agent specialization emerges**: Generic agents show >5x performance gap
- πŸ“Š **Multi-experiment comparison**: Want to learn across experiments
- 🧩 **Methodology transferability**: Analyzing what's reusable vs domain-specific
- πŸ“ˆ **Mβ‚€ optimization**: Want to evolve base Meta-Agent capabilities
- 🎯 **Specialization decisions**: Deciding when to create new agents
- πŸ“š **Agent library**: Building reusable agent catalog

**Don't use when**:
- ❌ Single experiment with no specialization
- ❌ Generic agents sufficient throughout
- ❌ No cross-experiment learning goals
- ❌ Tracking overhead not worth insights

---

## Quick Start (10 minutes per iteration)

### Track Agent Evolution in Each Iteration

**iteration-N.md template**:

```markdown
## Agent Set Evolution

### Current Agent Set (Aβ‚™)
1. **coder** (generic) - Write code, implement features
2. **doc-writer** (generic) - Documentation
3. **data-analyst** (generic) - Data analysis
4. **coverage-analyzer** (specialized, created iteration 3) - Analyze test coverage gaps

### Changes from Previous Iteration
- Added: coverage-analyzer (10x speedup for coverage analysis)
- Removed: None
- Modified: None

### Specialization Decision
**Why coverage-analyzer?**
- Generic data-analyst took 45 min for coverage analysis
- Identified 10x performance gap
- Coverage analysis is recurring task (every iteration)
- Domain knowledge: Go coverage tools, gap identification patterns
- **ROI**: 3 hours creation cost vs. ~40 min saved/iteration Γ— 3 remaining iterations β‰ˆ 2 hours saved in-experiment; expected reuse in later testing experiments (70% transferable) covers the rest

### Agent Reusability Assessment
- **coder**: Universal (100% transferable)
- **doc-writer**: Universal (100% transferable)
- **data-analyst**: Universal (100% transferable)
- **coverage-analyzer**: Domain-specific (testing methodology, 70% transferable to other languages)

### System State
- Aβ‚™ β‰  Aₙ₋₁ (new agent added)
- System UNSTABLE (need iteration N+1 to confirm stability)
```

---

## Four Tracking Dimensions

### 1. Agent Set Evolution (Aβ‚™)

**Track changes iteration-to-iteration**:

```
Aβ‚€ = {coder, doc-writer, data-analyst}
A₁ = {coder, doc-writer, data-analyst} (unchanged)
Aβ‚‚ = {coder, doc-writer, data-analyst} (unchanged)
A₃ = {coder, doc-writer, data-analyst, coverage-analyzer} (new specialist)
Aβ‚„ = {coder, doc-writer, data-analyst, coverage-analyzer, test-generator} (new specialist)
Aβ‚… = {coder, doc-writer, data-analyst, coverage-analyzer, test-generator} (stable)
```

**Stability**: Aβ‚™ == Aₙ₋₁ for convergence
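
The stability check (Aβ‚™ == Aₙ₋₁) is mechanical enough to script. Below is a minimal sketch, not part of the skill's files; the iteration data mirrors the example above and the agent names are illustrative. The same set comparison applies to Mβ‚™ tracking in the next section.

```python
# Minimal sketch: detect agent-set changes and stability (A_n == A_{n-1}).
# Iteration data mirrors the example above; agent names are illustrative.

iterations = [
    {"coder", "doc-writer", "data-analyst"},                                         # A0
    {"coder", "doc-writer", "data-analyst"},                                         # A1
    {"coder", "doc-writer", "data-analyst"},                                         # A2
    {"coder", "doc-writer", "data-analyst", "coverage-analyzer"},                    # A3
    {"coder", "doc-writer", "data-analyst", "coverage-analyzer", "test-generator"},  # A4
    {"coder", "doc-writer", "data-analyst", "coverage-analyzer", "test-generator"},  # A5
]

for n in range(1, len(iterations)):
    added = sorted(iterations[n] - iterations[n - 1])
    removed = sorted(iterations[n - 1] - iterations[n])
    if not added and not removed:
        print(f"A{n}: stable (A{n} == A{n - 1})")
    else:
        print(f"A{n}: changed (+{added} -{removed})")
```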

### 2. Meta-Agent Evolution (Mβ‚™)

**Standard Mβ‚€ capabilities**:
1. **observe**: Pattern observation
2. **plan**: Iteration planning
3. **execute**: Agent orchestration
4. **reflect**: Value assessment
5. **evolve**: System evolution

**Track enhancements**:

```
Mβ‚€ = {observe, plan, execute, reflect, evolve}
M₁ = {observe, plan, execute, reflect, evolve, gap-identify} (new capability)
Mβ‚‚ = {observe, plan, execute, reflect, evolve, gap-identify} (stable)
```

**Finding** (from 8 experiments): Mβ‚€ sufficient in all cases (no evolution needed)

### 3. Specialization Decision Tree

**When to create specialized agent**:

```
Decision tree:
1. Is generic agent sufficient? (performance within 2x)
   YES β†’ No specialization
   NO β†’ Continue

2. Is task recurring? (happens β‰₯3 times)
   NO β†’ One-off, tolerate slowness
   YES β†’ Continue

3. Is performance gap >5x?
   NO β†’ Tolerate moderate slowness
   YES β†’ Continue

4. Is ROI positive?
   Creation cost < (Time saved per use Γ— Remaining uses)
   NO β†’ Not worth it
   YES β†’ Create specialized agent
```

**Example** (Bootstrap-002):

```
Task: Test coverage gap analysis
Generic agent (data-analyst): 45 min
Potential specialist (coverage-analyzer): 4.5 min (10x faster)

Recurring: YES (every iteration, 3 remaining)
Performance gap: 10x (>5x threshold)
Creation cost: 3 hours
ROI: (45 - 4.5) min Γ— 3 = 121.5 min β‰ˆ 2 hours saved in-experiment
Decision: CREATE (savings nearly cover the 3-hour cost; expected reuse in later testing experiments makes ROI positive)
```
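
The four checks above can be expressed as one small helper. This is a minimal sketch under the thresholds stated in the tree; the function and parameter names are illustrative, not part of the skill. The final prints walk through the Bootstrap-002 numbers.

```python
# Minimal sketch of the specialization decision tree; names are illustrative.

def should_specialize(generic_min: float, specialist_min: float,
                      remaining_uses: int, creation_cost_min: float) -> bool:
    """Apply the four checks: sufficiency, recurrence, gap, ROI."""
    gap = generic_min / specialist_min
    if gap <= 2:                # 1. generic agent is sufficient
        return False
    if remaining_uses < 3:      # 2. task is not recurring
        return False
    if gap <= 5:                # 3. gap below the 5x threshold
        return False
    time_saved_min = (generic_min - specialist_min) * remaining_uses
    return creation_cost_min < time_saved_min  # 4. creation cost vs. expected savings

# Bootstrap-002 numbers: 45 min generic, 4.5 min specialist, 3 remaining iterations,
# 3 hours (180 min) creation cost. In-experiment savings alone (~121 min) fall short;
# counting expected reuse in later experiments (more uses) tips the decision.
print(should_specialize(45, 4.5, 3, 180))  # False on in-experiment savings alone
print(should_specialize(45, 4.5, 6, 180))  # True once future reuse is counted
```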

### 4. Reusability Assessment

**Three categories**:

**Universal** (90-100% transferable):
- Generic agents (coder, doc-writer, data-analyst)
- No domain knowledge required
- Applicable across all domains

**Domain-Specific** (60-80% transferable):
- Requires domain knowledge (testing, CI/CD, error handling)
- Patterns apply within domain
- Needs adaptation for other domains

**Task-Specific** (10-30% transferable):
- Highly specialized for particular task
- One-off creation
- Unlikely to reuse

**Examples**:

```
Agent: coverage-analyzer
Domain: Testing methodology
Transferability: 70%
- Go coverage tools (language-specific, 30% adaptation)
- Gap identification patterns (universal, 100%)
- Overall: 70% transferable to Python/Rust/TypeScript testing

Agent: test-generator
Domain: Testing methodology
Transferability: 40%
- Go test syntax (language-specific, 0% to other languages)
- Test pattern templates (moderately transferable, 60%)
- Overall: 40% transferable

Agent: log-analyzer
Domain: Observability
Transferability: 85%
- Log parsing (universal, 95%)
- Pattern recognition (universal, 100%)
- Structured logging concepts (universal, 100%)
- Go slog specifics (language-specific, 20%)
- Overall: 85% transferable
```
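
The overall percentages above behave like weighted averages of component transferability. Below is a minimal sketch of that calculation; the component weights are assumptions for illustration and are not specified by the examples.

```python
# Minimal sketch: overall transferability as a weighted average of components.
# Weights are illustrative assumptions; the examples above do not state them.

def overall_transferability(components: list[tuple[str, float, float]]) -> float:
    """components: (name, transferable_fraction, weight)."""
    total_weight = sum(weight for _, _, weight in components)
    return sum(frac * weight for _, frac, weight in components) / total_weight

coverage_analyzer = [
    ("Go coverage tooling", 0.30, 0.4),          # language-specific integration
    ("gap identification patterns", 1.00, 0.6),  # universal analysis patterns
]
print(f"{overall_transferability(coverage_analyzer):.0%}")  # ~72%, near the 70% estimate
```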

---

## Evolution Log Template

Create `agents/EVOLUTION-LOG.md`:

```markdown
# Agent Evolution Log

## Experiment Overview
- Domain: Testing Strategy
- Baseline agents: 3 (coder, doc-writer, data-analyst)
- Final agents: 5 (+coverage-analyzer, +test-generator)
- Specialization count: 2

---

## Iteration-by-Iteration Evolution

### Iteration 0
**Agent Set**: {coder, doc-writer, data-analyst}
**Changes**: None (baseline)
**Observations**: Generic agents sufficient for baseline establishment

### Iteration 3
**Agent Set**: {coder, doc-writer, data-analyst, coverage-analyzer}
**Changes**: +coverage-analyzer
**Reason**: 10x performance gap (45 min β†’ 4.5 min)
**Creation Cost**: 3 hours
**ROI**: Positive once expected reuse is counted (2 hours saved over the remaining 3 iterations)
**Reusability**: 70% (domain-specific, testing)

### Iteration 4
**Agent Set**: {coder, doc-writer, data-analyst, coverage-analyzer, test-generator}
**Changes**: +test-generator
**Reason**: 200x performance gap (manual test writing too slow)
**Creation Cost**: 4 hours
**ROI**: Massive (saved 10+ hours)
**Reusability**: 40% (task-specific, Go testing)

### Iteration 5
**Agent Set**: {coder, doc-writer, data-analyst, coverage-analyzer, test-generator}
**Changes**: None
**System**: STABLE (Aβ‚™ == Aₙ₋₁)

---

## Specialization Analysis

### coverage-analyzer
**Purpose**: Analyze test coverage, identify gaps
**Performance**: 10x faster than generic data-analyst
**Domain**: Testing methodology
**Transferability**: 70%
**Lessons**: Coverage gap identification patterns are universal; tool integration is language-specific

### test-generator
**Purpose**: Generate test boilerplate from coverage gaps
**Performance**: 200x faster than manual
**Domain**: Testing methodology (Go-specific)
**Transferability**: 40%
**Lessons**: The high speedup justified the low transferability; test patterns are reusable but Go syntax is not

---

## Cross-Experiment Reuse

### From Previous Experiments
- **validation-builder** (from API design experiment) β†’ Used for smoke test validation
- Reusability: Excellent (validation patterns are universal)
- Adaptation: Minimal (10 min to adapt from API to CI/CD context)

### To Future Experiments
- **coverage-analyzer** β†’ Reusable for Python/Rust/TypeScript testing (70% transferable)
- **test-generator** β†’ Less reusable (40% transferable, needs rewrite for other languages)

---

## Meta-Agent Evolution

### Mβ‚€ Capabilities
{observe, plan, execute, reflect, evolve}

### Changes
None (Mβ‚€ sufficient throughout)

### Observations
- Mβ‚€'s "evolve" capability successfully identified need for specialization
- No Meta-Agent evolution required
- Convergence: Mβ‚™ == Mβ‚€ for all iterations

---

## Lessons Learned

### Specialization Decisions
- **β‰₯10x performance gap** is a clear win (<5x is not worth it; 5x–10x is a judgment call)
- **Positive ROI required**: Creation cost must be justified by time savings
- **Recurring tasks only**: One-off tasks don't justify specialization

### Reusability Patterns
- **Generic agents always reusable**: coder, doc-writer, data-analyst (100%)
- **Domain agents moderately reusable**: coverage-analyzer (70%)
- **Task agents rarely reusable**: test-generator (40%)

### When NOT to Specialize
- Performance gap <5x (tolerable slowness)
- Task is one-off (no recurring benefit)
- Creation cost >ROI (not worth time investment)
- Generic agent will improve with practice (learning curve)
```

---

## Cross-Experiment Analysis

After 3+ experiments, create `agents/CROSS-EXPERIMENT-ANALYSIS.md`:

```markdown
# Cross-Experiment Agent Analysis

## Agent Reuse Matrix

| Agent | Exp1 | Exp2 | Exp3 | Reuse Rate | Transferability |
|-------|------|------|------|------------|-----------------|
| coder | βœ“ | βœ“ | βœ“ | 100% | Universal |
| doc-writer | βœ“ | βœ“ | βœ“ | 100% | Universal |
| data-analyst | βœ“ | βœ“ | βœ“ | 100% | Universal |
| coverage-analyzer | βœ“ | - | βœ“ | 67% | Domain (testing) |
| test-generator | βœ“ | - | - | 33% | Task-specific |
| validation-builder | - | βœ“ | βœ“ | 67% | Domain (validation) |
| log-analyzer | - | - | βœ“ | 33% | Domain (observability) |

## Specialization Patterns

### Universal Agents (100% reuse)
- Generic capabilities (coder, doc-writer, data-analyst)
- No domain knowledge
- Always included in Aβ‚€

### Domain Agents (50-80% reuse)
- Require domain knowledge (testing, CI/CD, observability)
- Reusable within domain
- Examples: coverage-analyzer, validation-builder, log-analyzer

### Task Agents (10-40% reuse)
- Highly specialized
- One-off or rare reuse
- Examples: test-generator (Go-specific)

## Mβ‚€ Sufficiency

**Finding**: Mβ‚€ = {observe, plan, execute, reflect, evolve} sufficient in ALL experiments

**Implications**:
- No Meta-Agent evolution needed
- Base capabilities handle all domains
- Specialization occurs at Agent layer, not Meta-Agent layer

## Specialization Threshold

**Data** (from 3 experiments):
- Average performance gap for specialization: 15x (range: 5x-200x)
- Average creation cost: 3.5 hours (range: 2-5 hours)
- Average ROI: Positive in 8/9 cases (89% success rate)

**Recommendation**: Use 5x performance gap as threshold

---

**Updated**: After each new experiment
```
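
Reuse rates in the matrix can be derived directly from per-experiment agent sets. The sketch below assumes the three experiments shown in the matrix; it is an illustration, not a script shipped with the skill.

```python
# Minimal sketch: compute per-agent reuse rates across experiments.
# Experiment contents mirror the matrix above; names are illustrative.

experiments = {
    "Exp1": {"coder", "doc-writer", "data-analyst", "coverage-analyzer", "test-generator"},
    "Exp2": {"coder", "doc-writer", "data-analyst", "validation-builder"},
    "Exp3": {"coder", "doc-writer", "data-analyst", "coverage-analyzer",
             "validation-builder", "log-analyzer"},
}

all_agents = set().union(*experiments.values())
for agent in sorted(all_agents):
    uses = sum(agent in agents for agents in experiments.values())
    print(f"{agent}: {uses}/{len(experiments)} = {uses / len(experiments):.0%}")
```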

---

## Success Criteria

Agent evolution tracking succeeded when:

1. **Complete tracking**: All agent changes documented each iteration
2. **Specialization justified**: Each specialized agent has clear ROI
3. **Reusability assessed**: Each agent categorized (universal/domain/task)
4. **Cross-experiment learning**: Patterns identified across 2+ experiments
5. **Mβ‚€ stability documented**: Meta-Agent evolution (or lack thereof) tracked

---

## Related Skills

**Parent framework**:
- [methodology-bootstrapping](../methodology-bootstrapping/SKILL.md) - Core OCA cycle

**Complementary**:
- [rapid-convergence](../rapid-convergence/SKILL.md) - Agent stability criterion

---

## References

**Core guide**:
- [Evolution Tracking](reference/tracking.md) - Detailed tracking process
- [Specialization Decisions](reference/specialization.md) - Decision tree
- [Reusability Framework](reference/reusability.md) - Assessment rubric

**Examples**:
- [Bootstrap-002 Evolution](examples/testing-strategy-agent-evolution.md) - 2 specialists
- [Bootstrap-007 No Evolution](examples/ci-cd-no-specialization.md) - Generic sufficient

---

**Status**: βœ… Formalized | 2-3 hours overhead | Enables systematic learning

Overview

This skill tracks and optimizes how agents specialize during methodology development. It captures iteration-by-iteration agent set changes, meta-agent capability evolution, specialization decisions, and reusability assessment to inform systematic cross-experiment learning. Expect ~2–3 hours overhead per experiment to keep logs and run ROI checks.

How this skill works

You record the agent set each iteration (Aβ‚™), note additions and removals, and justify each specialization with a measured performance gap and ROI. You also track meta-agent capabilities (Mβ‚™) to see whether the base orchestration needs to evolve. A decision tree guides when to create specialists, and each agent is classified as universal, domain-specific, or task-specific for reuse analysis.

When to use it

  • When generic agents show a >5x performance gap and specialization is emerging
  • When you need multi-experiment comparison to transfer lessons
  • When optimizing the base Meta-Agent (Mβ‚€) capabilities
  • When deciding whether recurring tasks justify specialist creation
  • When building a reusable agent library across projects

Best practices

  • Log Aβ‚™ changes every iteration with simple templates (agent set, changes, reason, ROI)
  • Measure real task time for generic vs specialist to compute ROI before creation
  • Require that the task recur (β‰₯3 times) before specializing
  • Use a 5x performance gap as a practical trigger; require creation cost < expected time savings
  • Classify agents into universal/domain/task to prioritize maintenance and reuse efforts

Example use cases

  • Iteration log showing coder/doc-writer/data-analyst baseline, then adding coverage-analyzer after 10x gap
  • Cross-experiment matrix summarizing reuse rates (universal vs domain vs task) after 3+ experiments
  • Applying decision tree: generic slow for recurring coverage analysis β†’ create coverage-analyzer with positive ROI
  • Optimizing Mβ‚€: confirm {observe, plan, execute, reflect, evolve} remains sufficient across experiments
  • Building an agent catalog marking transferability percentages for future project planning

FAQ

What minimum data do I need per iteration?

Agent set, time measurements for key tasks (generic vs specialist), creation cost estimate, recurrence count, and a short reusability assessment.
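
One way to keep that minimum data consistent across iterations is a small record type. The sketch below is an illustrative schema, not a format the skill prescribes; the field names are assumptions.

```python
# Illustrative per-iteration record for the minimum tracking data; not prescribed by the skill.
from dataclasses import dataclass, field

@dataclass
class IterationRecord:
    iteration: int
    agent_set: set[str]                                              # A_n
    task_times_min: dict[str, float] = field(default_factory=dict)   # task -> minutes (generic vs specialist)
    creation_cost_min: float = 0.0                                   # for any new specialist this iteration
    recurrence_count: int = 0                                        # how often the candidate task recurs
    reusability: dict[str, str] = field(default_factory=dict)        # agent -> universal / domain / task
```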

When should I avoid creating a specialist?

If performance gap <5x, the task is one-off (<3 occurrences remaining), or creation cost exceeds projected time savings.