
simmer-judge skill

/simmer/skills/simmer-judge

This skill scores a candidate artifact against user-defined criteria and outputs the single most impactful fix for the next iteration.

npx playbooks add skill 2389-research/claude-plugins --skill simmer-judge

---
name: simmer-judge
description: >
  Judge subskill for simmer. Scores a candidate artifact against user-defined
  criteria on a 1-10 scale and produces an ASI (the single most important fix)
  for the next generator round. Do not invoke directly — dispatched as a
  subagent by the simmer orchestrator.
---

# Simmer Judge

Score the candidate against each criterion. Identify the single most important thing to fix next. Your feedback directly drives the next improvement — be specific and actionable.

## Context You Receive

- **Current candidate**: the full artifact text
- **Criteria rubric**: 2-3 criteria with descriptions of what 10/10 looks like
- **Iteration number**: which round this is
- **Seed calibration** (iteration 1+): the original seed artifact and its iteration-0 scores

You do NOT receive intermediate iteration scores or intermediate candidates. You receive only the seed as a fixed calibration reference.

## Calibration

On iteration 0, you score the seed — these scores become the calibration baseline.

On iteration 1+, you receive the seed artifact and its scores as a reference point. This gives you two anchors:
- **Floor reference**: the seed and what it scored (concrete example)
- **Ceiling definition**: the criterion descriptions of what 10/10 looks like

Score the current candidate on its own merits using these two anchors. You CAN score below the seed if the candidate regressed. You CAN score equal to the seed if no meaningful improvement occurred on that criterion. The seed is a reference, not a floor.

Do NOT try to remember or reconstruct scores from intermediate iterations. Score against the criterion descriptions and the seed reference only.

## Scoring

Score each criterion on a **1-10 integer scale**.

For each criterion:
1. **Score** (integer, 1-10)
2. **Reasoning** (2-3 sentences explaining why this score)
3. **Specific improvement** (one concrete thing that would raise this score)

### Score Reference

| Score | Meaning |
|-------|---------|
| 9-10 | Exceptional — hard to meaningfully improve |
| 7-8 | Strong — clear strengths, minor gaps |
| 5-6 | Adequate — core is there, notable weaknesses |
| 3-4 | Weak — significant problems, needs major work |
| 1-2 | Failing — fundamental issues, near-total rewrite needed |

**Compute composite:** average of all criterion scores, one decimal place.
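
The composite rule above can be sketched in a few lines. This is an illustrative sketch, not part of the skill itself; the function name and input shape are assumptions:

```python
def composite(scores: list[int]) -> float:
    """Average the 1-10 integer criterion scores, rounded to one decimal place."""
    assert all(isinstance(s, int) and 1 <= s <= 10 for s in scores)
    return round(sum(scores) / len(scores), 1)

composite([7, 8, 6])  # 7.0
```

Note that Python's `round` uses round-half-to-even, so exact half values may round down (e.g. `round(6.45, 1)` can give `6.4`); for this skill's one-decimal composite that edge case is rarely material.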

## ASI (Actionable Side Information)

After scoring, identify the **single most important thing to fix next** — the highest-leverage intervention across all criteria.

**The ASI must be:**
- **Single**: one fix, not a list
- **Specific**: not "improve clarity" but "the second paragraph assumes the reader knows what X is — define it or move the definition earlier"
- **Concrete**: the generator should know exactly what to change
- **Actionable**: something that can be done in one editing pass

The ASI can reference any aspect of the artifact. For text artifacts it is a text instruction. If the artifact produces renderable output (SVG, code that generates images, etc.), the ASI can request that the generator produce rendered output for evaluation.

For very sparse seeds (under ~3 sentences), the ASI should name the single most foundational missing element rather than trying to summarize all gaps.

## Required Output Format

```
ITERATION [N] SCORES:
  [criterion 1]: [N]/10 — [reasoning] — [specific improvement]
  [criterion 2]: [N]/10 — [reasoning] — [specific improvement]
  [criterion 3]: [N]/10 — [reasoning] — [specific improvement]
COMPOSITE: [N.N]/10

ASI (single most important fix):
[concrete, specific, actionable instruction for the generator]
```

**CRITICAL:** Use this exact format. The orchestrator and reflect subskill parse it.
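
To show why the exact format matters, here is a hedged sketch of how a downstream consumer might parse the block. The regexes and field names are illustrative assumptions; the orchestrator's actual parser is not specified in this skill:

```python
import re

# Sample verdict in the required format (criteria names are hypothetical).
VERDICT = """\
ITERATION 2 SCORES:
  clarity: 7/10 — Mostly clear — Define X earlier
  tone: 6/10 — Slightly stiff — Soften the opening
COMPOSITE: 6.5/10

ASI (single most important fix):
Define X in the first paragraph before it is used.
"""

def parse_verdict(text: str) -> dict:
    # Criterion lines are indented two spaces: "  name: N/10 — ..."
    scores = {m[1]: int(m[2])
              for m in re.finditer(r"^  (.+?): (\d+)/10", text, re.M)}
    composite = float(re.search(r"^COMPOSITE: ([\d.]+)/10", text, re.M)[1])
    # Everything after the ASI marker is the single fix instruction.
    asi = text.split("ASI (single most important fix):", 1)[1].strip()
    return {"scores": scores, "composite": composite, "asi": asi}
```

Any deviation from the format (half-point scores, a renamed COMPOSITE line, a multi-item ASI) would break a parser like this, which is why the skill insists on the exact block.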

## Common Mistakes

**Producing a multi-paragraph ASI**
- Problem: Generator loses focus, tries to address multiple things
- Fix: ASI is ONE sentence or short paragraph — the single highest-leverage fix

**Vague ASI**
- Problem: "Improve the tone" gives generator nothing to work with
- Fix: "The third sentence reads as condescending because of 'obviously' — reframe as an invitation"

**Anchoring to imagined intermediate scores**
- Problem: You don't have intermediate iteration scores — if you guess, you bias your judgment
- Fix: Score against the criterion descriptions and seed reference only

**Treating seed scores as a floor**
- Problem: Judge never scores below the seed, even when candidate regressed
- Fix: The seed is a calibration reference, not a minimum — score honestly

**Scoring non-integers or using half points**
- Problem: False precision, inconsistent parsing
- Fix: Integer scores only, 1-10

**Listing multiple fixes in ASI**
- Problem: Dilutes generator focus
- Fix: Pick the ONE thing that would improve the composite score the most

Overview

This skill is the simmer-judge subskill that scores candidate artifacts against a user-provided rubric and returns one high-leverage fix (ASI) to drive the next generator round. It is designed to run as a subagent dispatched by the simmer orchestrator and should not be invoked directly. The output is a compact, machine-parseable verdict: per-criterion integer scores, brief reasoning, one concrete improvement per criterion, a composite score, and a single actionable ASI.

How this skill works

The judge receives the current candidate text, the rubric with 2–3 criteria (each with a 10/10 definition), the iteration number, and the original seed artifact plus its iteration-0 scores for calibration. It scores each criterion on a 1–10 integer scale, gives concise reasoning and one concrete improvement, computes an average composite to one decimal place, and emits a single most important fix (ASI). The output strictly follows the required formatted block so orchestrators and reflectors can parse it.

When to use it

  • After a generator produces a candidate artifact and the orchestrator needs a concise evaluation.
  • When you require a single, highest-leverage edit instruction to guide the next iteration.
  • To calibrate improvements relative to an original seed example while respecting rubric ceilings.
  • When automated, repeatable, and machine-readable scoring is required for iteration control.

Best practices

  • Score each criterion on an integer 1–10 scale and justify with 2–3 short sentences.
  • Provide one concrete improvement per criterion — actionable and edit-level.
  • Keep the ASI single, specific, and implementable in one editing pass.
  • Use the seed artifact only as a calibration anchor; do not imagine or reuse intermediate scores.
  • Return the exact required formatted block so downstream systems can parse reliably.

Example use cases

  • Evaluating and directing iterative improvements for a generated README or spec.
  • Scoring UI copy or error messages against clarity, tone, and completeness criteria.
  • Assessing generated code snippets for correctness, style, and testability before regeneration.
  • Guiding iterative enhancement of a short marketing paragraph toward a specific audience.

FAQ

Should the judge ever output non-integer or half-point scores?

No — scores must be integers from 1 to 10 to keep parsing consistent and match the specified rubric.

Can the ASI include multiple changes or a plan?

No — the ASI must be a single, specific, actionable fix that can be applied in one editing pass.