home / skills / simhacker / moollm / evaluator
This skill assesses client-evaluation outputs against a rubric using an independent evaluator, ensuring unbiased scoring and actionable critique.
npx playbooks add skill simhacker/moollm --skill evaluatorReview the files below or copy the command above to add this skill to your agents.
---
name: evaluator
description: Independent assessment without debate context — adversarial loop prevents gaming
allowed-tools:
- read_file
- write_file
tier: 1
protocol: INDEPENDENT-EVALUATOR
tags: [moollm, evaluation, adversarial, scoring, review]
credits: "Mike Gallaher — independent evaluator pattern"
related: [adversarial-committee, rubric, roberts-rules, room]
---
# Evaluator
> *"Fresh eyes, no bias, just the rubric."*
Committee output goes to a separate model instance with NO debate context.
## The Separation
```yaml
evaluation:
principle: "Evaluator has NO access to:"
- debate_transcript
- speaker_identities
- amendment_history
- voting_patterns
- minority_dissents
evaluator_sees_only:
- final_output
- rubric_criteria
- subject_matter_context
```
## Room Architecture
```yaml
# committee-room/
# ROOM.yml
# debate.yml
# output.yml
# outbox/
# evaluation-request-001.yml
# evaluation-room/
# ROOM.yml
# rubric.yml
# inbox/
# evaluation-request-001.yml # Landed here
# evaluations/
# eval-001.yml
```
## Evaluation Request
```yaml
# Thrown from committee to evaluator
evaluation_request:
id: eval-req-001
from: committee-room
timestamp: "2026-01-05T15:00:00Z"
subject: "Client X Engagement Decision"
output_only: |
Recommendation: Accept Client X with:
- Explicit scope boundaries
- Milestone-based billing
- Quarterly scope review
Confidence: 0.65
Key considerations:
- Revenue opportunity aligns with growth goals
- Risk mitigated by contractual protections
- Capacity impact manageable
rubric: client-evaluation-v1
# Note: NO debate context included
```
## Evaluation Process
```yaml
evaluation:
request: eval-req-001
evaluator: "fresh model instance"
context_loaded: false # Critical!
steps:
1. load_rubric: client-evaluation-v1
2. read_output: evaluation_request.output_only
3. score_each_criterion: independently
4. calculate_weighted_total: true
5. generate_critique: if score < threshold
```
## Evaluation Output
```yaml
# evaluation-room/evaluations/eval-001.yml
evaluation:
id: eval-001
request: eval-req-001
timestamp: "2026-01-05T15:05:00Z"
rubric: client-evaluation-v1
scores:
resource_efficiency:
score: 4
rationale: "Output indicates capacity is manageable"
risk_level:
score: 3
rationale: "Mitigations proposed but not detailed"
confidence: "Would score higher with specific terms"
strategic_alignment:
score: 4
rationale: "Growth goals mentioned, seems aligned"
stakeholder_impact:
score: 3
rationale: "Not explicitly addressed in output"
flag: "Committee should consider stakeholder effects"
weighted_total: 3.45
threshold: 3.5
result: REVIEW # Just below accept
critique:
summary: "Close to acceptance threshold"
gaps:
- "Risk mitigation lacks specifics"
- "Stakeholder impact not addressed"
- "Confidence of 0.65 seems low for recommendation"
suggestions:
- "Detail the milestone structure"
- "Explain how scope boundaries will be enforced"
- "Address impact on existing clients and team"
if_addressed: "Score could reach 3.7+ (accept)"
```
## Revision Loop
```yaml
revision_loop:
max_iterations: 3
flow:
1. committee_outputs: recommendation
2. evaluator_scores: against rubric
3. if score >= threshold: ACCEPT
4. if score < threshold:
- evaluator_generates: critique
- critique_thrown_to: committee inbox
- committee_revises: based on critique
- goto: step 1
5. if max_iterations reached: ESCALATE to human
```
## Adversarial Properties
```yaml
adversarial_separation:
why: "Prevents committee gaming metrics"
committee_cannot:
- see_evaluator_reasoning
- predict_exact_scores
- optimize_for_rubric_loopholes
evaluator_cannot:
- be_influenced_by_debate_dynamics
- favor_particular_speakers
- weight_majority_over_minority
result: "Genuine quality signal"
```
## Commands
| Command | Action |
|---------|--------|
| `EVALUATE [output]` | Send to independent evaluator |
| `APPLY RUBRIC [name]` | Score against criteria |
| `CRITIQUE` | Generate improvement suggestions |
| `REVISE` | Committee addresses critique |
| `ESCALATE` | Send to human decision maker |
## Integration
```mermaid
graph TD
C[Committee] -->|output| T[THROW to outbox]
T -->|lands in| I[Evaluator inbox]
I --> E[Evaluate]
R[RUBRIC.yml] --> E
E --> S{Score}
S -->|≥ threshold| A[✅ Accept]
S -->|< threshold| CR[Generate Critique]
CR -->|THROW back| CI[Committee inbox]
CI --> REV[Revise]
REV --> C
subgraph "No Context Crossing"
I
E
R
end
```
## Model Instance Separation
For true independence:
```yaml
implementation:
option_1:
name: "Fresh conversation"
method: "New chat with no history"
option_2:
name: "Separate model"
method: "Different API call, different instance"
option_3:
name: "System prompt separation"
method: "Explicit instruction: 'You have no prior context'"
key_principle: |
The evaluator must NOT have access to:
- How the committee reached the conclusion
- Who said what
- What alternatives were considered
- Why certain risks were dismissed
Only the final output matters.
```
This skill provides an independent, adversarial evaluation loop that scores committee outputs against a rubric with no access to debate context or speaker identities. It enforces model-instance separation so assessments are fresh, unbiased, and focused solely on the final recommendation. The result is a clear pass/review/escalate signal plus targeted critiques to drive concrete revisions.
The evaluator accepts only the committee's final output, a rubric identifier, and minimal subject-matter context. A fresh model instance loads the rubric, scores each criterion independently, computes a weighted total, and emits a result and critique when scores fall below threshold. Critiques are delivered back to the committee inbox to trigger revision loops or escalation to humans after a capped number of iterations.
Does the evaluator see the debate transcript?
No. The evaluator is deliberately restricted to the final output, rubric, and minimal subject context to prevent bias.
What happens if the score is just below threshold?
The evaluator generates a targeted critique listing gaps and suggestions; the committee can revise up to the configured iteration limit before escalation.
How does this prevent gaming?
By isolating evaluator reasoning and using fresh instances, committees cannot predict exact scores or tune outputs to exploit debate dynamics.