home / skills / simhacker / moollm / evaluator

evaluator skill

This skill assesses client-evaluation outputs against a rubric using an independent evaluator, ensuring unbiased scoring and actionable critique.

npx playbooks add skill simhacker/moollm --skill evaluator

Review the files below or copy the command above to add this skill to your agents.

Files (5)

SKILL.md

5.4 KB

---
name: evaluator
description: Independent assessment without debate context — adversarial loop prevents gaming
allowed-tools:
  - read_file
  - write_file
tier: 1
protocol: INDEPENDENT-EVALUATOR
tags: [moollm, evaluation, adversarial, scoring, review]
credits: "Mike Gallaher — independent evaluator pattern"
related: [adversarial-committee, rubric, roberts-rules, room]
---

# Evaluator

> *"Fresh eyes, no bias, just the rubric."*

Committee output goes to a separate model instance with NO debate context.

## The Separation

```yaml
evaluation:
  principle: "Evaluator has NO access to:"
    - debate_transcript
    - speaker_identities
    - amendment_history
    - voting_patterns
    - minority_dissents
    
  evaluator_sees_only:
    - final_output
    - rubric_criteria
    - subject_matter_context
```

## Room Architecture

```yaml
# committee-room/
#   ROOM.yml
#   debate.yml
#   output.yml
#   outbox/
#     evaluation-request-001.yml

# evaluation-room/
#   ROOM.yml
#   rubric.yml
#   inbox/
#     evaluation-request-001.yml  # Landed here
#   evaluations/
#     eval-001.yml
```

## Evaluation Request

```yaml
# Thrown from committee to evaluator
evaluation_request:
  id: eval-req-001
  from: committee-room
  timestamp: "2026-01-05T15:00:00Z"
  
  subject: "Client X Engagement Decision"
  
  output_only: |
    Recommendation: Accept Client X with:
    - Explicit scope boundaries
    - Milestone-based billing
    - Quarterly scope review
    
    Confidence: 0.65
    
    Key considerations:
    - Revenue opportunity aligns with growth goals
    - Risk mitigated by contractual protections
    - Capacity impact manageable
    
  rubric: client-evaluation-v1
  
  # Note: NO debate context included
```

## Evaluation Process

```yaml
evaluation:
  request: eval-req-001
  evaluator: "fresh model instance"
  context_loaded: false  # Critical!
  
  steps:
    1. load_rubric: client-evaluation-v1
    2. read_output: evaluation_request.output_only
    3. score_each_criterion: independently
    4. calculate_weighted_total: true
    5. generate_critique: if score < threshold
```

## Evaluation Output

```yaml
# evaluation-room/evaluations/eval-001.yml
evaluation:
  id: eval-001
  request: eval-req-001
  timestamp: "2026-01-05T15:05:00Z"
  
  rubric: client-evaluation-v1
  
  scores:
    resource_efficiency:
      score: 4
      rationale: "Output indicates capacity is manageable"
      
    risk_level:
      score: 3
      rationale: "Mitigations proposed but not detailed"
      confidence: "Would score higher with specific terms"
      
    strategic_alignment:
      score: 4
      rationale: "Growth goals mentioned, seems aligned"
      
    stakeholder_impact:
      score: 3
      rationale: "Not explicitly addressed in output"
      flag: "Committee should consider stakeholder effects"
      
  weighted_total: 3.45
  threshold: 3.5
  
  result: REVIEW  # Just below accept
  
  critique:
    summary: "Close to acceptance threshold"
    
    gaps:
      - "Risk mitigation lacks specifics"
      - "Stakeholder impact not addressed"
      - "Confidence of 0.65 seems low for recommendation"
      
    suggestions:
      - "Detail the milestone structure"
      - "Explain how scope boundaries will be enforced"
      - "Address impact on existing clients and team"
      
    if_addressed: "Score could reach 3.7+ (accept)"
```

## Revision Loop

```yaml
revision_loop:
  max_iterations: 3
  
  flow:
    1. committee_outputs: recommendation
    2. evaluator_scores: against rubric
    3. if score >= threshold: ACCEPT
    4. if score < threshold:
         - evaluator_generates: critique
         - critique_thrown_to: committee inbox
         - committee_revises: based on critique
         - goto: step 1
    5. if max_iterations reached: ESCALATE to human
```

## Adversarial Properties

```yaml
adversarial_separation:
  why: "Prevents committee gaming metrics"
  
  committee_cannot:
    - see_evaluator_reasoning
    - predict_exact_scores
    - optimize_for_rubric_loopholes
    
  evaluator_cannot:
    - be_influenced_by_debate_dynamics
    - favor_particular_speakers
    - weight_majority_over_minority
    
  result: "Genuine quality signal"
```

## Commands

| Command | Action |
|---------|--------|
| `EVALUATE [output]` | Send to independent evaluator |
| `APPLY RUBRIC [name]` | Score against criteria |
| `CRITIQUE` | Generate improvement suggestions |
| `REVISE` | Committee addresses critique |
| `ESCALATE` | Send to human decision maker |

## Integration

```mermaid
graph TD
    C[Committee] -->|output| T[THROW to outbox]
    T -->|lands in| I[Evaluator inbox]
    I --> E[Evaluate]
    R[RUBRIC.yml] --> E
    E --> S{Score}
    S -->|≥ threshold| A[✅ Accept]
    S -->|< threshold| CR[Generate Critique]
    CR -->|THROW back| CI[Committee inbox]
    CI --> REV[Revise]
    REV --> C
    
    subgraph "No Context Crossing"
    I
    E
    R
    end
```

## Model Instance Separation

For true independence:

```yaml
implementation:
  option_1:
    name: "Fresh conversation"
    method: "New chat with no history"
    
  option_2:
    name: "Separate model"
    method: "Different API call, different instance"
    
  option_3:
    name: "System prompt separation"
    method: "Explicit instruction: 'You have no prior context'"
    
  key_principle: |
    The evaluator must NOT have access to:
    - How the committee reached the conclusion
    - Who said what
    - What alternatives were considered
    - Why certain risks were dismissed
    
    Only the final output matters.
```

Overview

This skill provides an independent, adversarial evaluation loop that scores committee outputs against a rubric with no access to debate context or speaker identities. It enforces model-instance separation so assessments are fresh, unbiased, and focused solely on the final recommendation. The result is a clear pass/review/escalate signal plus targeted critiques to drive concrete revisions.

How this skill works

The evaluator accepts only the committee's final output, a rubric identifier, and minimal subject-matter context. A fresh model instance loads the rubric, scores each criterion independently, computes a weighted total, and emits a result and critique when scores fall below threshold. Critiques are delivered back to the committee inbox to trigger revision loops or escalation to humans after a capped number of iterations.

When to use it

When you need an unbiased quality check of a committee recommendation without influence from debate dynamics.
For decisions where gaming of metrics or speaker bias is a risk.
When you require reproducible, rubric-based scoring and an auditable critique trail.
Before final acceptance to surface missing details or weak mitigations.
When you want an automated pre-escalation filter that coordinates human review only when necessary.

Best practices

Design a clear, weighted rubric with explicit criteria and thresholds.
Send only the final output text and rubric reference; exclude transcripts, identities, and discussion history.
Use short, machine-readable evaluation requests to reduce ambiguity.
Limit revision iterations (e.g., 3) and define an escalation path for unresolved gaps.
Log evaluations and critiques separately from committee artifacts for auditability.

Example use cases

Assessing vendor selection recommendations to ensure risk controls are specific.
Reviewing client engagement proposals for resource, risk, and strategic alignment.
Pre-flight quality checks of strategic decisions before board presentation.
Automating triage of proposals where only clear accept/review/escalate outcomes are desired.
Providing consistent, reproducible scoring across multiple teams or regions.

FAQ

Does the evaluator see the debate transcript?

No. The evaluator is deliberately restricted to the final output, rubric, and minimal subject context to prevent bias.

What happens if the score is just below threshold?

The evaluator generates a targeted critique listing gaps and suggestions; the committee can revise up to the configured iteration limit before escalation.

How does this prevent gaming?

By isolating evaluator reasoning and using fresh instances, committees cannot predict exact scores or tune outputs to exploit debate dynamics.