home / skills / oimiragieo / agent-studio / response-rater

response-rater skill

This skill rates responses and plans against quality rubrics, delivering scores, feedback, and actionable improvement suggestions.

npx playbooks add skill oimiragieo/agent-studio --skill response-rater

Review the files below or copy the command above to add this skill to your agents.

Files (11)

SKILL.md

4.5 KB

---
name: response-rater
description: Rates responses and plans against quality rubrics. Used for plan validation, response quality audits, and multi-agent consensus.
version: 2.0
model: sonnet
invoked_by: both
user_invocable: true
tools: [Read, Write, Edit, Bash, Glob, Grep]
best_practices:
  - Use consistent rubric dimensions
  - Require minimum scores for approval
  - Document improvement suggestions
  - Track scores over time
error_handling: graceful
streaming: supported
---

# Response Rater Skill

<identity>
Response Rater - Rates responses and plans against quality rubrics. Provides scores, feedback, and improvement suggestions.
</identity>

<capabilities>
- Rating responses against rubrics
- Validating plan quality
- Providing improvement feedback
- Generating quality reports
</capabilities>

<instructions>
<execution_process>

### Step 1: Define Rating Rubric

Use appropriate rubric for the content type:

**For Plans**:
| Dimension | Weight | Description |
|-----------|--------|-------------|
| Completeness | 20% | All required sections present |
| Feasibility | 20% | Plan is realistic and achievable |
| Risk Mitigation | 20% | Risks identified with mitigations |
| Agent Coverage | 20% | Appropriate agents assigned |
| Integration | 20% | Fits with existing systems |

**For Responses**:
| Dimension | Weight | Description |
|-----------|--------|-------------|
| Correctness | 25% | Technically accurate |
| Completeness | 25% | Addresses all requirements |
| Clarity | 25% | Easy to understand |
| Actionability | 25% | Provides clear next steps |

### Step 2: Evaluate Each Dimension

Score each dimension 1-10:

```markdown
## Dimension Scores

### Completeness: 8/10

- Has objectives, steps, and timeline
- Missing risk assessment section

### Feasibility: 7/10

- Most steps are achievable
- Step 3 timeline is aggressive

### Risk Mitigation: 5/10

- Only 1 risk identified
- No mitigation strategies

### Agent Coverage: 9/10

- All steps have assigned agents
- Good agent-task matching

### Integration: 8/10

- Uses existing APIs
- Minor compatibility concerns
```

### Step 3: Calculate Overall Score

Average the weighted scores:

```
Overall = (8×0.2) + (7×0.2) + (5×0.2) + (9×0.2) + (8×0.2) = 7.4/10
```

### Step 4: Generate Recommendations

Provide actionable improvements:

```markdown
## Recommendations

### High Priority

1. Add risk assessment section with 3-5 risks
2. Include mitigation strategies for each risk

### Medium Priority

3. Extend Step 3 timeline by 2 days
4. Add fallback plan for external API dependency

### Low Priority

5. Add success metrics for each step
```

### Step 5: Make Pass/Fail Decision

Apply minimum score thresholds:

| Task Type  | Minimum Score |
| ---------- | ------------- |
| Standard   | 7/10          |
| Enterprise | 8/10          |
| Critical   | 9/10          |

```markdown
## Decision

**Score**: 7.4/10
**Threshold**: 7/10 (Standard)
**Result**: PASS ✅

Plan approved with recommendations for improvement.
```

</execution_process>

<best_practices>

1. **Consistent Rubrics**: Use same dimensions for comparable content
2. **Clear Thresholds**: Define pass/fail criteria upfront
3. **Actionable Feedback**: Provide specific improvements
4. **Prioritize Issues**: Order recommendations by impact
5. **Document Rationale**: Explain score justifications

</best_practices>
</instructions>

<examples>
<usage_example>
**Plan Rating Request**:

```
Rate this plan for user authentication implementation
```

**Response**:

```markdown
## Plan Rating: User Authentication

### Scores

- Completeness: 8/10 - Good structure, missing tests section
- Feasibility: 9/10 - Realistic timeline and approach
- Risk Mitigation: 6/10 - Only 1 risk identified
- Agent Coverage: 9/10 - Good agent assignments
- Integration: 8/10 - Compatible with existing auth

### Overall: 8.0/10 ✅ PASS

### Top Recommendations

1. Add risk section with security and dependency risks
2. Include test plan for each authentication flow
3. Add rollback procedure for failed deployment
```

</usage_example>
</examples>

## Rules

- Always use consistent rubric dimensions
- Provide specific, actionable recommendations
- Document score justifications

## Memory Protocol (MANDATORY)

**Before starting:**

```bash
cat .claude/context/memory/learnings.md
```

**After completing:**

- New pattern -> `.claude/context/memory/learnings.md`
- Issue found -> `.claude/context/memory/issues.md`
- Decision made -> `.claude/context/memory/decisions.md`

> ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.

Overview

This skill rates responses and plans against defined quality rubrics and produces scores, justification, and prioritized improvement suggestions. It supports plan validation, response quality audits, and multi-agent consensus by turning qualitative review into actionable metrics. The output includes per-dimension scores, a weighted overall score, recommendations, and a pass/fail decision against configurable thresholds.

How this skill works

Selects a rubric appropriate to the content type (plan or response) and scores each rubric dimension on a 1–10 scale with short justifications. It computes a weighted overall score, compares it to defined thresholds, and generates prioritized, actionable recommendations. The skill formats a clear decision (pass/fail) and a short report that highlights high-priority fixes and low-effort improvements.

When to use it

Validating proposed execution plans before approval or handoff
Auditing agent or human-generated responses for quality and consistency
Establishing multi-agent consensus by comparing independent outputs
Creating repeatable QA processes for responses and plans
Measuring improvement over time with consistent scoring

Best practices

Use the same rubric dimensions for comparable content to ensure consistent scoring
Define clear pass/fail thresholds for each task criticality level upfront
Provide short, prioritized recommendations—high, medium, low—to guide fixes
Document the rationale for each dimension score to make reviews auditable
Include specific, measurable actions (e.g., add tests, extend timeline by X days)

Example use cases

Rate a deployment plan for a new authentication service to identify missing tests and rollback steps
Audit agent responses in a customer support pipeline for correctness, clarity, and actionability
Compare competing plans from multiple teams and produce a single ranked recommendation
Run periodic quality reports to track improvements after remediation
Validate responses before publishing in high-stakes or regulated contexts

FAQ

What rubrics are used?

Plans use dimensions like Completeness, Feasibility, Risk Mitigation, Agent Coverage, and Integration. Responses use Correctness, Completeness, Clarity, and Actionability.

How is the overall score calculated?

Each dimension is scored 1–10 and combined using the rubric's weights to compute a weighted average overall score.

How are recommendations prioritized?

Recommendations are ordered by impact and urgency into High, Medium, and Low priority, with concrete next steps for each item.