home / skills / anton-abyzov / specweave / increment-quality-judge-v2

increment-quality-judge-v2 skill

/plugins/specweave/skills/increment-quality-judge-v2

This skill assesses increment quality using LLM-as-Judge with BMAD risk scoring and formal PASS/CONCERNS/FAIL gates, delivering actionable recommendations.

This is most likely a fork of the sw-increment-quality-judge-v2 skill from openclaw
npx playbooks add skill anton-abyzov/specweave --skill increment-quality-judge-v2

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
17.7 KB
---
name: increment-quality-judge-v2
description: AI-powered quality assessment using LLM-as-Judge pattern with BMAD risk scoring and formal gate decisions. Use for evaluating increment specs, assessing task completeness, or making quality gate decisions (PASS/CONCERNS/FAIL). Chain-of-thought reasoning ensures transparent evaluation.
allowed-tools: Read, Grep, Glob
---

# Increment Quality Judge v2.0

**LLM-as-Judge Pattern Implementation**

AI-powered quality assessment using the **LLM-as-Judge** pattern - an established AI/ML evaluation technique where an LLM evaluates outputs with chain-of-thought reasoning, BMAD-pattern risk scoring, and formal quality gate decisions (PASS/CONCERNS/FAIL).

## LLM-as-Judge: What It Is

**LLM-as-Judge (LaaJ)** is a recognized pattern in AI/ML evaluation where a large language model assesses quality using structured reasoning.

```
┌─────────────────────────────────────────────────────────────┐
│                 LLM-as-Judge Pattern                        │
├─────────────────────────────────────────────────────────────┤
│  Input:  spec.md, plan.md, tasks.md                        │
│                                                             │
│  Process:                                                   │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ <thinking>                                          │   │
│  │   1. Read and understand the specification          │   │
│  │   2. Evaluate against 7 quality dimensions          │   │
│  │   3. Identify risks (P×I scoring)                   │   │
│  │   4. Form evidence-based verdict                    │   │
│  │ </thinking>                                         │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  Output: Structured verdict with:                          │
│  • Dimension scores (0-100)                                │
│  • Risk assessment (CRITICAL/HIGH/MEDIUM/LOW)              │
│  • Quality gate decision (PASS/CONCERNS/FAIL)              │
│  • Actionable recommendations                              │
└─────────────────────────────────────────────────────────────┘
```

**Why LLM-as-Judge works:**
- **Consistency**: Uniform evaluation criteria without human fatigue
- **Reasoning**: Chain-of-thought explains WHY something is an issue
- **Scalability**: Evaluates in seconds vs hours of manual review
- **Industry standard**: Used by OpenAI, Anthropic, Google for AI evals

**References:**
- "Judging LLM-as-a-Judge" (NeurIPS 2023)
- LMSYS Chatbot Arena evaluation methodology
- AlpacaEval, MT-Bench frameworks

## IMPORTANT: This is a SKILL (Not an Agent)

**DO NOT try to spawn this as an agent via Task tool.**

This is a **skill** that auto-activates when you discuss quality assessment. To run quality assessment:

```bash
# Use the CLI command directly
specweave qa 0001 --pre

# Or use the slash command
/sw:qa 0001
```

The skill provides guidance and documentation. The CLI handles execution.

**Why no agent?** Having both a skill and agent with the same name (`increment-quality-judge-v2`) caused Claude to incorrectly construct agent type names. The skill-only approach eliminates this confusion.

## What's New in v2.0

1. **Risk Assessment Dimension** - Probability × Impact scoring (0-10 scale, BMAD pattern)
2. **Quality Gate Decisions** - Formal PASS/CONCERNS/FAIL with thresholds
3. **NFR Checking** - Non-functional requirements (performance, security, scalability)
4. **Enhanced Output** - Blockers, concerns, recommendations with actionable mitigations
5. **7 Dimensions** - Added "Risk" to the existing 6 dimensions

## Purpose

Provide comprehensive quality assessment that goes beyond structural validation to evaluate:
- ✅ Specification quality (6 dimensions)
- ✅ **Risk levels (BMAD P×I scoring)** - NEW!
- ✅ **Quality gate readiness (PASS/CONCERNS/FAIL)** - NEW!

## When to Use

**Auto-activates for**:
- `/qa {increment-id}` command
- `/qa {increment-id} --pre` (pre-implementation check)
- `/qa {increment-id} --gate` (quality gate check)
- Natural language: "assess quality of increment 0001"

**Keywords**:
- validate quality, quality check, assess spec
- evaluate increment, spec review, quality score
- risk assessment, qa check, quality gate
- PASS/CONCERNS/FAIL

## Evaluation Dimensions (7 total, was 6)

```yaml
dimensions:
  clarity:
    weight: 0.18 # was 0.20
    criteria:
      - "Is the problem statement clear?"
      - "Are objectives well-defined?"
      - "Is terminology consistent?"

  testability:
    weight: 0.22 # was 0.25
    criteria:
      - "Are acceptance criteria testable?"
      - "Can success be measured objectively?"
      - "Are edge cases identifiable?"

  completeness:
    weight: 0.18 # was 0.20
    criteria:
      - "Are all requirements addressed?"
      - "Is error handling specified?"
      - "Are non-functional requirements included?"

  feasibility:
    weight: 0.13 # was 0.15
    criteria:
      - "Is the architecture scalable?"
      - "Are technical constraints realistic?"
      - "Is timeline achievable?"

  maintainability:
    weight: 0.09 # was 0.10
    criteria:
      - "Is design modular?"
      - "Are extension points identified?"
      - "Is technical debt addressed?"

  edge_cases:
    weight: 0.09 # was 0.10
    criteria:
      - "Are failure scenarios covered?"
      - "Are performance limits specified?"
      - "Are security considerations included?"

  # NEW: Risk Assessment (BMAD pattern)
  risk:
    weight: 0.11 # NEW!
    criteria:
      - "Are security risks identified and mitigated?"
      - "Are technical risks (scalability, performance) addressed?"
      - "Are implementation risks (complexity, dependencies) managed?"
      - "Are operational risks (monitoring, support) considered?"
```

## Risk Assessment (BMAD Pattern) - NEW!

### Risk Scoring Formula

```
Risk Score = Probability × Impact

Probability (0.0-1.0):
- 0.0-0.3: Low (unlikely to occur)
- 0.4-0.6: Medium (may occur)
- 0.7-1.0: High (likely to occur)

Impact (1-10):
- 1-3: Minor (cosmetic, no user impact)
- 4-6: Moderate (some impact, workaround exists)
- 7-9: Major (significant impact, no workaround)
- 10: Critical (system failure, data loss, security breach)

Final Score (0.0-10.0):
- 9.0-10.0: CRITICAL risk (FAIL quality gate)
- 6.0-8.9: HIGH risk (CONCERNS quality gate)
- 3.0-5.9: MEDIUM risk (PASS with monitoring)
- 0.0-2.9: LOW risk (PASS)
```

### Risk Categories

1. **Security Risks**
   - OWASP Top 10 vulnerabilities
   - Data exposure, authentication, authorization
   - Cryptographic failures

2. **Technical Risks**
   - Architecture complexity, scalability bottlenecks
   - Performance issues, technical debt

3. **Implementation Risks**
   - Tight timeline, external dependencies
   - Technical complexity

4. **Operational Risks**
   - Lack of monitoring, difficult to maintain
   - Poor documentation

### Risk Assessment Prompt

```markdown
You are evaluating SOFTWARE RISKS for an increment using BMAD's Probability × Impact scoring.

Read increment files:
- .specweave/increments/{id}/spec.md
- .specweave/increments/{id}/plan.md

For EACH risk you identify:

1. **Calculate PROBABILITY** (0.0-1.0)
   - Based on spec clarity, past experience, complexity
   - Low: 0.2, Medium: 0.5, High: 0.8

2. **Calculate IMPACT** (1-10)
   - 10 = Critical (security breach, data loss, system failure)
   - 7-9 = Major (significant user impact, no workaround)
   - 4-6 = Moderate (some impact, workaround exists)
   - 1-3 = Minor (cosmetic, no user impact)

3. **Calculate RISK SCORE** = Probability × Impact

4. **Provide MITIGATION** strategy

5. **Link to ACCEPTANCE CRITERIA** (if applicable)

Output format (JSON):
{
  "risks": [
    {
      "id": "RISK-001",
      "category": "security",
      "title": "Password storage not specified",
      "description": "Spec doesn't mention password hashing algorithm",
      "probability": 0.9,
      "impact": 10,
      "score": 9.0,
      "severity": "CRITICAL",
      "mitigation": "Use bcrypt or Argon2, never plain text",
      "location": "spec.md, Authentication section",
      "acceptance_criteria": "AC-US1-01"
    }
  ],
  "overall_risk_score": 7.5,
  "dimension_score": 0.35
}
```

## Quality Gate Decisions - NEW!

### Decision Logic

```typescript
enum QualityGateDecision {
  PASS = "PASS",          // Ready for production
  CONCERNS = "CONCERNS",  // Issues found, should address
  FAIL = "FAIL"           // Blockers, must fix
}

Thresholds (BMAD pattern):

FAIL if any:
- Risk score ≥ 9.0 (CRITICAL)
- Test coverage < 60%
- Spec quality < 50
- Critical security vulnerabilities ≥ 1

CONCERNS if any:
- Risk score 6.0-8.9 (HIGH)
- Test coverage < 80%
- Spec quality < 70
- High security vulnerabilities ≥ 1

PASS otherwise
```

### Output Example

```bash
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
QA ASSESSMENT: Increment 0008-user-authentication
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Overall Score: 82/100 (GOOD) ✓

Dimension Scores:
  Clarity:         90/100 ✓✓
  Testability:     75/100 ⚠️
  Completeness:    88/100 ✓
  Feasibility:     85/100 ✓
  Maintainability: 80/100 ✓
  Edge Cases:      70/100 ⚠️
  Risk Assessment: 65/100 ⚠️  (7.2/10 risk score)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
RISKS IDENTIFIED (3)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🔴 RISK-001: CRITICAL (9.0/10)
   Category: Security
   Title: Password storage implementation
   Description: Spec doesn't specify password hashing
   Probability: 0.9 (High) × Impact: 10 (Critical)
   Location: spec.md, Authentication section
   Mitigation: Use bcrypt/Argon2, never plain text
   AC: AC-US1-01

🟡 RISK-002: HIGH (6.0/10)
   Category: Security
   Title: Rate limiting not specified
   Description: No brute-force protection mentioned
   Probability: 0.6 (Medium) × Impact: 10 (Critical)
   Location: spec.md, Security section
   Mitigation: Add 5 failed attempts → 15 min lockout
   AC: AC-US1-03

🟢 RISK-003: LOW (2.4/10)
   Category: Technical
   Title: Session storage scalability
   Description: Plan uses in-memory sessions
   Probability: 0.4 (Medium) × Impact: 6 (Moderate)
   Location: plan.md, Architecture section
   Mitigation: Use Redis for session store

Overall Risk Score: 7.2/10 (MEDIUM-HIGH)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
QUALITY GATE DECISION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🟡 CONCERNS (Not Ready for Production)

Blockers (MUST FIX):
  1. 🔴 CRITICAL RISK: Password storage (Risk ≥9)
     → Add task: "Implement bcrypt password hashing"

Concerns (SHOULD FIX):
  2. 🟡 HIGH RISK: Rate limiting not specified (Risk ≥6)
     → Update spec.md: Add rate limiting section
     → Add E2E test for rate limiting

  3. ⚠️  Testability: 75/100 (target: 80+)
     → Make acceptance criteria more measurable

Recommendations (NICE TO FIX):
  4. Edge cases: 70/100
     → Add error handling scenarios
  5. Session scalability
     → Consider Redis for session store

Decision: Address 1 blocker before proceeding

Would you like to:
  [E] Export blockers to tasks.md
  [U] Update spec.md with fixes (experimental)
  [C] Continue without changes
```

## Workflow Integration

### Quick Mode (Default)

```
User: /sw:qa 0001

Step 1: Rule-based validation (120 checks) - FREE, FAST
├── If FAILED → Stop, show errors
└── If PASSED → Continue

Step 2: AI Quality Assessment (Quick)
├── Spec quality (6 dimensions)
├── Risk assessment (BMAD P×I)
└── Quality gate decision (PASS/CONCERNS/FAIL)

Output: Enhanced report with risks and gate decision
```

### Pre-Implementation Mode

```
User: /sw:qa 0001 --pre

Checks:
✅ Spec quality (clarity, testability, completeness)
✅ Risk assessment (identify issues early)
✅ Architecture review (plan.md soundness)
✅ Test strategy (test plan in tasks.md)

Gate decision before implementation starts
```

### Quality Gate Mode

```
User: /sw:qa 0001 --gate

Comprehensive checks:
✅ All pre-implementation checks
✅ Test coverage (AC-ID coverage, gaps)
✅ E2E test coverage
✅ Documentation completeness

Final gate decision before closing increment
```

## Enhanced Scoring Algorithm

### Step 1: Dimension Evaluation (7 dimensions)

For each dimension (including NEW risk dimension), use Chain-of-Thought prompting:

```markdown
<thinking>
1. Read spec.md thoroughly
2. For risk dimension specifically:
   - Identify all risks (security, technical, implementation, operational)
   - For each risk: calculate P, I, Score
   - Group by category
   - Calculate overall risk score
3. For other dimensions: evaluate criteria as before
4. Score 0.00-1.00
5. Identify issues
6. Provide suggestions
</thinking>

Score: 0.XX
```

### Step 2: Weighted Overall Score (NEW weights)

```typescript
overall_score =
  (clarity * 0.18) +
  (testability * 0.22) +
  (completeness * 0.18) +
  (feasibility * 0.13) +
  (maintainability * 0.09) +
  (edge_cases * 0.09) +
  (risk * 0.11)  // NEW!
```

### Step 3: Quality Gate Decision

```typescript
gate_decision = decide({
  spec_quality: overall_score,
  risk_score: risk_assessment.overall_risk_score,
  test_coverage: test_coverage.percentage, // if available
  security_audit: security_audit  // if available
})
```

## Token Usage

**Estimated per increment** (Quick mode):
- Small spec (<100 lines): ~2,500 tokens (~$0.025)
- Medium spec (100-250 lines): ~3,500 tokens (~$0.035)
- Large spec (>250 lines): ~5,000 tokens (~$0.050)

**Cost increase from v1.0**: +25% (added risk assessment dimension)

**Optimization**:
- Only evaluate spec.md + plan.md for risks
- Cache risk patterns for 5 min
- Skip risk assessment if spec < 50 lines (too small to assess)

## Configuration

```json
{
  "qa": {
    "qualityGateThresholds": {
      "fail": {
        "riskScore": 9.0,
        "testCoverage": 60,
        "specQuality": 50,
        "criticalVulnerabilities": 1
      },
      "concerns": {
        "riskScore": 6.0,
        "testCoverage": 80,
        "specQuality": 70,
        "highVulnerabilities": 1
      }
    },
    "dimensions": {
      "risk": {
        "enabled": true,
        "weight": 0.11
      }
    }
  }
}
```

## Migration from v1.0

**v1.0 (6 dimensions)**:
- Clarity, Testability, Completeness, Feasibility, Maintainability, Edge Cases

**v2.0 (7 dimensions, NEW: Risk)**:
- All v1.0 dimensions + Risk Assessment
- Weights adjusted to accommodate new dimension
- Quality gate decisions added
- BMAD risk scoring added

**Backward Compatibility**:
- v1.0 skills still work (auto-upgrade to v2.0 if risk assessment enabled)
- Existing scores rescaled to new weights automatically
- Can disable risk assessment in config to revert to v1.0 behavior

## Best Practices

1. **Run early and often**: Use `--pre` mode before implementation
2. **Fix blockers immediately**: Don't proceed if FAIL
3. **Address concerns before release**: CONCERNS = should fix
4. **Use risk scores to prioritize**: Fix CRITICAL risks first
5. **Export to tasks.md**: Convert blockers/concerns to actionable tasks

## Limitations

**What quality-judge v2.0 CAN'T do**:
- ❌ Understand domain-specific compliance (HIPAA, PCI-DSS)
- ❌ Verify technical feasibility with actual codebase
- ❌ Replace human expertise and security audits
- ❌ Predict actual probability without historical data

**What quality-judge v2.0 CAN do**:
- ✅ Catch vague or ambiguous language
- ✅ Identify missing security considerations (OWASP-based)
- ✅ Spot untestable acceptance criteria
- ✅ Suggest industry best practices
- ✅ Flag missing edge cases
- ✅ **Assess risks systematically (BMAD pattern)** - NEW!
- ✅ **Provide formal quality gate decisions** - NEW!

## Summary

**increment-quality-judge v2.0** adds comprehensive risk assessment and quality gate decisions:

✅ **Risk assessment** (BMAD P×I scoring, 0-10 scale)
✅ **Quality gate decisions** (PASS/CONCERNS/FAIL with thresholds)
✅ **7 dimensions** (added "Risk" to existing 6)
✅ **NFR checking** (performance, security, scalability)
✅ **Enhanced output** (blockers, concerns, recommendations)
✅ **Chain-of-thought** (LLM-as-Judge 2025 best practices)
✅ **Backward compatible** (can disable risk assessment)

**Use it when**: You want comprehensive quality assessment with risk scoring and formal gate decisions before implementation or release.

**Skip it when**: Quick iteration, tight token budget, or simple features where rule-based validation suffices.

---

**Version**: 2.0.0
**Related**: /sw:qa command, QAOrchestrator agent

Overview

This skill provides AI-powered quality assessment for increments using the LLM-as-Judge pattern with BMAD probability×impact risk scoring and formal quality gate decisions (PASS/CONCERNS/FAIL). It evaluates increment specs, plans, and tasks to produce dimension scores, risk details, and actionable recommendations. Chain-of-thought reasoning is used to make the verdicts transparent and evidence-based.

How this skill works

The skill reads spec.md and plan.md (and tasks where available), evaluates seven dimensions including a new risk dimension, and computes weighted scores. For each risk it calculates probability (0.0–1.0) and impact (1–10) to produce a BMAD risk score and severity label. Finally it applies configurable thresholds to emit a quality gate decision and a prioritized list of blockers, concerns, and mitigations.

When to use it

  • Pre-implementation checks to catch blockers before coding
  • Run quick QA for an increment during planning (/sw:qa {id} or CLI)
  • Perform a formal quality gate before closing an increment (--gate)
  • Assess task completeness and acceptance criteria effectiveness
  • Generate risk-first guidance for security, performance, and operational gaps

Best practices

  • Supply complete spec.md and plan.md so risk scoring is accurate
  • Run --pre early to fix high-severity risks before implementation
  • Treat CRITICAL risk scores (≥9.0) as blockers and address before proceeding
  • Use the exported blockers to create tasks and link acceptance criteria
  • Tune thresholds for your org (test coverage, spec quality) to match risk tolerance

Example use cases

  • Validate increment readiness for production with PASS/CONCERNS/FAIL outcome
  • Identify and prioritize security and scalability risks using BMAD scoring
  • Improve acceptance criteria and testability before handing work to engineers
  • Export blockers to tasks.md or create Jira tickets from identified issues
  • Run automated QA during CI to prevent low-quality increments reaching main branch

FAQ

What files does the skill read?

It primarily inspects .specweave/increments/{id}/spec.md and plan.md, and tasks.md when available.

What triggers a FAIL decision?

FAIL occurs when any critical condition is met, e.g., risk score ≥9.0, test coverage <60%, spec quality <50, or critical security vulnerabilities ≥1.