This skill evaluates code, designs, or approaches with a structured 0-10 score, pros/cons, and practical improvement recommendations.

Install with `npx playbooks add skill yonatangross/orchestkit --skill assess`.

---
name: assess
license: MIT
compatibility: "Claude Code 2.1.34+. Requires memory MCP server."
description: "Assesses and rates quality 0-10 with pros/cons analysis. Use when evaluating code, designs, or approaches."
context: fork
version: 1.1.0
author: OrchestKit
tags: [assessment, evaluation, quality, comparison, pros-cons, rating]
user-invocable: true
allowed-tools: [AskUserQuestion, Read, Grep, Glob, Task, TaskCreate, TaskUpdate, TaskList, mcp__memory__search_nodes, Bash]
skills: [code-review-playbook, assess-complexity, quality-gates, architecture-decision-record, memory]
argument-hint: [code-path-or-topic]
complexity: high
metadata:
  category: document-asset-creation
  mcp-server: memory
---

# Assess

Comprehensive assessment skill for answering "is this good?" with structured evaluation, scoring, and actionable recommendations.

## Quick Start

```bash
/assess backend/app/services/auth.py
/assess our caching strategy
/assess the current database schema
/assess frontend/src/components/Dashboard
```

---

## STEP 0: Verify User Intent with AskUserQuestion

**BEFORE creating tasks**, clarify assessment dimensions:

```python
AskUserQuestion(
  questions=[{
    "question": "What dimensions to assess?",
    "header": "Dimensions",
    "options": [
      {"label": "Full assessment (Recommended)", "description": "All dimensions: quality, maintainability, security, performance"},
      {"label": "Code quality only", "description": "Readability, complexity, best practices"},
      {"label": "Security focus", "description": "Vulnerabilities, attack surface, compliance"},
      {"label": "Quick score", "description": "Just give me a 0-10 score with brief notes"}
    ],
    "multiSelect": false
  }]
)
```

**Based on answer, adjust workflow:**
- **Full assessment**: All 7 phases, parallel agents
- **Code quality only**: Skip security and performance phases
- **Security focus**: Prioritize security-auditor agent
- **Quick score**: Single pass, brief output
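
A minimal sketch of how the selected option might drive the rest of the workflow (the `WORKFLOWS` table and the `answer` variable are illustrative assumptions; `dimensions` feeds the mode check in STEP 0b):

```python
# Illustrative mapping from the AskUserQuestion answer to workflow settings.
# Labels must match the options above; skipped dimensions follow the list above.
WORKFLOWS = {
    "Full assessment (Recommended)": {"phases": "all", "parallel_agents": True},
    "Code quality only": {"skip_dimensions": ["security", "performance"], "parallel_agents": True},
    "Security focus": {"priority_agent": "security-auditor", "parallel_agents": True},
    "Quick score": {"phases": ["Rate quality"], "parallel_agents": False},
}

workflow = WORKFLOWS[answer]  # `answer` is the label chosen in STEP 0
dimensions = "full" if answer == "Full assessment (Recommended)" else "partial"
```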

---

## STEP 0b: Select Orchestration Mode

Choose **Agent Teams** (mesh — assessors cross-validate scores) or **Task tool** (star — all report to lead):

```python
import os
teams_available = os.environ.get("CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS") is not None
force_task_tool = os.environ.get("ORCHESTKIT_FORCE_TASK_TOOL") == "1"

if force_task_tool or not teams_available:
    mode = "task_tool"
else:
    # Teams available — use for full multi-dimensional assessment
    mode = "agent_teams" if dimensions == "full" else "task_tool"
```

1. `CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS` set → **Agent Teams mode** (for full assessment)
2. Flag not set → **Task tool mode** (default)
3. Quick score or single-dimension → **Task tool** (regardless of flag)

| Aspect | Task Tool | Agent Teams |
|--------|-----------|-------------|
| Score calibration | Lead normalizes independently | Assessors discuss disagreements |
| Cross-dimension findings | Lead correlates after completion | Security assessor alerts performance assessor of overlap |
| Cost | ~200K tokens | ~500K tokens |
| Best for | Quick scores, single dimension | Full multi-dimensional assessment |

> **Context window:** For full codebase assessments (>20 files), use the 1M context window to avoid agent context exhaustion. On 200K context, the scope discovery in Phase 1.5 limits files to prevent overflow.

> **Fallback:** If Agent Teams encounters issues, fall back to Task tool for remaining assessment.

---

## Task Management (CC 2.1.16)

```python
# Create main assessment task
TaskCreate(
  subject="Assess: {target}",
  description="Comprehensive evaluation with quality scores and recommendations",
  activeForm="Assessing {target}"
)

# Create subtasks for the 7-phase process
phases = [
    ("Understand target", "Understanding target"),
    ("Rate quality", "Rating quality"),
    ("List pros/cons", "Listing pros/cons"),
    ("Compare alternatives", "Comparing alternatives"),
    ("Generate suggestions", "Generating suggestions"),
    ("Estimate effort", "Estimating effort"),
    ("Compile report", "Compiling report"),
]
for subject, active_form in phases:
    TaskCreate(subject=subject, activeForm=active_form)
```

---

## What This Skill Answers

| Question | How It's Answered |
|----------|-------------------|
| "Is this good?" | Quality score 0-10 with reasoning |
| "What are the trade-offs?" | Structured pros/cons list |
| "Should we change this?" | Improvement suggestions with effort |
| "What are the alternatives?" | Comparison with scores |
| "Where should we focus?" | Prioritized recommendations |

---

## Workflow Overview

| Phase | Activities | Output |
|-------|------------|--------|
| **1. Target Understanding** | Read code/design, identify scope | Context summary |
| **2. Quality Rating** | 6-dimension scoring (0-10) | Scores with reasoning |
| **3. Pros/Cons Analysis** | Strengths and weaknesses | Balanced evaluation |
| **4. Alternative Comparison** | Score alternatives | Comparison matrix |
| **5. Improvement Suggestions** | Actionable recommendations | Prioritized list |
| **6. Effort Estimation** | Time and complexity estimates | Effort breakdown |
| **7. Assessment Report** | Compile findings | Final report |

---

## Phase 1: Target Understanding

Identify what's being assessed (code, design, approach, decision, pattern) and gather context:

```python
# PARALLEL - Gather context
Read(file_path="$ARGUMENTS")  # If file path
Grep(pattern="$ARGUMENTS", output_mode="files_with_matches")
mcp__memory__search_nodes(query="$ARGUMENTS")  # Past decisions
```

---

## Phase 1.5: Scope Discovery (CRITICAL — prevents context exhaustion)

**Before spawning any agents**, build a bounded file list. Agents that receive unbounded targets will exhaust their context windows reading the entire codebase.

```python
# 1. Discover target files
if is_file(target):
    scope_files = [target]
elif is_directory(target):
    scope_files = Glob(f"{target}/**/*.{{py,ts,tsx,js,jsx,go,rs,java}}")
else:
    # Concept/topic — search for relevant files
    scope_files = Grep(pattern=target, output_mode="files_with_matches", head_limit=50)

# 2. Apply limits — MAX 30 files for agent assessment
MAX_FILES = 30
total_files = len(scope_files)
if total_files > MAX_FILES:
    # Prioritize: entry points, configs, security-critical, then sample rest
    # Skip: test files (except for testability agent), generated files, vendor/
    prioritized = prioritize_files(scope_files)  # entry points first
    scope_files = prioritized[:MAX_FILES]
    # Tell user about sampling (report the pre-truncation count)
    print(f"Target has {total_files} files. Sampling {MAX_FILES} representative files.")

# 3. Format as file list string for agent prompts
file_list = "\n".join(f"- {f}" for f in scope_files)
```

**Sampling priorities** (when >30 files):
1. Entry points (main, index, app, server)
2. Config files (settings, env, config)
3. Security-sensitive (auth, middleware, api routes)
4. Core business logic (services, models, domain)
5. Representative samples from remaining directories
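
A minimal sketch of a `prioritize_files` helper that applies this ordering (the path patterns are illustrative assumptions, not part of the skill):

```python
import re

# Priority buckets, highest first; patterns are illustrative and codebase-specific.
PRIORITY_PATTERNS = [
    re.compile(r"(^|/)(main|index|app|server)\.\w+$"),            # 1. entry points
    re.compile(r"(settings|config|env)", re.IGNORECASE),          # 2. config files
    re.compile(r"(auth|middleware|api|routes?)", re.IGNORECASE),  # 3. security-sensitive
    re.compile(r"(services?|models?|domain)", re.IGNORECASE),     # 4. core business logic
]
SKIP_PATTERN = re.compile(r"(test_|\.test\.|/tests?/|/vendor/|/generated/)")

def prioritize_files(files: list[str]) -> list[str]:
    """Order files by assessment priority; callers keep the first MAX_FILES."""
    candidates = [f for f in files if not SKIP_PATTERN.search(f)]
    ordered: list[str] = []
    for pattern in PRIORITY_PATTERNS:
        ordered += [f for f in candidates if pattern.search(f) and f not in ordered]
    # 5. Representative samples from whatever remains
    ordered += [f for f in candidates if f not in ordered]
    return ordered
```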

---

## Phase 2: Quality Rating (6 Dimensions)

Rate each dimension 0-10 with weighted composite score. See [Scoring Rubric](references/scoring-rubric.md) for details.

| Dimension | Weight | What It Measures |
|-----------|--------|------------------|
| Correctness | 0.20 | Does it work correctly? |
| Maintainability | 0.20 | Easy to understand/modify? |
| Performance | 0.15 | Efficient, no bottlenecks? |
| Security | 0.15 | Follows best practices? |
| Scalability | 0.15 | Handles growth? |
| Testability | 0.15 | Easy to test? |

**Composite Score:** Weighted average of all dimensions.
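
A minimal sketch of the composite calculation using the weights above (the helper and example scores are illustrative):

```python
WEIGHTS = {
    "correctness": 0.20, "maintainability": 0.20, "performance": 0.15,
    "security": 0.15, "scalability": 0.15, "testability": 0.15,
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted average of the six dimension scores (each 0-10)."""
    return round(sum(WEIGHTS[d] * scores[d] for d in WEIGHTS), 1)

# Example: 8, 8, 6, 6, 6, 6 (in the order above) gives a composite of 6.8
```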

Launch parallel agents with `run_in_background=True`. **Always include the scoped file list from Phase 1.5** in every agent prompt — agents without scope constraints will exhaust their context windows.

### Phase 2 — Task Tool Mode (Default)

For each dimension, spawn a background agent with **scope constraints**:

```python
for dimension, agent_type in [
    ("CORRECTNESS + MAINTAINABILITY", "code-quality-reviewer"),
    ("SECURITY", "security-auditor"),
    ("PERFORMANCE + SCALABILITY", "python-performance-engineer"),  # Use python-performance-engineer for backend; frontend-performance-engineer for frontend
    ("TESTABILITY", "test-generator"),
]:
    Task(subagent_type=agent_type, run_in_background=True, max_turns=15,
         prompt=f"""Assess {dimension} (0-10) for: {target}

## Scope Constraint
ONLY read and analyze the following {len(scope_files)} files — do NOT explore beyond this list:
{file_list}

Budget: Use at most 15 tool calls. Read files from the list above, then produce your score
with reasoning, evidence, and 2-3 specific improvement suggestions.
Do NOT use Glob or Grep to discover additional files.""")
```

Then collect results from all agents and proceed to Phase 3.

### Phase 2 — Agent Teams Alternative

In Agent Teams mode, form an assessment team where dimension assessors cross-validate scores and discuss disagreements:

```python
TeamCreate(team_name="assess-{target-slug}", description="Assess {target}")

# SCOPE CONSTRAINT (injected into every agent prompt):
SCOPE_INSTRUCTIONS = f"""
## Scope Constraint
ONLY read and analyze the following {len(scope_files)} files — do NOT explore beyond this list:
{file_list}

Budget: Use at most 15 tool calls. Read files from the list above, then score.
Do NOT use Glob or Grep to discover additional files.
"""

Task(subagent_type="code-quality-reviewer", name="correctness-assessor",
     team_name="assess-{target-slug}", max_turns=15,
     prompt=f"""Assess CORRECTNESS (0-10) and MAINTAINABILITY (0-10) for: {target}
     {SCOPE_INSTRUCTIONS}
     When you find issues that affect security, message security-assessor.
     When you find issues that affect performance, message perf-assessor.
     Share your scores with all teammates for calibration — if scores diverge
     significantly (>2 points), discuss the disagreement.""")

Task(subagent_type="security-auditor", name="security-assessor",
     team_name="assess-{target-slug}", max_turns=15,
     prompt=f"""Assess SECURITY (0-10) for: {target}
     {SCOPE_INSTRUCTIONS}
     When correctness-assessor flags security-relevant patterns, investigate deeper.
     When you find performance-impacting security measures, message perf-assessor.
     Share your score and flag any cross-dimension trade-offs.""")

Task(subagent_type="python-performance-engineer", name="perf-assessor",  # or frontend-performance-engineer for frontend
     team_name="assess-{target-slug}", max_turns=15,
     prompt=f"""Assess PERFORMANCE (0-10) and SCALABILITY (0-10) for: {target}
     {SCOPE_INSTRUCTIONS}
     When security-assessor flags performance trade-offs, evaluate the impact.
     When you find testability issues (hard-to-benchmark code), message test-assessor.
     Share your scores with reasoning for the composite calculation.""")

Task(subagent_type="test-generator", name="test-assessor",
     team_name="assess-{target-slug}", max_turns=15,
     prompt=f"""Assess TESTABILITY (0-10) for: {target}
     {SCOPE_INSTRUCTIONS}
     Evaluate test coverage, test quality, and ease of testing.
     When other assessors flag dimension-specific concerns, verify test coverage
     for those areas. Share your score and any coverage gaps found.""")
```

**Team teardown** after report compilation:
```python
SendMessage(type="shutdown_request", recipient="correctness-assessor", content="Assessment complete")
SendMessage(type="shutdown_request", recipient="security-assessor", content="Assessment complete")
SendMessage(type="shutdown_request", recipient="perf-assessor", content="Assessment complete")
SendMessage(type="shutdown_request", recipient="test-assessor", content="Assessment complete")
TeamDelete()
```

> **Fallback — Team Formation Failure:** If team formation fails, use standard Phase 2 Task spawns above.
>
> **Fallback — Context Exhaustion:** If agents hit "Context limit reached" before returning scores, collect whatever partial results were produced, then score remaining dimensions yourself using the scoped file list from Phase 1.5. Do NOT re-spawn agents — assess the remaining dimensions inline and proceed to Phase 3.

---

## Phase 3: Pros/Cons Analysis

```markdown
## Pros (Strengths)
| # | Strength | Impact | Evidence |
|---|----------|--------|----------|
| 1 | [strength] | High/Med/Low | [example] |

## Cons (Weaknesses)
| # | Weakness | Severity | Evidence |
|---|----------|----------|----------|
| 1 | [weakness] | High/Med/Low | [example] |

**Net Assessment:** [Strengths outweigh / Balanced / Weaknesses dominate]
**Recommended action:** [Keep as-is / Improve / Reconsider / Rewrite]
```

---

## Phase 4: Alternative Comparison

See [Alternative Analysis](references/alternative-analysis.md) for full comparison template.

| Criteria | Current | Alternative A | Alternative B |
|----------|---------|---------------|---------------|
| Composite | [N.N] | [N.N] | [N.N] |
| Migration Effort | N/A | [1-5] | [1-5] |

---

## Phase 5: Improvement Suggestions

See [Improvement Prioritization](references/improvement-prioritization.md) for effort/impact guidelines.

| Suggestion | Effort (1-5) | Impact (1-5) | Priority (I/E) |
|------------|--------------|--------------|----------------|
| [action] | [N] | [N] | [ratio] |

**Quick Wins** = Effort <= 2 AND Impact >= 4. Always highlight these first.
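
A minimal sketch of the Priority (I/E) math and the quick-win filter (helper names are illustrative):

```python
def priority(impact: int, effort: int) -> float:
    """Impact/effort ratio shown in the Priority (I/E) column (higher is better)."""
    return round(impact / effort, 2)

def is_quick_win(impact: int, effort: int) -> bool:
    """Quick win: cheap (effort <= 2) and high payoff (impact >= 4)."""
    return effort <= 2 and impact >= 4

# Example: impact 5, effort 2 -> priority 2.5, flagged as a quick win
```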

---

## Phase 6: Effort Estimation

| Timeframe | Tasks | Total |
|-----------|-------|-------|
| Quick wins (< 1hr) | [list] | X min |
| Short-term (< 1 day) | [list] | X hrs |
| Medium-term (1-3 days) | [list] | X days |

---

## Phase 7: Assessment Report

See [Scoring Rubric](references/scoring-rubric.md) for full report template.

```markdown
# Assessment Report: $ARGUMENTS

**Overall Score: [N.N]/10** (Grade: [A+/A/B/C/D/F])

**Verdict:** [EXCELLENT | GOOD | ADEQUATE | NEEDS WORK | CRITICAL]

## Answer: Is This Good?
**[YES / MOSTLY / SOMEWHAT / NO]**
[Reasoning]
```

---

## Grade Interpretation

| Score | Grade | Verdict |
|-------|-------|---------|
| 9.0-10.0 | A+ | EXCELLENT |
| 8.0-8.9 | A | GOOD |
| 7.0-7.9 | B | GOOD |
| 6.0-6.9 | C | ADEQUATE |
| 5.0-5.9 | D | NEEDS WORK |
| 0.0-4.9 | F | CRITICAL |
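
A minimal sketch of the score-to-grade mapping implied by this table (the helper is illustrative):

```python
def grade(score: float) -> tuple[str, str]:
    """Map a 0-10 composite score to (grade, verdict) per the table above."""
    bands = [
        (9.0, "A+", "EXCELLENT"),
        (8.0, "A", "GOOD"),
        (7.0, "B", "GOOD"),
        (6.0, "C", "ADEQUATE"),
        (5.0, "D", "NEEDS WORK"),
    ]
    for cutoff, letter, verdict in bands:
        if score >= cutoff:
            return letter, verdict
    return "F", "CRITICAL"
```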

---

## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| 6 dimensions | Comprehensive coverage | All quality aspects without overwhelming |
| 0-10 scale | Industry standard | Easy to understand and compare |
| Parallel assessment | 4 agents (6 dimensions) | Fast, thorough evaluation |
| Effort/Impact scoring | 1-5 scale | Simple prioritization math |

---

## Rules Quick Reference

| Rule | Impact | What It Covers |
|------|--------|----------------|
| [complexity-metrics](rules/complexity-metrics.md) | HIGH | 7-criterion scoring (1-5), complexity levels, thresholds |
| [complexity-breakdown](rules/complexity-breakdown.md) | HIGH | Task decomposition strategies, risk assessment |

## Related Skills

- `assess-complexity` - Task complexity assessment
- `verify` - Post-implementation verification
- `code-review-playbook` - Code review patterns
- `quality-gates` - Quality gate patterns

---

**Version:** 1.1.0 (February 2026)

---

## Overview

This skill assesses and rates the quality of code, designs, or approaches on a 0-10 scale and delivers balanced pros/cons analysis with actionable recommendations. It structures the assessment across multiple phases to produce a prioritized improvement plan and effort estimates. Use it to get a concise verdict plus evidence-backed remediation steps.

## How this skill works

The workflow first clarifies which dimensions to evaluate (full assessment, code quality only, security focus, or quick score) and bounds the assessment scope to avoid context exhaustion. It discovers relevant files, samples when necessary, then runs parallel or single-pass reviewers for dimensions such as correctness, maintainability, performance, security, scalability, and testability. Results are synthesized into scores, pros/cons, alternatives, suggested fixes, and time estimates.

## When to use it

- Evaluating a repository file or folder before a refactor or release
- Reviewing architecture or design choices for trade-offs
- Auditing security-sensitive code or authentication flows
- Getting a quick 0-10 quality score with brief notes
- Prioritizing technical debt and actionable remediation steps

## Best practices

- Provide a clear target string or file list so scope discovery can sample correctly
- Pick "Full assessment" only when you can afford the higher token and time costs
- Limit scope to 30 files or fewer for thorough automated review; prioritize entry points and configs
- Use Agent Teams for multi-dimensional cross-checks; use the Task tool for quick scores
- Answer the clarification questions about dimensions to get targeted recommendations

## Example use cases

- Assess backend/app/services/auth.py for correctness, security, and testability before deployment
- Evaluate a frontend component folder for maintainability and performance regressions
- Compare the current caching strategy against two alternatives and get migration effort estimates
- Audit a database schema for scalability and get prioritized migration steps

## FAQ

### How is the composite score calculated?

As a weighted average across six dimensions (correctness, maintainability, performance, security, scalability, testability), using predefined weights to produce a 0-10 composite.

### What if the target has more than 30 files?

The skill prioritizes files (entry points, configs, security-sensitive code, core logic) and samples up to 30, notifying you of the sampling so you can refine the scope.