
This skill audits Claude skills for quality and efficiency, delivering prioritized improvements to boost performance and reduce token usage.

npx playbooks add skill athola/claude-night-market --skill skills-eval


SKILL.md
---
name: skills-eval
description: Evaluate and improve Claude skill quality through auditing. Use when
  reviewing skill quality, preparing skills for production, or auditing existing skills.
  Do not use when creating new skills (use modular-skills) or writing prose (use writing-clearly-and-concisely).
  Use this skill before shipping any skill to production.
category: skill-management
tags:
- evaluation
- improvement
- skills
- optimization
- quality-assurance
- tool-use
- performance-metrics
dependencies:
- modular-skills
- performance-optimization
tools:
- skills-auditor
- improvement-suggester
- compliance-checker
- tool-performance-analyzer
- token-usage-tracker
provides:
  infrastructure:
  - evaluation-framework
  - quality-assurance
  - improvement-planning
  patterns:
  - skill-analysis
  - token-optimization
  - modular-design
  sdk_features:
  - agent-sdk-compatibility
  - advanced-metrics
  - dynamic-discovery
estimated_tokens: 1800
usage_patterns:
- skill-audit
- quality-assessment
- improvement-planning
- skills-inventory
- tool-performance-evaluation
- dynamic-discovery-optimization
- advanced-tool-use-analysis
- programmatic-calling-efficiency
- context-preservation-quality
- token-efficiency-optimization
- modular-architecture-validation
- integration-testing
- compliance-reporting
- performance-benchmarking
complexity: advanced
evaluation_criteria:
  structure_compliance: 25
  metadata_quality: 20
  token_efficiency: 25
  tool_integration: 20
  claude_sdk_compliance: 10
---
# Skills Evaluation and Improvement

## Table of Contents

1. [Overview](#overview)
2. [Quick Start](#quick-start)
3. [Evaluation Workflow](#evaluation-workflow)
4. [Evaluation and Optimization](#evaluation-and-optimization)
5. [Resources](#resources)

## Overview

This framework audits Claude skills against quality standards to improve performance and reduce token consumption. Automated tools analyze skill structure, measure context usage, and identify specific technical improvements. Run verification commands after each audit to confirm fixes work correctly.

The `skills-auditor` analyzes structure, the `improvement-suggester` ranks fixes by impact, the `compliance-checker` verifies standards compliance, and the `tool-performance-analyzer` and `token-usage-tracker` monitor runtime efficiency.

## Quick Start

### Basic Audit
Run a full audit of all skills or target a specific file to identify structural issues.
```bash
# Audit all skills
make audit-all

# Audit specific skill
make audit-skill TARGET=path/to/skill/SKILL.md
```

### Analysis and Optimization
Use `skill_analyzer.py` for complexity checks and `token_estimator.py` to verify the context budget.
```bash
make analyze-skill TARGET=path/to/skill/SKILL.md
make estimate-tokens TARGET=path/to/skill/SKILL.md
```

### Improvements
Generate a prioritized plan and verify standards compliance using `improvement_suggester.py` and `compliance_checker.py`.
```bash
make improve-skill TARGET=path/to/skill/SKILL.md
make check-compliance TARGET=path/to/skill/SKILL.md
```

## Evaluation Workflow

Start with `make audit-all` to inventory skills and identify high-priority targets. For each skill requiring attention, run analysis with `analyze-skill` to map complexity. Generate an improvement plan, apply fixes, and run `check-compliance` to verify the skill meets project standards. Finalize by checking the token budget for efficiency.
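The pass above can be sketched as a dry run. The `run` helper only prints each command, so the ordering is visible without the project Makefile present; the `TARGET` path is illustrative, not a real skill in this repository.

```bash
# Dry-run sketch of one full evaluation pass; `run` echoes each command.
# Replace the body of `run` with "$@" to execute for real.
TARGET="skills/example-skill/SKILL.md"   # hypothetical path
run() { echo "+ $*"; }

run make audit-all                          # inventory and rank all skills
run make analyze-skill TARGET="$TARGET"     # map complexity hotspots
run make improve-skill TARGET="$TARGET"     # generate the prioritized plan
# ...apply the suggested fixes, then verify:
run make check-compliance TARGET="$TARGET"  # confirm standards compliance
run make estimate-tokens TARGET="$TARGET"   # confirm the token budget
```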

## Evaluation and Optimization

Quality assessments use the `skills-auditor` and `improvement-suggester` to generate detailed reports. Performance analysis focuses on token efficiency through the `token-usage-tracker` and tool performance via `tool-performance-analyzer`. For standards compliance, the `compliance-checker` automates common fixes for structural issues.

### Scoring and Prioritization

We evaluate skills across five dimensions, weighted per the `evaluation_criteria` frontmatter: structure compliance, metadata quality, token efficiency, tool integration, and Claude SDK compliance. Scores above 90 represent production-ready skills, while scores below 50 indicate critical issues requiring immediate attention.

Improvements are prioritized by impact. Critical issues include security vulnerabilities or broken functionality. High-priority items cover structural flaws that hinder discoverability. Medium and low priorities focus on best practices and minor optimizations.
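As a rough sketch of how a weighted score combines, the snippet below uses the weights from the skill's `evaluation_criteria` frontmatter; the individual dimension scores are made-up examples, not real audit output.

```bash
# Hypothetical dimension scores (0-100); weights from evaluation_criteria.
structure=90; metadata=80; tokens=85; tools=70; sdk=95
score=$(( (structure*25 + metadata*20 + tokens*25 + tools*20 + sdk*10) / 100 ))
echo "weighted score: $score"    # -> weighted score: 83
if   [ "$score" -ge 90 ]; then echo "production-ready"
elif [ "$score" -lt 50 ]; then echo "critical: fix immediately"
else echo "needs improvement"; fi
```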

### Structural Patterns

**Deprecated**: `skills/shared/modules/` directories. Shared modules must be relocated into the consuming skill's own `modules/` directory. The evaluator flags any remaining `skills/shared/` as a structural warning.

**Current**: Each skill owns its modules at `skills/<skill-name>/modules/`. Cross-skill references use relative paths (e.g., `../skill-authoring/modules/anti-rationalization.md`).
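A quick manual check for the deprecated layout might look like the following; it mirrors the evaluator's structural warning rather than its exact implementation, and assumes it is run from the repository root.

```bash
# Flag any remaining skills/shared/modules/ directories (deprecated layout).
deprecated=$(find skills -type d -path '*/shared/modules' 2>/dev/null)
if [ -n "$deprecated" ]; then
  echo "WARN: relocate shared modules into each consuming skill:"
  echo "$deprecated"
else
  echo "OK: no skills/shared/modules/ directories found"
fi
```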

## Resources

### Shared Modules: Cross-Skill Patterns
- **Anti-Rationalization Patterns**: See [anti-rationalization.md](../skill-authoring/modules/anti-rationalization.md)
- **Enforcement Language**: See [workflow-patterns.md](../shared-patterns/modules/workflow-patterns.md)
- **Trigger Patterns**: See [evaluation-criteria.md](modules/evaluation-criteria.md)

### Skill-Specific Modules
- **Trigger Isolation Analysis**: See `modules/trigger-isolation-analysis.md`
- **Skill Authoring Best Practices**: See `modules/skill-authoring-best-practices.md`
- **Authoring Checklist**: See `modules/authoring-checklist.md`
- **Evaluation Workflows**: See `modules/evaluation-workflows.md`
- **Quality Metrics**: See `modules/quality-metrics.md`
- **Advanced Tool Use Analysis**: See `modules/advanced-tool-use-analysis.md`
- **Evaluation Framework**: See `modules/evaluation-framework.md`
- **Integration Patterns**: See `modules/integration.md`
- **Troubleshooting**: See `modules/troubleshooting.md`
- **Pressure Testing**: See `modules/pressure-testing.md`
- **Integration Testing**: See `modules/integration-testing.md`
- **Multi-Metric Evaluation**: See `modules/multi-metric-evaluation-methodology.md`
- **Performance Benchmarking**: See `modules/performance-benchmarking.md`

### Tools and Automation
- **Tools**: Executable analysis utilities in `scripts/` directory.
- **Automation**: Setup and validation scripts in `scripts/automation/`.

Overview

This skill audits Claude skills against a consistent quality standard to improve reliability, reduce token consumption, and prepare skills for production. It combines structural analysis, token-efficiency checks, and prioritized improvement suggestions to guide fixes. Use it as a final gate before shipping skills to production.

How this skill works

The evaluator runs automated structural checks, complexity analysis, and token-usage estimation across a skill’s files and modules. It ranks findings by impact and produces a prioritized improvement plan. After fixes are applied, compliance checks and runtime performance verifications confirm the issues are resolved.

When to use it

  • Before releasing a skill to production to confirm readiness.
  • During audits of existing skills to identify regressions or technical debt.
  • When optimizing token usage or reducing runtime costs.
  • When reviewing cross-skill module structure and integration patterns.
  • To validate fixes after CI or manual code changes.

Best practices

  • Run a full audit to inventory issues, then focus remediations by impact score.
  • Measure token budgets early and re-check after changes to prompts or tool calls.
  • Keep modules owned by the consuming skill to avoid shared-module drift.
  • Prioritize security and broken-functionality fixes above cosmetic changes.
  • Re-run compliance and performance checks after each non-trivial change.

Example use cases

  • Audit a newly completed skill to ensure it meets structure and token-efficiency standards before deployment.
  • Identify and remediate a high token-consumption prompt sequence that raises runtime costs.
  • Detect deprecated shared-module references and relocate modules into the consuming skill.
  • Generate a prioritized plan to fix reliability issues discovered during integration testing.
  • Verify that tool integrations meet activation reliability and performance thresholds after refactoring.

FAQ

Can I use this to create a new skill from scratch?

No. This skill is designed for evaluation and improvement. Use a dedicated authoring workflow, such as the `modular-skills` skill, for initial creation.

What are the most critical failure categories?

Security vulnerabilities, broken functionality, and structural defects that hinder discoverability are the highest priority.