This skill evaluates a codebase for autonomous agent readiness by auditing environment isolation, interfaces, state management, and observability.
Add this skill to your agents with `npx playbooks add skill petekp/claude-code-setup --skill agent-ready-eval`.
---
name: agent-ready-eval
description: Evaluate a codebase for agent-friendliness based on autonomous agent best practices. Use when asked to "evaluate for agents", "check agent readiness", "audit for autonomous execution", "assess agent-friendliness", or when reviewing infrastructure for unattended agent operation. Also use when asked about making a codebase more suitable for AI agents or autonomous workflows.
---
# Agent-Ready Evaluation
Evaluate how well a codebase supports autonomous agent execution based on the "How to Get Out of Your Agent's Way" principles.
## Core Philosophy
Autonomous agents fail for predictable reasons, and most of those failures are system design failures, not model failures. This evaluation checks whether the infrastructure enables true autonomy: agents that run unattended in isolated, reproducible environments, bounded by system constraints rather than human intervention.
## Evaluation Process
### 1. Gather Evidence
Explore the codebase for indicators across all 12 principles; a minimal scan sketch follows the file lists below. Key files to examine:
**Environment & Isolation:**
- `Dockerfile`, `docker-compose.yml`, `.devcontainer/`
- `Makefile`, `setup.sh`, `bootstrap.sh`
- CI configs (`.github/workflows/`, `.gitlab-ci.yml`, `Jenkinsfile`)
- Nix files, `devbox.json`, `flake.nix`
**Dependencies & State:**
- Lockfiles (`package-lock.json`, `yarn.lock`, `Pipfile.lock`, `Cargo.lock`, `go.sum`)
- Database configs, migration files, seed scripts
- `.env.example`, config templates
**Execution & Interfaces:**
- CLI entry points, `bin/` scripts
- API definitions, OpenAPI specs
- Background job configs (Sidekiq, Celery, Bull)
- Timeout/limit configurations
**Quality & Monitoring:**
- Test suites, benchmark files
- Logging configuration
- Cost tracking, rate limiting setup
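This first pass can be partially automated. The sketch below is illustrative only: the category names and indicator paths are assumptions that mirror the lists above and should be adapted to the repository under review.
```python
from pathlib import Path

# Assumed mapping from evidence category to indicator paths (extend per project).
INDICATORS = {
    "environment_isolation": [
        "Dockerfile", "docker-compose.yml", ".devcontainer", "Makefile",
        "setup.sh", "bootstrap.sh", ".github/workflows", ".gitlab-ci.yml",
        "Jenkinsfile", "devbox.json", "flake.nix",
    ],
    "dependencies_state": [
        "package-lock.json", "yarn.lock", "Pipfile.lock", "Cargo.lock",
        "go.sum", ".env.example",
    ],
    "execution_interfaces": ["bin", "Makefile", "openapi.yaml", "openapi.json"],
    "quality_monitoring": ["tests", "test", "benchmarks"],
}

def gather_evidence(repo_root: str = ".") -> dict[str, list[str]]:
    """Return the indicator paths that actually exist under repo_root."""
    root = Path(repo_root)
    return {
        category: [name for name in names if (root / name).exists()]
        for category, names in INDICATORS.items()
    }

if __name__ == "__main__":
    for category, found in gather_evidence().items():
        print(f"{category}: {', '.join(found) if found else 'no evidence found'}")
```
Evidence found this way still needs manual review: a Dockerfile that exists but does not build, or a lockfile that is out of date, should not score as full credit.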
### 2. Score Each Principle
Read [evaluation-criteria.md](references/evaluation-criteria.md) for the detailed scoring rubric.
Score each of the 12 principles 0-3:
- **3**: Fully implemented with clear evidence
- **2**: Partially implemented, room for improvement
- **1**: Minimal awareness, significant gaps
- **0**: No evidence
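One way to hold the result of this step, sketched here with assumed field names rather than a prescribed schema, is a small record per principle:
```python
from dataclasses import dataclass

@dataclass
class PrincipleScore:
    """One scored principle; field names are illustrative, not prescribed."""
    number: int    # 1-12, matching the report table below
    name: str      # e.g. "Sandbox Everything"
    score: int     # 0-3 per the rubric above
    evidence: str  # brief pointer to the files or configs found

example = PrincipleScore(1, "Sandbox Everything", 3, "Dockerfile and .devcontainer/ present")
```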
### 3. Generate Report
Output format:
```markdown
# Agent-Ready Evaluation Report
**Overall Score: X/36** (Y%)
**Rating: [Excellent|Good|Needs Work|Not Agent-Ready]**
## Summary
[2-3 sentence assessment of overall agent-readiness]
## Principle Scores
| Principle | Score | Evidence |
|-----------|-------|----------|
| 1. Sandbox Everything | X/3 | [brief evidence] |
| 2. No External DB Dependencies | X/3 | [brief evidence] |
| 3. Clean Environment | X/3 | [brief evidence] |
| 4. Session-Independent Execution | X/3 | [brief evidence] |
| 5. Outcome-Based Instructions | X/3 | [brief evidence] |
| 6. Direct Low-Level Interfaces | X/3 | [brief evidence] |
| 7. Minimal Framework Overhead | X/3 | [brief evidence] |
| 8. Explicit State Persistence | X/3 | [brief evidence] |
| 9. Early Benchmarks | X/3 | [brief evidence] |
| 10. Cost Planning | X/3 | [brief evidence] |
| 11. Verifiable Output | X/3 | [brief evidence] |
| 12. Infrastructure-Bounded Permissions | X/3 | [brief evidence] |
## Top 3 Improvements
1. **[Highest impact improvement]**
- Current state: ...
- Recommendation: ...
- Impact: ...
2. **[Second improvement]**
...
3. **[Third improvement]**
...
## Strengths
- [What the codebase does well for agents]
## Detailed Findings
[Optional: deeper analysis of specific areas]
```
## Rating Scale
- **30-36 (83-100%)**: Excellent - Ready for autonomous agent execution
- **24-29 (67-82%)**: Good - Minor improvements needed
- **18-23 (50-66%)**: Needs Work - Significant gaps to address
- **0-17 (<50%)**: Not Agent-Ready - Major architectural changes needed
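For reference, here is a minimal sketch of how these thresholds map twelve 0-3 scores to an overall rating; the function name and return shape are assumptions, the thresholds mirror the scale above.
```python
def overall_rating(principle_scores: list[int]) -> tuple[int, int, str]:
    """Aggregate twelve 0-3 principle scores into a total, percentage, and rating."""
    assert len(principle_scores) == 12 and all(0 <= s <= 3 for s in principle_scores)
    total = sum(principle_scores)          # maximum 36
    percent = round(total / 36 * 100)
    if total >= 30:
        rating = "Excellent"
    elif total >= 24:
        rating = "Good"
    elif total >= 18:
        rating = "Needs Work"
    else:
        rating = "Not Agent-Ready"
    return total, percent, rating

# A codebase scoring mostly 2s with a few 3s lands in "Good":
print(overall_rating([2, 2, 3, 2, 2, 3, 2, 2, 1, 2, 3, 2]))  # (26, 72, 'Good')
```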
## Quick Checks
If time is limited, prioritize these high-signal indicators; a quick-check sketch follows the list:
1. **Dockerfile exists?** → Sandboxing potential
2. **Lockfiles present?** → Reproducibility
3. **No external DB in default config?** → Isolation
4. **CLI scripts in bin/ or Makefile?** → Direct interfaces
5. **Tests with assertions?** → Verifiable output
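The file-system checks above can be scripted. The sketch below covers the four that reduce to path existence (the external-DB check usually needs a manual look at default configs); the paths and key names are assumptions.
```python
from pathlib import Path

def quick_checks(repo_root: str = ".") -> dict[str, bool]:
    """Fast, file-system-only snapshot of the quick checks above."""
    root = Path(repo_root)
    lockfiles = ["package-lock.json", "yarn.lock", "Pipfile.lock", "Cargo.lock", "go.sum"]
    return {
        "dockerfile_present": (root / "Dockerfile").exists(),
        "lockfile_present": any((root / name).exists() for name in lockfiles),
        "cli_entry_points": (root / "bin").is_dir() or (root / "Makefile").exists(),
        "tests_present": any(root.glob("test*")) or any(root.glob("**/tests")),
    }
```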