
agent-ready-eval skill

/skills/agent-ready-eval

This skill evaluates a codebase for autonomous agent readiness by auditing environment isolation, interfaces, state management, and observability.

npx playbooks add skill petekp/claude-code-setup --skill agent-ready-eval

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
---
name: agent-ready-eval
description: Evaluate a codebase for agent-friendliness based on autonomous agent best practices. Use when asked to "evaluate for agents", "check agent readiness", "audit for autonomous execution", "assess agent-friendliness", or when reviewing infrastructure for unattended agent operation. Also use when asked about making a codebase more suitable for AI agents or autonomous workflows.
---

# Agent-Ready Evaluation

Evaluate how well a codebase supports autonomous agent execution based on the "How to Get Out of Your Agent's Way" principles.

## Core Philosophy

Autonomous agents fail for predictable reasons, and most of those failures are system design failures, not model failures. This evaluation checks whether the infrastructure enables true autonomy: agents that run unattended in isolated, reproducible environments, bounded by system constraints rather than human intervention.

## Evaluation Process

### 1. Gather Evidence

Explore the codebase for indicators across all 12 principles. Key files to examine (a scanning sketch follows these lists):

**Environment & Isolation:**
- `Dockerfile`, `docker-compose.yml`, `.devcontainer/`
- `Makefile`, `setup.sh`, `bootstrap.sh`
- CI configs (`.github/workflows/`, `.gitlab-ci.yml`, `Jenkinsfile`)
- Nix files, `devbox.json`, `flake.nix`

**Dependencies & State:**
- Lockfiles (`package-lock.json`, `yarn.lock`, `Pipfile.lock`, `Cargo.lock`, `go.sum`)
- Database configs, migration files, seed scripts
- `.env.example`, config templates

**Execution & Interfaces:**
- CLI entry points, `bin/` scripts
- API definitions, OpenAPI specs
- Background job configs (Sidekiq, Celery, Bull)
- Timeout/limit configurations

**Quality & Monitoring:**
- Test suites, benchmark files
- Logging configuration
- Cost tracking, rate limiting setup
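A minimal sketch of what this evidence-gathering pass can look like. The file lists above are the signal; the glob patterns below are illustrative assumptions, not an exhaustive or authoritative set.

```python
from pathlib import Path

# Illustrative indicator globs grouped by the categories above. The exact
# patterns are an assumption; extend them to match the stack being audited.
INDICATORS = {
    "environment_isolation": [
        "Dockerfile", "docker-compose.yml", ".devcontainer/*", "Makefile",
        "setup.sh", "bootstrap.sh", ".github/workflows/*", ".gitlab-ci.yml",
        "Jenkinsfile", "flake.nix", "devbox.json",
    ],
    "dependencies_state": [
        "package-lock.json", "yarn.lock", "Pipfile.lock", "Cargo.lock",
        "go.sum", ".env.example", "migrations/*", "db/migrate/*",
    ],
    "execution_interfaces": ["bin/*", "openapi.yaml", "openapi.json", "Procfile"],
    "quality_monitoring": ["tests/*", "test/*", "benchmarks/*"],
}


def gather_evidence(repo_root: str = ".") -> dict:
    """Return indicator files actually present in the repo, grouped by category."""
    root = Path(repo_root)
    evidence = {}
    for category, patterns in INDICATORS.items():
        hits = set()
        for pattern in patterns:
            hits.update(str(p.relative_to(root)) for p in root.glob(pattern))
        evidence[category] = sorted(hits)
    return evidence


if __name__ == "__main__":
    for category, files in gather_evidence().items():
        print(f"{category}: {', '.join(files) or 'no evidence found'}")
```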

### 2. Score Each Principle

Read [evaluation-criteria.md](references/evaluation-criteria.md) for the detailed scoring rubric.

Score each of the 12 principles 0-3:
- **3**: Fully implemented with clear evidence
- **2**: Partially implemented, room for improvement
- **1**: Minimal awareness, significant gaps
- **0**: No evidence
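A minimal sketch of how per-principle scores can be carried into the report. The `PrincipleScore` structure is an assumption used for illustration, not part of the skill itself; its fields mirror the columns of the report table below.

```python
from dataclasses import dataclass


@dataclass
class PrincipleScore:
    """One row of the report's principle table."""
    name: str
    score: int      # 0-3, per the rubric above
    evidence: str   # brief pointer to the files or configs that justify the score

    def __post_init__(self):
        if not 0 <= self.score <= 3:
            raise ValueError(f"score must be between 0 and 3, got {self.score}")


# Example: two of the twelve principles scored from gathered evidence.
scores = [
    PrincipleScore("Sandbox Everything", 3, "Dockerfile and .devcontainer/ present"),
    PrincipleScore("No External DB Dependencies", 1, "default config points at a hosted database"),
]
total = sum(s.score for s in scores)  # reported out of 36 once all 12 are scored
```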

### 3. Generate Report

Output format:

```markdown
# Agent-Ready Evaluation Report

**Overall Score: X/36** (Y%)
**Rating: [Excellent|Good|Needs Work|Not Agent-Ready]**

## Summary
[2-3 sentence assessment of overall agent-readiness]

## Principle Scores

| Principle | Score | Evidence |
|-----------|-------|----------|
| 1. Sandbox Everything | X/3 | [brief evidence] |
| 2. No External DB Dependencies | X/3 | [brief evidence] |
| 3. Clean Environment | X/3 | [brief evidence] |
| 4. Session-Independent Execution | X/3 | [brief evidence] |
| 5. Outcome-Based Instructions | X/3 | [brief evidence] |
| 6. Direct Low-Level Interfaces | X/3 | [brief evidence] |
| 7. Minimal Framework Overhead | X/3 | [brief evidence] |
| 8. Explicit State Persistence | X/3 | [brief evidence] |
| 9. Early Benchmarks | X/3 | [brief evidence] |
| 10. Cost Planning | X/3 | [brief evidence] |
| 11. Verifiable Output | X/3 | [brief evidence] |
| 12. Infrastructure-Bounded Permissions | X/3 | [brief evidence] |

## Top 3 Improvements

1. **[Highest impact improvement]**
   - Current state: ...
   - Recommendation: ...
   - Impact: ...

2. **[Second improvement]**
   ...

3. **[Third improvement]**
   ...

## Strengths
- [What the codebase does well for agents]

## Detailed Findings
[Optional: deeper analysis of specific areas]
```

## Rating Scale

- **30-36 (83-100%)**: Excellent - Ready for autonomous agent execution
- **24-29 (67-82%)**: Good - Minor improvements needed
- **18-23 (50-66%)**: Needs Work - Significant gaps to address
- **0-17 (<50%)**: Not Agent-Ready - Major architectural changes needed
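As a sketch, the same thresholds expressed over the 0-36 total:

```python
def rating(total: int) -> str:
    """Map the 0-36 total to the rating bands above."""
    if total >= 30:
        return "Excellent"
    if total >= 24:
        return "Good"
    if total >= 18:
        return "Needs Work"
    return "Not Agent-Ready"


assert rating(31) == "Excellent"
assert rating(25) == "Good"
assert rating(17) == "Not Agent-Ready"
```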

## Quick Checks

If time is limited, prioritize these high-signal indicators:

1. **Dockerfile exists?** → Sandboxing potential
2. **Lockfiles present?** → Reproducibility
3. **No external DB in default config?** → Isolation
4. **CLI scripts in bin/ or Makefile?** → Direct interfaces
5. **Tests with assertions?** → Verifiable output
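A sketch of these five checks as a single pass. The specific paths and heuristics are assumptions and will vary by stack; treat misses as prompts to look deeper, not as verdicts.

```python
from pathlib import Path


def quick_checks(repo_root: str = ".") -> dict:
    """Fast presence checks for the five high-signal indicators above."""
    root = Path(repo_root)
    lockfiles = ("package-lock.json", "yarn.lock", "Pipfile.lock", "Cargo.lock", "go.sum")
    env_example = root / ".env.example"
    env_text = env_example.read_text() if env_example.exists() else ""
    return {
        "sandboxing": (root / "Dockerfile").exists() or (root / ".devcontainer").is_dir(),
        "reproducibility": any((root / name).exists() for name in lockfiles),
        # Crude proxy for "no external DB by default": the template env either
        # has no DATABASE_URL or points it at localhost.
        "isolation": "DATABASE_URL" not in env_text or "localhost" in env_text,
        "direct_interfaces": (root / "Makefile").exists() or (root / "bin").is_dir(),
        "verifiable_output": any(root.glob("**/test*")) or any(root.glob("**/*_test.*")),
    }


if __name__ == "__main__":
    print(quick_checks())
```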

Overview

This skill evaluates a codebase for agent-friendliness using autonomous-agent best practices. It inspects repository artifacts, runtime configuration, and infrastructure constraints to determine how well the project supports unattended, reproducible, and bounded autonomous execution. Results are given as per-principle scores and clear, actionable recommendations.

How this skill works

The evaluator scans repository files and CI/configuration artifacts to gather evidence across 12 agent-readiness principles. It checks for sandboxing (Docker, devcontainers), reproducibility (lockfiles, build scripts), isolation from external services, direct execution interfaces (CLI, Makefile), explicit persistence, time/cost limits, and verifiable outputs (tests, benchmarks). Each principle is scored 0–3 and compiled into a concise report with an overall rating and prioritized improvements.

When to use it

  • When asked to “evaluate for agents” or “check agent readiness”
  • Before enabling unattended autonomous workflows or CI-driven agents
  • When onboarding AI agents that must run without human intervention
  • When designing or refactoring infrastructure to support agent execution
  • When reviewing security, cost, or reliability constraints for agents

Best practices

  • Provide a Dockerfile or devcontainer to guarantee sandboxed runtime
  • Include dependency lockfiles and reproducible build scripts
  • Avoid default external DBs; provide local in-memory or seeded options
  • Expose direct CLI or Make targets for common operations
  • Add tests and benchmarks to make outputs verifiable
  • Declare timeouts, rate limits, and cost constraints in config
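For the last point, a minimal sketch of what declared limits can look like, assuming a hypothetical agent-limits.json; the file name and keys are illustrative only, not a standard.

```python
import json
import time
from pathlib import Path

# Hypothetical limits file and keys, shown only to illustrate the practice.
DEFAULT_LIMITS = {"timeout_seconds": 900, "max_cost_usd": 5.0, "max_requests_per_minute": 60}


def load_limits(path: str = "agent-limits.json") -> dict:
    """Read declared limits, falling back to conservative defaults."""
    p = Path(path)
    declared = json.loads(p.read_text()) if p.exists() else {}
    return {**DEFAULT_LIMITS, **declared}


limits = load_limits()
deadline = time.monotonic() + limits["timeout_seconds"]
# An agent runner would compare time.monotonic() against `deadline` and its
# accumulated spend against limits["max_cost_usd"] between steps, stopping
# the run as soon as either bound is exceeded.
```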

Example use cases

  • Audit a personal Claude Code repo for unattended execution readiness
  • Prepare a microservice to be safely operated by autonomous agents
  • Assess CI pipelines and devcontainers for agent sandboxing
  • Convert a developer-only workflow into a reproducible agent-friendly process
  • Prioritize infrastructure changes that enable safe autonomous runs

FAQ

What files does the evaluator look for first?

It prioritizes Dockerfile/devcontainer, lockfiles, CLI/Make targets, CI workflows, and .env/config templates as high-signal indicators.

How are scores interpreted?

Each of the 12 principles is scored 0–3. Totals map to four ratings: Excellent, Good, Needs Work, or Not Agent-Ready, with clear thresholds for prioritizing fixes.

Can this assessment be done quickly?

Yes. Run the quick checks first: Dockerfile presence, lockfiles, no external DB in the default config, CLI scripts, and tests. These give a fast, high-signal snapshot.