
agent-ready-eval skill

/skills/agent-ready-eval

This skill evaluates a codebase for autonomous agent readiness by auditing environment isolation, interfaces, state management, and observability.

npx playbooks add skill petekp/claude-code-setup --skill agent-ready-eval

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
---
name: agent-ready-eval
description: Evaluate a codebase for agent-friendliness based on autonomous agent best practices. Use when asked to "evaluate for agents", "check agent readiness", "audit for autonomous execution", "assess agent-friendliness", or when reviewing infrastructure for unattended agent operation. Also use when asked about making a codebase more suitable for AI agents or autonomous workflows.
---

# Agent-Ready Evaluation

Evaluate how well a codebase supports autonomous agent execution based on the "How to Get Out of Your Agent's Way" principles.

## Core Philosophy

Autonomous agents fail for predictable reasons, and most of those failures are system design failures, not model failures. This evaluation checks whether the infrastructure enables true autonomy: agents that run unattended in isolated, reproducible environments, bounded by system constraints rather than human intervention.

## Evaluation Process

### 1. Gather Evidence

Explore the codebase for indicators across all 12 principles. Key files to examine (a scanning sketch follows these lists):

**Environment & Isolation:**
- `Dockerfile`, `docker-compose.yml`, `.devcontainer/`
- `Makefile`, `setup.sh`, `bootstrap.sh`
- CI configs (`.github/workflows/`, `.gitlab-ci.yml`, `Jenkinsfile`)
- Nix files, `devbox.json`, `flake.nix`

**Dependencies & State:**
- Lockfiles (`package-lock.json`, `yarn.lock`, `Pipfile.lock`, `Cargo.lock`, `go.sum`)
- Database configs, migration files, seed scripts
- `.env.example`, config templates

**Execution & Interfaces:**
- CLI entry points, `bin/` scripts
- API definitions, OpenAPI specs
- Background job configs (Sidekiq, Celery, Bull)
- Timeout/limit configurations

**Quality & Monitoring:**
- Test suites, benchmark files
- Logging configuration
- Cost tracking, rate limiting setup
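A minimal sketch of what this evidence-gathering pass can look like. The file lists above are the signal; the glob patterns below are illustrative assumptions, not an exhaustive or authoritative set.

```python
from pathlib import Path

# Illustrative indicator globs grouped by the categories above. The exact
# patterns are an assumption; extend them to match the stack being audited.
INDICATORS = {
    "environment_isolation": [
        "Dockerfile", "docker-compose.yml", ".devcontainer/*", "Makefile",
        "setup.sh", "bootstrap.sh", ".github/workflows/*", ".gitlab-ci.yml",
        "Jenkinsfile", "flake.nix", "devbox.json",
    ],
    "dependencies_state": [
        "package-lock.json", "yarn.lock", "Pipfile.lock", "Cargo.lock",
        "go.sum", ".env.example", "migrations/*", "db/migrate/*",
    ],
    "execution_interfaces": ["bin/*", "openapi.yaml", "openapi.json", "Procfile"],
    "quality_monitoring": ["tests/*", "test/*", "benchmarks/*"],
}


def gather_evidence(repo_root: str = ".") -> dict:
    """Return indicator files actually present in the repo, grouped by category."""
    root = Path(repo_root)
    evidence = {}
    for category, patterns in INDICATORS.items():
        hits = set()
        for pattern in patterns:
            hits.update(str(p.relative_to(root)) for p in root.glob(pattern))
        evidence[category] = sorted(hits)
    return evidence


if __name__ == "__main__":
    for category, files in gather_evidence().items():
        print(f"{category}: {', '.join(files) or 'no evidence found'}")
```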

### 2. Score Each Principle

Read [evaluation-criteria.md](references/evaluation-criteria.md) for the detailed scoring rubric.

Score each of the 12 principles 0-3:
- **3**: Fully implemented with clear evidence
- **2**: Partially implemented, room for improvement
- **1**: Minimal awareness, significant gaps
- **0**: No evidence
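A minimal sketch of how per-principle scores can be carried into the report. The `PrincipleScore` structure is an assumption used for illustration, not part of the skill itself; its fields mirror the columns of the report table below.

```python
from dataclasses import dataclass


@dataclass
class PrincipleScore:
    """One row of the report's principle table."""
    name: str
    score: int      # 0-3, per the rubric above
    evidence: str   # brief pointer to the files or configs that justify the score

    def __post_init__(self):
        if not 0 <= self.score <= 3:
            raise ValueError(f"score must be between 0 and 3, got {self.score}")


# Example: two of the twelve principles scored from gathered evidence.
scores = [
    PrincipleScore("Sandbox Everything", 3, "Dockerfile and .devcontainer/ present"),
    PrincipleScore("No External DB Dependencies", 1, "default config points at a hosted database"),
]
total = sum(s.score for s in scores)  # reported out of 36 once all 12 are scored
```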

### 3. Generate Report

Output format:

```markdown
# Agent-Ready Evaluation Report

**Overall Score: X/36** (Y%)
**Rating: [Excellent|Good|Needs Work|Not Agent-Ready]**

## Summary
[2-3 sentence assessment of overall agent-readiness]

## Principle Scores

| Principle | Score | Evidence |
|-----------|-------|----------|
| 1. Sandbox Everything | X/3 | [brief evidence] |
| 2. No External DB Dependencies | X/3 | [brief evidence] |
| 3. Clean Environment | X/3 | [brief evidence] |
| 4. Session-Independent Execution | X/3 | [brief evidence] |
| 5. Outcome-Based Instructions | X/3 | [brief evidence] |
| 6. Direct Low-Level Interfaces | X/3 | [brief evidence] |
| 7. Minimal Framework Overhead | X/3 | [brief evidence] |
| 8. Explicit State Persistence | X/3 | [brief evidence] |
| 9. Early Benchmarks | X/3 | [brief evidence] |
| 10. Cost Planning | X/3 | [brief evidence] |
| 11. Verifiable Output | X/3 | [brief evidence] |
| 12. Infrastructure-Bounded Permissions | X/3 | [brief evidence] |

## Top 3 Improvements

1. **[Highest impact improvement]**
   - Current state: ...
   - Recommendation: ...
   - Impact: ...

2. **[Second improvement]**
   ...

3. **[Third improvement]**
   ...

## Strengths
- [What the codebase does well for agents]

## Detailed Findings
[Optional: deeper analysis of specific areas]
```

## Rating Scale

- **30-36 (83-100%)**: Excellent - Ready for autonomous agent execution
- **24-29 (67-82%)**: Good - Minor improvements needed
- **18-23 (50-66%)**: Needs Work - Significant gaps to address
- **0-17 (<50%)**: Not Agent-Ready - Major architectural changes needed
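As a sketch, the same thresholds expressed over the 0-36 total:

```python
def rating(total: int) -> str:
    """Map the 0-36 total to the rating bands above."""
    if total >= 30:
        return "Excellent"
    if total >= 24:
        return "Good"
    if total >= 18:
        return "Needs Work"
    return "Not Agent-Ready"


assert rating(31) == "Excellent"
assert rating(25) == "Good"
assert rating(17) == "Not Agent-Ready"
```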

## Quick Checks

If time is limited, prioritize these high-signal indicators:

1. **Dockerfile exists?** → Sandboxing potential
2. **Lockfiles present?** → Reproducibility
3. **No external DB in default config?** → Isolation
4. **CLI scripts in bin/ or Makefile?** → Direct interfaces
5. **Tests with assertions?** → Verifiable output
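A sketch of these five checks as a single pass. The specific paths and heuristics are assumptions and will vary by stack; treat misses as prompts to look deeper, not as verdicts.

```python
from pathlib import Path


def quick_checks(repo_root: str = ".") -> dict:
    """Fast presence checks for the five high-signal indicators above."""
    root = Path(repo_root)
    lockfiles = ("package-lock.json", "yarn.lock", "Pipfile.lock", "Cargo.lock", "go.sum")
    env_example = root / ".env.example"
    env_text = env_example.read_text() if env_example.exists() else ""
    return {
        "sandboxing": (root / "Dockerfile").exists() or (root / ".devcontainer").is_dir(),
        "reproducibility": any((root / name).exists() for name in lockfiles),
        # Crude proxy for "no external DB by default": the template env either
        # has no DATABASE_URL or points it at localhost.
        "isolation": "DATABASE_URL" not in env_text or "localhost" in env_text,
        "direct_interfaces": (root / "Makefile").exists() or (root / "bin").is_dir(),
        "verifiable_output": any(root.glob("**/test*")) or any(root.glob("**/*_test.*")),
    }


if __name__ == "__main__":
    print(quick_checks())
```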

Overview

This skill evaluates a codebase for agent-friendliness using autonomous-agent best practices. It inspects repository artifacts, runtime configuration, and infrastructure constraints to determine how well the project supports unattended, reproducible, and bounded autonomous execution. Results are given as per-principle scores and clear, actionable recommendations.

How this skill works

The evaluator scans repository files and CI/configuration artifacts to gather evidence across 12 agent-readiness principles. It checks for sandboxing (Docker, devcontainers), reproducibility (lockfiles, build scripts), isolation from external services, direct execution interfaces (CLI, Makefile), explicit persistence, time/cost limits, and verifiable outputs (tests, benchmarks). Each principle is scored 0–3 and compiled into a concise report with an overall rating and prioritized improvements.

When to use it

  • When asked to “evaluate for agents” or “check agent readiness”
  • Before enabling unattended autonomous workflows or CI-driven agents
  • When onboarding AI agents that must run without human intervention
  • When designing or refactoring infrastructure to support agent execution
  • When reviewing security, cost, or reliability constraints for agents

Best practices

  • Provide a Dockerfile or devcontainer to guarantee sandboxed runtime
  • Include dependency lockfiles and reproducible build scripts
  • Avoid default external DBs; provide local in-memory or seeded options
  • Expose direct CLI or Make targets for common operations
  • Add tests and benchmarks to make outputs verifiable
  • Declare timeouts, rate limits, and cost constraints in config
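For the last point, a minimal sketch of what declared limits can look like, assuming a hypothetical agent-limits.json; the file name and keys are illustrative only, not a standard.

```python
import json
import time
from pathlib import Path

# Hypothetical limits file and keys, shown only to illustrate the practice.
DEFAULT_LIMITS = {"timeout_seconds": 900, "max_cost_usd": 5.0, "max_requests_per_minute": 60}


def load_limits(path: str = "agent-limits.json") -> dict:
    """Read declared limits, falling back to conservative defaults."""
    p = Path(path)
    declared = json.loads(p.read_text()) if p.exists() else {}
    return {**DEFAULT_LIMITS, **declared}


limits = load_limits()
deadline = time.monotonic() + limits["timeout_seconds"]
# An agent runner would compare time.monotonic() against `deadline` and its
# accumulated spend against limits["max_cost_usd"] between steps, stopping
# the run as soon as either bound is exceeded.
```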

Example use cases

  • Audit a personal Claude Code repo for unattended execution readiness
  • Prepare a microservice to be safely operated by autonomous agents
  • Assess CI pipelines and devcontainers for agent sandboxing
  • Convert a developer-only workflow into a reproducible agent-friendly process
  • Prioritize infrastructure changes that enable safe autonomous runs

FAQ

What files does the evaluator look for first?

It prioritizes Dockerfile/devcontainer, lockfiles, CLI/Make targets, CI workflows, and .env/config templates as high-signal indicators.

How are scores interpreted?

Each of the 12 principles is scored 0–3. Totals map to four ratings: Excellent, Good, Needs Work, or Not Agent-Ready, with clear thresholds for prioritizing fixes.

Can this assessment be done quickly?

Yes. Run the quick checks first: Dockerfile presence, lockfiles, no external DB in the default config, CLI scripts, and tests. These give a fast, high-signal snapshot.