
codex-readiness-integration-test skill

/skills/.experimental/codex-readiness-integration-test

This skill runs the Codex Readiness integration test end-to-end, validating agentic execution and producing structured evidence and summaries.

npx playbooks add skill openai/skills --skill codex-readiness-integration-test

Review the files below or copy the command above to add this skill to your agents.

Files (17)

SKILL.md (4.0 KB)
---
name: codex-readiness-integration-test
description: Run the Codex Readiness integration test. Use when you need an end-to-end agentic loop with build/test scoring.
metadata:
  short-description: Run Codex Readiness integration test
---

# LLM Codex Readiness Integration Test

This skill runs a multi-stage integration test to validate agentic execution quality. It always runs in execute mode (no read-only mode).

## Outputs

Each run writes to `.codex-readiness-integration-test/<timestamp>/` and updates `.codex-readiness-integration-test/latest.json`.

Outputs written per run (a lookup sketch follows the list):
- `agentic_summary.json` and `logs/agentic.log` (agentic loop execution)
- `llm_results.json` (automatic LLM evaluation)
- `summary.txt` (human-readable summary)
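
To orient yourself after a run, something like the following locates the newest artifacts. This is a sketch: the exact schema of `latest.json` is an assumption here, beyond the fact that it is updated per run.

```sh
# Orientation sketch. Assumes run directories sort chronologically by name and
# that latest.json points at the most recent run (its exact schema is not
# documented on this page).
ls .codex-readiness-integration-test/              # timestamped run directories
cat .codex-readiness-integration-test/latest.json  # updated after every run
```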

## Pre-conditions (Required)

- Authenticate with the Codex CLI using the repo-local HOME before running the test.
  Run these in your own terminal (not via the integration test):
  `HOME=$PWD/.codex-home XDG_CACHE_HOME=$PWD/.codex-home/.cache codex login`
  `HOME=$PWD/.codex-home XDG_CACHE_HOME=$PWD/.codex-home/.cache codex login status`
- The integration test creates `{repo_root}/.codex-home` and `{repo_root}/.codex-home/.cache/codex` as its first step.

## Workflow

0) Ask the user how to source the task.
   - Offer two explicit options: (a) user provides a custom task/prompt, or (b) auto-generate a task.
   - Do not run the entry point until the user chooses one option.
1) Generate or load `{out_dir}/prompt.pending.json`.
   - Use the integration test's expected prompt path, not `prompt.json` at the repo root.
   - With the default out dir, this path is `.codex-readiness-integration-test/prompt.pending.json`.
   - If `--seed-task` is provided, it is used as the starting task.
   - If not provided, generate a task with `skills/codex-readiness-integration-test/references/generate_prompt.md` and save the JSON to `{out_dir}/prompt.pending.json` (a hedged sketch of this file follows the list).
   - The user must approve the prompt before execution (there is no auto-approve mode). Include a summary of the prompt when asking for approval.
2) Execute the agentic loop via Codex CLI (uses `AGENTS.md` and `change_prompt`).
3) Run build/test commands from the prompt plan via `skills/codex-readiness-integration-test/scripts/run_plan.py`.
4) Collect evidence (`evidence.json`), deterministic checks, and run automatic LLM evals via Codex CLI.
5) Score and write the report + summary output.
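
For step 1, a minimal sketch of writing `{out_dir}/prompt.pending.json` with the default out dir. The field names (`task`, `plan`) are illustrative assumptions, not the canonical schema, which `references/generate_prompt.md` defines:

```sh
# Hypothetical sketch only: "task" and "plan" are assumed field names, not the
# canonical schema defined by references/generate_prompt.md.
cat > .codex-readiness-integration-test/prompt.pending.json <<'EOF'
{
  "task": "Describe the change the agentic loop should make",
  "plan": ["make build", "make test"]
}
EOF
```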

## Configuration

Optional fields in `{out_dir}/prompt.pending.json`:
- `agentic_loop`: configure Codex CLI invocation for the agentic loop.
- `llm_eval`: configure Codex CLI invocation for automatic evals.

If these fields are omitted, defaults are used.
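
As an illustration, the optional fields could be filled in as below. The `command` key is an assumption; this page only states that these fields configure the Codex CLI invocation.

```sh
# Hypothetical override sketch: "command" as a key is an assumption. Omitting
# agentic_loop and llm_eval entirely falls back to the defaults.
prompt=.codex-readiness-integration-test/prompt.pending.json
jq '.agentic_loop = {"command": "codex"} | .llm_eval = {"command": "codex"}' \
  "$prompt" > "$prompt.tmp" && mv "$prompt.tmp" "$prompt"
```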

## Requirements

- The LLM evaluator must fail the run if the evidence mentions the phrase `Context compaction enabled` (a pre-check sketch follows this list).
- Use qualitative context-usage evaluation (no strict thresholds).
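
To make the first requirement concrete, a deterministic fail-fast guard might look like this. It is a sketch mirroring the rule, not the actual evaluator, which is an LLM judging the evidence:

```sh
# Sketch of a fail-fast guard; the real gate is the LLM evaluator. Assumes run
# directories sort chronologically by name.
RUN_DIR="$(ls -d .codex-readiness-integration-test/*/ | sort | tail -n 1)"
if grep -q 'Context compaction enabled' "${RUN_DIR}evidence.json"; then
  echo "FAIL: evidence mentions 'Context compaction enabled'" >&2
  exit 1
fi
```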


## What this test covers well

- Runs Codex CLI against the real repo root, producing real filesystem edits and git diffs.
- Executes the approved change prompt and then runs the build/test plan in-repo.
- Captures evidence, deterministic checks, and LLM eval artifacts for review.

## What this test does not represent

- The agentic loop may use non-default flags (e.g., bypassing approvals or the sandbox), so interactive guardrails differ from normal use.
- Uses a dedicated HOME (`.codex-home`), which can change auth/config/cache vs normal CLI use.
- Auto-generated prompts and one-shot execution do not simulate interactive guidance.
- MCP servers/tools are not exercised unless explicitly configured.

## Notes

- The prompts in `skills/codex-readiness-integration-test/references/` expect strict JSON.
- Use `skills/codex-readiness-integration-test/references/json_fix.md` to repair invalid JSON output.
- This skill calls the `codex` CLI. Ensure it is installed and available on PATH, or override the command in `{out_dir}/prompt.pending.json`.
- If the agentic loop detects sandbox-blocked tool access, it writes `requires_escalation: true` to `{run_dir}/agentic_summary.json` and exits with code `3`. Re-run the integration test with escalated permissions in that case (a handling sketch follows these notes).
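
A sketch of reacting to that signal from a wrapper script; `ENTRY_POINT` is a placeholder, since this page does not name the integration test's actual command:

```sh
# Hypothetical wrapper: ENTRY_POINT is a placeholder for the real integration
# test command. Exit code 3 is the documented escalation signal.
"$ENTRY_POINT"
status=$?
if [ "$status" -eq 3 ]; then
  echo "requires_escalation reported; re-run with escalated permissions" >&2
fi
```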

Overview

This skill runs the Codex Readiness integration test to validate end-to-end agentic execution and build/test scoring. It produces structured artifacts, automatic LLM evaluations, and a human-readable summary for each run. The test always executes in write mode and stores its outputs under a timestamped directory plus a latest.json pointer.

How this skill works

The skill prompts you to choose how the task is sourced (provide a custom prompt or auto-generate one) and saves the approved prompt to the expected pending path. It then invokes the codex CLI to run the agentic loop, executes the plan's build/test commands, collects deterministic evidence and LLM evaluations, scores the results, and writes a report and summary. Outputs include agentic logs, llm_results.json, evidence.json, and summary.txt under .codex-readiness-integration-test/<timestamp>, plus an updated latest.json.

When to use it

  • Validate an agentic loop end-to-end against a real repository with filesystem edits and test execution.
  • Run automated LLM-based qualitative evaluations of change evidence and plan outcomes.
  • Smoke-test agent behaviors and build/test integration after changes to agents, prompts, or CI scripts.
  • Generate reproducible artifacts for review, auditing, or debugging of agentic runs.

Best practices

  • Authenticate the Codex CLI beforehand using a repo-local HOME as described, running codex login from your terminal.
  • Do not run the integration entrypoint until you choose how to source the task and approve the prompt summary.
  • Use the provided generate_prompt.md when you need an auto-generated task, and supply a seed task via --seed-task when reproducibility matters.
  • Review agentic_summary.json and logs/agentic.log for sandbox or escalation flags; re-run with escalated permissions if required.
  • Keep an eye on .codex-readiness-integration-test/latest.json to quickly find the most recent run artifacts.

Example use cases

  • Run a pre-release check to ensure an autonomous change loop can execute tests and produce evidence.
  • Compare agentic execution quality before and after prompt or agent configuration changes.
  • Reproduce a previously observed failing run by loading its prompt.pending.json and re-executing the test.
  • Collect artifacts and LLM evaluations to support a postmortem or audit of an automated change.

FAQ

What must I do before running the test?

Authenticate the codex CLI using a repo-local HOME and cache as described, running codex login and codex login status from your terminal.

Where are outputs written?

Each run writes a timestamped directory under .codex-readiness-integration-test/ containing agentic_summary.json, logs, llm_results.json, evidence.json, and summary.txt, and it updates latest.json.

What happens if sandbox blocks tools?

The agentic summary includes requires_escalation: true and the run exits with code 3; re-run with escalated permissions to proceed.