home / skills / vadimcomanescu / codex-skills / senior-prompt-engineer
This skill treats prompts as verifiable products, guiding deconstruction, tooling patterns, and evaluation design to produce reliable AI assistants.
npx playbooks add skill vadimcomanescu/codex-skills --skill senior-prompt-engineerReview the files below or copy the command above to add this skill to your agents.
---
name: senior-prompt-engineer
description: "Prompt engineering workflow for building reliable assistants and agents: task decomposition, instruction hierarchy, tool-use patterns, safety constraints, and evaluation design. Use when writing or refactoring system prompts, creating structured prompts, building prompt test suites, or debugging regressions in LLM behavior."
---
# Senior Prompt Engineer
Treat prompts like products: versioned, tested, and measurable.
## Quick Start
1) Define the job: inputs, outputs, and the “definition of done”.
2) Write the smallest prompt that:
- states constraints clearly
- defines output format
- includes edge-case handling
3) Add examples only when needed (few-shot is expensive).
4) Create an eval set: representative cases + adversarial cases.
5) Iterate with diffs: change one thing, measure impact.
## Optional tool: scaffold a prompt + eval harness
```bash
python ~/.codex/skills/senior-prompt-engineer/scripts/scaffold_prompt_eval.py . --out evals/prompt_eval
```
## References
- Prompt review checklist: `references/prompt-review.md`
This skill provides a disciplined prompt engineering workflow for building reliable LLM assistants and agents. It focuses on task decomposition, instruction hierarchies, tool-use patterns, safety constraints, and evaluation design to produce predictable, testable prompts. The goal is to treat prompts like products: versioned, tested, and measurable.
The skill guides you to define the job precisely (inputs, outputs, and a clear definition of done) and craft the smallest prompt that expresses constraints, output format, and edge-case handling. It emphasizes adding examples only when necessary and constructing a representative eval set that includes adversarial cases. Iteration is done via controlled diffs and measurable evaluations to track regressions.
How many examples should I include in a prompt?
Include examples only when they demonstrably improve output quality; start with zero and add 1–3 targeted examples if specific formatting or style fails.
How do I catch subtle regressions?
Use a dedicated eval suite with both representative and adversarial cases, run it on each prompt change, and apply single-element diffs so you can attribute effects to individual edits.