home / skills / vadimcomanescu / codex-skills / senior-prompt-engineer

senior-prompt-engineer skill

/skills/.curated/ai/senior-prompt-engineer

This skill treats prompts as verifiable products, guiding deconstruction, tooling patterns, and evaluation design to produce reliable AI assistants.

npx playbooks add skill vadimcomanescu/codex-skills --skill senior-prompt-engineer

Review the files below or copy the command above to add this skill to your agents.

Files (6)

SKILL.md

1.1 KB

---
name: senior-prompt-engineer
description: "Prompt engineering workflow for building reliable assistants and agents: task decomposition, instruction hierarchy, tool-use patterns, safety constraints, and evaluation design. Use when writing or refactoring system prompts, creating structured prompts, building prompt test suites, or debugging regressions in LLM behavior."
---

# Senior Prompt Engineer

Treat prompts like products: versioned, tested, and measurable.

## Quick Start
1) Define the job: inputs, outputs, and the “definition of done”.
2) Write the smallest prompt that:
   - states constraints clearly
   - defines output format
   - includes edge-case handling
3) Add examples only when needed (few-shot is expensive).
4) Create an eval set: representative cases + adversarial cases.
5) Iterate with diffs: change one thing, measure impact.

## Optional tool: scaffold a prompt + eval harness
```bash
python ~/.codex/skills/senior-prompt-engineer/scripts/scaffold_prompt_eval.py . --out evals/prompt_eval
```

## References
- Prompt review checklist: `references/prompt-review.md`

Overview

This skill provides a disciplined prompt engineering workflow for building reliable LLM assistants and agents. It focuses on task decomposition, instruction hierarchies, tool-use patterns, safety constraints, and evaluation design to produce predictable, testable prompts. The goal is to treat prompts like products: versioned, tested, and measurable.

How this skill works

The skill guides you to define the job precisely (inputs, outputs, and a clear definition of done) and craft the smallest prompt that expresses constraints, output format, and edge-case handling. It emphasizes adding examples only when necessary and constructing a representative eval set that includes adversarial cases. Iteration is done via controlled diffs and measurable evaluations to track regressions.

When to use it

Writing or refactoring system prompts for production assistants
Designing structured prompts for multi-step agents or tool use
Creating prompt test suites and evaluation harnesses
Debugging regressions or unpredictable LLM behavior
Defining safety constraints and instruction hierarchies for models

Best practices

Define the job up front: list inputs, desired outputs, and a concrete definition of done
Keep prompts minimal: state constraints and output format explicitly before adding examples
Use few-shot examples sparingly; prefer clear instructions and edge-case rules
Build an eval set with representative and adversarial cases to catch regressions
Change one element at a time and measure its impact with automated tests
Version prompts and record diffs so rollbacks and comparisons are reproducible

Example use cases

Scaffold a new system prompt for a customer-support agent with required response fields and safety checks
Refactor a multi-tool agent prompt to separate task decomposition from tool-invocation instructions
Create an automated eval harness with representative and adversarial tests to validate prompt changes
Debug an LLM regression by iterating single-change diffs and measuring evaluation metrics
Define instruction hierarchies and guardrails for high-risk outputs (legal, medical, financial)

FAQ

How many examples should I include in a prompt?

Include examples only when they demonstrably improve output quality; start with zero and add 1–3 targeted examples if specific formatting or style fails.

How do I catch subtle regressions?

Use a dedicated eval suite with both representative and adversarial cases, run it on each prompt change, and apply single-element diffs so you can attribute effects to individual edits.