home / skills / vadimcomanescu / codex-skills / senior-prompt-engineer

senior-prompt-engineer skill

/skills/.curated/ai/senior-prompt-engineer

This skill treats prompts as verifiable products, guiding deconstruction, tooling patterns, and evaluation design to produce reliable AI assistants.

npx playbooks add skill vadimcomanescu/codex-skills --skill senior-prompt-engineer

Review the files below or copy the command above to add this skill to your agents.

Files (6)
SKILL.md
1.1 KB
---
name: senior-prompt-engineer
description: "Prompt engineering workflow for building reliable assistants and agents: task decomposition, instruction hierarchy, tool-use patterns, safety constraints, and evaluation design. Use when writing or refactoring system prompts, creating structured prompts, building prompt test suites, or debugging regressions in LLM behavior."
---

# Senior Prompt Engineer

Treat prompts like products: versioned, tested, and measurable.

## Quick Start
1) Define the job: inputs, outputs, and the “definition of done”.
2) Write the smallest prompt that:
   - states constraints clearly
   - defines output format
   - includes edge-case handling
3) Add examples only when needed (few-shot is expensive).
4) Create an eval set: representative cases + adversarial cases.
5) Iterate with diffs: change one thing, measure impact.

## Optional tool: scaffold a prompt + eval harness
```bash
python ~/.codex/skills/senior-prompt-engineer/scripts/scaffold_prompt_eval.py . --out evals/prompt_eval
```

## References
- Prompt review checklist: `references/prompt-review.md`

Overview

This skill provides a disciplined prompt engineering workflow for building reliable LLM assistants and agents. It focuses on task decomposition, instruction hierarchies, tool-use patterns, safety constraints, and evaluation design to produce predictable, testable prompts. The goal is to treat prompts like products: versioned, tested, and measurable.

How this skill works

The skill guides you to define the job precisely (inputs, outputs, and a clear definition of done) and craft the smallest prompt that expresses constraints, output format, and edge-case handling. It emphasizes adding examples only when necessary and constructing a representative eval set that includes adversarial cases. Iteration is done via controlled diffs and measurable evaluations to track regressions.

When to use it

  • Writing or refactoring system prompts for production assistants
  • Designing structured prompts for multi-step agents or tool use
  • Creating prompt test suites and evaluation harnesses
  • Debugging regressions or unpredictable LLM behavior
  • Defining safety constraints and instruction hierarchies for models

Best practices

  • Define the job up front: list inputs, desired outputs, and a concrete definition of done
  • Keep prompts minimal: state constraints and output format explicitly before adding examples
  • Use few-shot examples sparingly; prefer clear instructions and edge-case rules
  • Build an eval set with representative and adversarial cases to catch regressions
  • Change one element at a time and measure its impact with automated tests
  • Version prompts and record diffs so rollbacks and comparisons are reproducible

Example use cases

  • Scaffold a new system prompt for a customer-support agent with required response fields and safety checks
  • Refactor a multi-tool agent prompt to separate task decomposition from tool-invocation instructions
  • Create an automated eval harness with representative and adversarial tests to validate prompt changes
  • Debug an LLM regression by iterating single-change diffs and measuring evaluation metrics
  • Define instruction hierarchies and guardrails for high-risk outputs (legal, medical, financial)

FAQ

How many examples should I include in a prompt?

Include examples only when they demonstrably improve output quality; start with zero and add 1–3 targeted examples if specific formatting or style fails.

How do I catch subtle regressions?

Use a dedicated eval suite with both representative and adversarial cases, run it on each prompt change, and apply single-element diffs so you can attribute effects to individual edits.