prompt-engineering-suite skill

/plugins/ork/skills/prompt-engineering-suite

This skill designs and optimizes production prompts using chain-of-thought, few-shot learning, and versioning to improve LLM accuracy.

npx playbooks add skill yonatangross/orchestkit --skill prompt-engineering-suite

---
name: prompt-engineering-suite
description: Comprehensive prompt engineering with Chain-of-Thought, few-shot learning, prompt versioning, and optimization. Use when designing prompts, improving accuracy, managing prompt lifecycle.
version: 1.0.0
tags: [prompts, cot, few-shot, versioning, optimization, langfuse, dspy, 2026]
context: fork
agent: prompt-engineer
author: OrchestKit
user-invocable: false
---

# Prompt Engineering Suite

Design, version, and optimize prompts for production LLM applications.

## Overview

Use this skill when:

- Designing prompts for new LLM features
- Improving accuracy with Chain-of-Thought reasoning
- Selecting examples for few-shot learning
- Managing prompts in production (versioning, A/B testing)
- Optimizing prompts automatically with DSPy

## Quick Reference

### Chain-of-Thought Pattern

```python
from langchain_core.prompts import ChatPromptTemplate

COT_SYSTEM = """You are a helpful assistant that solves problems step-by-step.

When solving problems:
1. Break down the problem into clear steps
2. Show your reasoning for each step
3. Verify your answer before responding
4. If uncertain, acknowledge limitations

Format your response as:
STEP 1: [description]
Reasoning: [your thought process]

STEP 2: [description]
Reasoning: [your thought process]

...

FINAL ANSWER: [your conclusion]"""

cot_prompt = ChatPromptTemplate.from_messages([
    ("system", COT_SYSTEM),
    ("human", "Problem: {problem}\n\nThink through this step-by-step."),
])
```
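
Because the system prompt fixes a response format, the conclusion can be pulled out of a reply programmatically. The following is a minimal sketch, assuming the model actually follows the `FINAL ANSWER:` convention; `extract_final_answer` is a hypothetical helper, not part of LangChain, and the sample reply is hand-written.

```python
import re
from typing import Optional

# Hypothetical helper: matches a line that starts with "FINAL ANSWER:".
FINAL_ANSWER_RE = re.compile(r"^FINAL ANSWER:\s*(.+)$", re.MULTILINE)

def extract_final_answer(response: str) -> Optional[str]:
    """Return the content of the last FINAL ANSWER line, or None if the format was not followed."""
    matches = FINAL_ANSWER_RE.findall(response)
    return matches[-1].strip() if matches else None

# Hand-written reply in the format requested by COT_SYSTEM
sample = """STEP 1: Convert the percentage
Reasoning: 15% is 0.15

STEP 2: Multiply
Reasoning: 0.15 * 240 = 36

FINAL ANSWER: 36"""

print(extract_final_answer(sample))  # prints "36"
```

Returning `None` instead of raising lets the caller decide whether to retry or fall back when the model ignores the format.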

### Few-Shot with Dynamic Examples

```python
from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate

examples = [
    {"input": "What is 2+2?", "output": "4"},
    {"input": "What is the capital of France?", "output": "Paris"},
]

few_shot = FewShotChatMessagePromptTemplate(
    examples=examples,
    example_prompt=ChatPromptTemplate.from_messages([
        ("human", "{input}"),
        ("ai", "{output}"),
    ]),
)

final_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Answer concisely."),
    few_shot,
    ("human", "{input}"),
])
```
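
The example list above is static. One way to make it dynamic is to pick the examples closest to the incoming query and place the most similar one last. The sketch below uses plain word overlap as the similarity measure, which is an assumption made for illustration; `token_overlap` and `select_examples` are hypothetical helpers, and a production setup would more likely use embeddings.

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def select_examples(query: str, pool: list[dict], k: int = 3) -> list[dict]:
    """Pick the k examples most similar to the query, ordered so the most similar comes last."""
    ranked = sorted(pool, key=lambda ex: token_overlap(query, ex["input"]))
    return ranked[-k:]

# Illustrative pool; in practice this would be a curated example set
pool = [
    {"input": "What is 2+2?", "output": "4"},
    {"input": "What is the capital of France?", "output": "Paris"},
    {"input": "What is the capital of Japan?", "output": "Tokyo"},
    {"input": "Translate 'hello' to Spanish", "output": "hola"},
]

selected = select_examples("What is the capital of Italy?", pool, k=2)
print([ex["output"] for ex in selected])  # prints ['Paris', 'Tokyo']
```

The resulting list has the same shape as `examples`, so it can be passed as `examples=selected` when building the few-shot template for each request.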

### Prompt Versioning with Langfuse SDK v3

```python
from langfuse import Langfuse
# Note: Langfuse SDK v3 is OTEL-native (Langfuse was acquired by ClickHouse in Jan 2026)

langfuse = Langfuse()

# Get versioned prompt with label
prompt = langfuse.get_prompt(
    name="customer-support-v2",
    label="production",  # production, staging, canary
    cache_ttl_seconds=300,
)

# Compile with variables
compiled = prompt.compile(
    customer_name="John",
    issue="billing question"
)
```
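
Labels are useful because application code asks for `production` while the version behind that label can change. The toy in-memory registry below illustrates the idea, including rollback. It is not how Langfuse is implemented; `PromptRegistry` and its methods are hypothetical, and it uses Python `str.format` placeholders for simplicity.

```python
from dataclasses import dataclass, field

@dataclass
class PromptRegistry:
    """Toy registry: versions are append-only, labels are movable pointers to a version."""
    versions: dict[str, list[str]] = field(default_factory=dict)
    labels: dict[tuple[str, str], int] = field(default_factory=dict)

    def create(self, name: str, template: str) -> int:
        """Store a new version and return its 1-based version number."""
        self.versions.setdefault(name, []).append(template)
        return len(self.versions[name])

    def promote(self, name: str, version: int, label: str) -> None:
        """Point a label at a version; a rollback is just promoting an older version."""
        if not 1 <= version <= len(self.versions.get(name, [])):
            raise ValueError(f"unknown version {version} for prompt {name!r}")
        self.labels[(name, label)] = version

    def get(self, name: str, label: str = "production") -> str:
        """Resolve a label to the template text it currently points at."""
        return self.versions[name][self.labels[(name, label)] - 1]

registry = PromptRegistry()
v1 = registry.create("customer-support", "Hi {customer_name}, about your {issue}:")
v2 = registry.create("customer-support", "Hello {customer_name}! Regarding {issue}:")
registry.promote("customer-support", v1, "production")
registry.promote("customer-support", v2, "staging")

registry.promote("customer-support", v2, "production")  # release version 2
registry.promote("customer-support", v1, "production")  # roll back to version 1

template = registry.get("customer-support")
print(template.format(customer_name="John", issue="billing question"))
# prints "Hi John, about your billing question:"
```

The calling code never changes during the release or the rollback; only the label moves.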

### DSPy 3.1.0 Automatic Optimization

```python
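import dspy

# Configure a language model before compiling, for example:
# dspy.configure(lm=dspy.LM("<provider>/<model>"))  # placeholder model id

class OptimizedQA(dspy.Module):
    def __init__(self):
        super().__init__()  # required so DSPy can track the module's predictors
        self.generate = dspy.Predict("question -> answer")

    def forward(self, question):
        return self.generate(question=question)

def answer_match(example, prediction, trace=None):
    """Metric: exact match on the answer, ignoring case and surrounding whitespace."""
    return example.answer.strip().lower() == prediction.answer.strip().lower()

# Training data must be dspy.Example objects with the input fields marked
trainset = [
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
    # ... add more labeled examples
]

# Optimize with MIPROv2 (recommended) or BootstrapFewShot
optimizer = dspy.MIPROv2(metric=answer_match)  # Data+demo-aware Bayesian optimization
optimized = optimizer.compile(OptimizedQA(), trainset=trainset)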

# Alternative: GEPA (July 2025) - Reflective Prompt Evolution
# Uses model introspection to analyze failures and propose better prompts
```

## Pattern Selection Guide

| Pattern | When to Use | Example Use Case |
|---------|-------------|------------------|
| Zero-shot | Simple, well-defined tasks | Classification, extraction |
| Few-shot | Complex tasks needing examples | Format conversion, style matching |
| CoT | Reasoning, math, logic | Problem solving, analysis |
| Zero-shot CoT | Quick reasoning boost | Add "Let's think step by step" |
| ReAct | Tool use, multi-step | Agent tasks, API calls |
| Structured | JSON/schema output | Data extraction, API responses |
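
For the Structured pattern, the reply is only useful if it parses and matches the expected shape. The sketch below validates a JSON reply against a small hand-written schema; `REQUIRED_FIELDS` and `parse_structured` are hypothetical names chosen for illustration, and a real project might use a schema library instead.

```python
import json

# Hypothetical schema: field name -> expected Python type
REQUIRED_FIELDS = {"name": str, "amount": float}

def parse_structured(response: str) -> dict:
    """Parse a JSON reply and check required fields and types; raise ValueError otherwise."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    for key, expected in REQUIRED_FIELDS.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], expected):
            raise ValueError(f"field {key!r} should be {expected.__name__}")
    return data

# Hand-written reply standing in for model output
invoice = parse_structured('{"name": "ACME", "amount": 36.0}')
print(invoice)  # prints {'name': 'ACME', 'amount': 36.0}
```

Raising a single exception type keeps the retry logic simple: any `ValueError` means the reply should be regenerated or rejected.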

## Key Decisions

| Decision | Recommendation |
|----------|----------------|
| Few-shot examples | 3-5 diverse, representative examples |
| Example ordering | Most similar examples last (recency bias) |
| CoT trigger | "Let's think step by step" or explicit format |
| Prompt versioning | Langfuse with labels (production/staging) |
| A/B testing | 50+ samples, track via trace metadata |
| Auto-optimization | DSPy MIPROv2 for instructions and demos; BootstrapFewShot for few-shot tuning only |
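
The A/B testing row can be sketched without any tracing backend. The code below assigns each user to a variant deterministically and reports per-variant success rates only once a variant has reached the 50-sample minimum from the table. `assign_variant` and `success_rates` are hypothetical helpers, the result data is made up, and a plain rate comparison is not a significance test.

```python
import hashlib

def assign_variant(user_id: str, variants: tuple[str, ...] = ("A", "B")) -> str:
    """Deterministically map a user to a variant so repeat requests see the same prompt."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return variants[digest[0] % len(variants)]

def success_rates(results: list[tuple[str, bool]], min_samples: int = 50) -> dict[str, float]:
    """Per-variant success rate; variants with fewer than min_samples results are omitted."""
    totals: dict[str, list[int]] = {}
    for variant, ok in results:
        wins_and_count = totals.setdefault(variant, [0, 0])
        wins_and_count[0] += int(ok)
        wins_and_count[1] += 1
    return {v: wins / n for v, (wins, n) in totals.items() if n >= min_samples}

# Made-up outcomes: (variant, whether the reply was judged successful)
results = (
    [("A", True)] * 40 + [("A", False)] * 10
    + [("B", True)] * 45 + [("B", False)] * 5
    + [("C", True)] * 3  # too few samples to report
)

print(success_rates(results))  # prints {'A': 0.8, 'B': 0.9}
```

Hashing the user id rather than sampling randomly keeps a user's experience stable across requests and makes the assignment reproducible from trace metadata.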

## Anti-Patterns (FORBIDDEN)

```python
# --- Anti-patterns ---

# NEVER hardcode prompts without versioning
PROMPT = "You are a helpful assistant..."  # No version control!

# NEVER use a single example for few-shot
examples = [{"input": "x", "output": "y"}]  # Too few!

# NEVER skip CoT for complex reasoning (`llm` stands for any LLM client)
response = llm.complete("Solve: 15% of 240")  # No reasoning!

# --- Preferred ---

# ALWAYS version prompts
prompt = langfuse.get_prompt("assistant", label="production")

# ALWAYS use 3-5 diverse examples (ex1..ex5 are placeholders)
examples = [ex1, ex2, ex3, ex4, ex5]

# ALWAYS use CoT for math/logic
response = llm.complete("Solve: 15% of 240. Think step by step.")
```

## Detailed Documentation

| Resource | Description |
|----------|-------------|
| [references/chain-of-thought.md](references/chain-of-thought.md) | CoT patterns, zero-shot CoT, self-consistency |
| [references/few-shot-patterns.md](references/few-shot-patterns.md) | Example selection, ordering, formatting |
| [references/prompt-versioning.md](references/prompt-versioning.md) | Langfuse integration, A/B testing |
| [references/prompt-optimization.md](references/prompt-optimization.md) | DSPy, automatic tuning, evaluation |
| [scripts/cot-template.py](scripts/cot-template.py) | Full Chain-of-Thought implementation |
| [scripts/few-shot-template.py](scripts/few-shot-template.py) | Few-shot with dynamic example selection |
| [scripts/jinja2-prompts.py](scripts/jinja2-prompts.py) | Jinja2 templates (2026): async, caching, LLM filters, Anthropic format |

## Related Skills

- `langfuse-observability` - Prompt management and A/B testing tracking
- `llm-evaluation` - Evaluating prompt effectiveness
- `function-calling` - Structured output patterns
- `llm-testing` - Testing prompt variations

## Capability Details

### chain-of-thought
**Keywords:** CoT, step by step, reasoning, think, chain of thought
**Solves:**
- Improve accuracy on complex reasoning tasks
- Debug LLM reasoning process
- Implement self-consistency with multiple CoT paths

### few-shot-learning
**Keywords:** few-shot, examples, in-context learning, demonstrations
**Solves:**
- Format LLM output with examples
- Handle complex tasks without fine-tuning
- Select optimal examples for task

### prompt-versioning
**Keywords:** version, prompt management, A/B test, production prompt
**Solves:**
- Manage prompts in production
- A/B test prompt variations
- Roll back to previous versions

### prompt-optimization
**Keywords:** DSPy, optimize, tune, automatic prompt, OPRO
**Solves:**
- Automatically optimize prompts
- Find best few-shot examples
- Improve accuracy without manual tuning

### zero-shot-cot
**Keywords:** zero-shot CoT, think step by step, reasoning trigger
**Solves:**
- Quick reasoning boost without examples
- Add "Let's think step by step" trigger
- Improve accuracy on math/logic

### self-consistency
**Keywords:** self-consistency, multiple paths, voting, ensemble
**Solves:**
- Generate multiple reasoning paths
- Vote on most common answer
- Improve reliability on hard problems
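
The voting step of self-consistency can be sketched in a few lines. The code below assumes the final answers have already been extracted from several sampled reasoning paths; `majority_vote` is a hypothetical helper and the sample answers are made up.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most common normalized answer and the share of paths that agree with it."""
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)

# Final answers from five sampled reasoning paths (made-up values)
paths = ["36", "36", "3.6", "36 ", "360"]
answer, agreement = majority_vote(paths)
print(answer, agreement)  # prints "36 0.6"
```

The agreement share is a rough confidence signal: a low value suggests sampling more paths or escalating the problem.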

Overview

This skill is a comprehensive prompt engineering suite for designing, versioning, and optimizing prompts in production LLM applications. It bundles Chain-of-Thought patterns, few-shot learning utilities, prompt versioning workflows, and automatic optimization tools to improve accuracy and maintainability. Use it to build repeatable, auditable prompt pipelines and accelerate iteration from prototype to production.

How this skill works

The suite provides reusable patterns and templates: Chain-of-Thought templates for stepwise reasoning, few-shot builders with dynamic example selection, and integrations for prompt versioning via Langfuse-style labels. It also supports automatic prompt optimization using DSPy-style optimizers (MIPROv2, BootstrapFewShot, GEPA) to search for high-performing prompt variants. Telemetry and A/B testing hooks enable comparisons and rollbacks in production.

When to use it

- Designing prompts for new LLM-driven features that require reliability and traceability
- Improving accuracy on complex reasoning, math, or logic tasks using Chain-of-Thought
- Formatting or style-sensitive tasks where few-shot examples guide output
- Managing prompt lifecycle in production with versioning and A/B testing
- Automating prompt tuning when manual iteration is slow or costly

Best practices

- Use 3–5 diverse few-shot examples, ordering most similar examples last to leverage recency bias
- Apply Chain-of-Thought templates for multi-step reasoning and verify answers before finalizing
- Version every production prompt and label environments (production/staging/canary)
- Run A/B tests with 50+ samples and capture trace metadata for reliable comparisons
- Prefer automatic optimizers (MIPROv2 or BootstrapFewShot) to explore prompt space before manual tweaks

Example use cases

- Customer support: craft versioned prompts for issue summarization and triage with A/B testing
- Education: use CoT to generate step-by-step solutions and self-consistency voting for accuracy
- Data extraction: structured prompts that output JSON/schema-validated responses
- Agent orchestration: couple ReAct-style prompts with tools for multi-step API workflows
- Prompt tuning: run DSPy optimizers to select best few-shot examples and hyperparameters

FAQ

When should I prefer few-shot over zero-shot?

Use few-shot when the task requires specific formatting or style alignment, or when examples improve accuracy; use zero-shot for simple, well-defined tasks.

How many examples should I include for few-shot?

Include 3–5 diverse, representative examples and order them so the most similar examples appear last.