
promptfoo-evaluation skill

/promptfoo-evaluation

This skill guides you through configuring and running LLM evaluations with Promptfoo, covering prompt testing, rubric-based judging, and custom assertions.

npx playbooks add skill daymade/claude-code-skills --skill promptfoo-evaluation

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
9.6 KB
---
name: promptfoo-evaluation
description: Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".
---

# Promptfoo Evaluation

## Overview

This skill provides guidance for configuring and running LLM evaluations using [Promptfoo](https://www.promptfoo.dev/), an open-source CLI tool for testing and comparing LLM outputs.

## Quick Start

```bash
# Initialize a new evaluation project
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval

# View results in browser
npx promptfoo@latest view
```

## Configuration Structure

A typical Promptfoo project structure:

```
project/
├── promptfooconfig.yaml    # Main configuration
├── prompts/
│   ├── system.md           # System prompt
│   └── chat.json           # Chat format prompt
├── tests/
│   └── cases.yaml          # Test cases
└── scripts/
    └── metrics.py          # Custom Python assertions
```

## Core Configuration (promptfooconfig.yaml)

```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "My LLM Evaluation"

# Prompts to test
prompts:
  - file://prompts/system.md
  - file://prompts/chat.json

# Models to compare
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    label: Claude-4.5-Sonnet
  - id: openai:gpt-4.1
    label: GPT-4.1

# Test cases
tests: file://tests/cases.yaml

# Default assertions for all tests
defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:custom_assert
    - type: llm-rubric
      value: |
        Evaluate the response quality on a 0-1 scale.
      threshold: 0.7

# Output path
outputPath: results/eval-results.json
```

## Prompt Formats

### Text Prompt (system.md)

```markdown
You are a helpful assistant.

Task: {{task}}
Context: {{context}}
```

### Chat Format (chat.json)

```json
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "{{user_input}}"}
]
```

### Few-Shot Pattern

Embed examples directly in the prompt, or use the chat format with assistant messages:

```json
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "Example input: {{example_input}}"},
  {"role": "assistant", "content": "{{example_output}}"},
  {"role": "user", "content": "Now process: {{actual_input}}"}
]
```

## Test Cases (tests/cases.yaml)

```yaml
- description: "Test case 1"
  vars:
    system_prompt: file://prompts/system.md
    user_input: "Hello world"
    # Load content from files
    context: file://data/context.txt
  assert:
    - type: contains
      value: "expected text"
    - type: python
      value: file://scripts/metrics.py:custom_check
      threshold: 0.8
```

## Python Custom Assertions

Create a Python file for custom assertions (e.g., `scripts/metrics.py`):

```python
def get_assert(output: str, context: dict) -> dict:
    """Default assertion function."""
    vars_dict = context.get('vars', {})

    # Access test variables
    expected = vars_dict.get('expected', '')

    # Return result
    return {
        "pass": expected in output,
        "score": 0.8,
        "reason": "Contains expected content",
        "named_scores": {"relevance": 0.9}
    }

def custom_check(output: str, context: dict) -> dict:
    """Custom named assertion."""
    word_count = len(output.split())
    passed = 100 <= word_count <= 500

    return {
        "pass": passed,
        "score": min(1.0, word_count / 300),
        "reason": f"Word count: {word_count}"
    }
```

**Key points:**
- Default function name is `get_assert`
- Specify function with `file://path.py:function_name`
- Return `bool`, `float` (score), or `dict` with pass/score/reason (see the sketch after this list)
- Access variables via `context['vars']`
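
For the simpler return forms (a bare `bool` or `float`), a minimal sketch is shown below; the function names and the `expected` test variable are illustrative, and each function would be referenced as `file://scripts/metrics.py:<function_name>`:

```python
def exact_match(output: str, context: dict) -> bool:
    """Pass/fail only: compare the output to the test's 'expected' variable."""
    expected = context["vars"].get("expected", "")
    return output.strip() == expected.strip()


def brevity_score(output: str, context: dict) -> float:
    """Return a 0-1 score: shorter outputs score higher, bottoming out at 200 words."""
    word_count = len(output.split())
    return max(0.0, 1.0 - word_count / 200)
```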

## LLM-as-Judge (llm-rubric)

```yaml
assert:
  - type: llm-rubric
    value: |
      Evaluate the response based on:
      1. Accuracy of information
      2. Clarity of explanation
      3. Completeness

      Score 0.0-1.0 where 0.7+ is passing.
    threshold: 0.7
    provider: openai:gpt-4.1  # Optional: override grader model
```

**Best practices:**
- Provide clear scoring criteria
- Use `threshold` to set minimum passing score
- The default grader is picked from available API keys (OpenAI → Anthropic → Google); see the sketch after this list for pinning it explicitly
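
If relying on key fallback is too implicit, the grader can also be pinned once for all model-graded assertions. A minimal sketch, assuming promptfoo's `options.provider` override on `defaultTest` (the model ID is illustrative):

```yaml
# Sketch: pin one grading model for every llm-rubric assertion.
# Assumes promptfoo's defaultTest.options.provider override.
defaultTest:
  options:
    provider: openai:gpt-4.1   # used for grading, separate from the providers under test
```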

## Common Assertion Types

| Type | Usage | Example |
|------|-------|---------|
| `contains` | Check substring | `value: "hello"` |
| `icontains` | Case-insensitive | `value: "HELLO"` |
| `equals` | Exact match | `value: "42"` |
| `regex` | Pattern match | `value: "\\d{4}"` |
| `python` | Custom logic | `value: file://script.py` |
| `llm-rubric` | LLM grading | `value: "Is professional"` |
| `latency` | Response time | `threshold: 1000` |
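
A single test can combine several of these types. The sketch below uses only assertion types from the table above; the values and the script path are illustrative:

```yaml
# Illustrative combination of assertion types from the table above.
assert:
  - type: icontains
    value: "hello"
  - type: regex
    value: "\\d{4}"            # e.g., require a four-digit number
  - type: latency
    threshold: 1000            # milliseconds
  - type: python
    value: file://scripts/metrics.py:custom_check
```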

## File References

All `file://` paths are resolved relative to the location of the config file:

```yaml
# Load file content as variable
vars:
  content: file://data/input.txt

# Load prompt from file
prompts:
  - file://prompts/main.md

# Load test cases from file
tests: file://tests/cases.yaml

# Load Python assertion
assert:
  - type: python
    value: file://scripts/check.py:validate
```

## Running Evaluations

```bash
# Basic run
npx promptfoo@latest eval

# With specific config
npx promptfoo@latest eval --config path/to/config.yaml

# Output to file
npx promptfoo@latest eval --output results.json

# Filter tests by metadata (see the tagged test-case sketch after this block)
npx promptfoo@latest eval --filter-metadata category=math

# View results
npx promptfoo@latest view
```
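
The `--filter-metadata` flag matches against tags on test cases; a minimal sketch assuming promptfoo's `metadata` field on tests (the category value and test content are illustrative):

```yaml
# Sketch: tag a test case so `--filter-metadata category=math` selects it.
tests:
  - description: "Arithmetic check"
    metadata:
      category: math
    vars:
      user_input: "What is 12 * 7?"
    assert:
      - type: contains
        value: "84"
```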

## Troubleshooting

**Python not found:**
```bash
export PROMPTFOO_PYTHON=python3
```

**Large outputs truncated:**
Outputs over 30000 characters are truncated. Use `head_limit` in assertions.

**File not found errors:**
Ensure paths are relative to `promptfooconfig.yaml` location.

## Echo Provider (Preview Mode)

Use the **echo provider** to preview rendered prompts without making API calls:

```yaml
# promptfooconfig-preview.yaml
providers:
  - echo  # Returns prompt as output, no API calls

tests:
  - vars:
      input: "test content"
```

**Use cases:**
- Preview prompt rendering before expensive API calls
- Verify few-shot examples are loaded correctly
- Debug variable substitution issues
- Validate prompt structure

```bash
# Run preview mode
npx promptfoo@latest eval --config promptfooconfig-preview.yaml
```

**Cost:** Free; no API tokens are consumed.

## Advanced Few-Shot Implementation

### Multi-turn Conversation Pattern

For complex few-shot learning with full worked examples, structure the chat prompt as the system message, one or two user/assistant example turns (the second pair is optional), and then the actual task. Strict JSON does not allow comments, so keep the prompt file free of annotations:

```json
[
  {"role": "system", "content": "{{system_prompt}}"},

  {"role": "user", "content": "Task: {{example_input_1}}"},
  {"role": "assistant", "content": "{{example_output_1}}"},

  {"role": "user", "content": "Task: {{example_input_2}}"},
  {"role": "assistant", "content": "{{example_output_2}}"},

  {"role": "user", "content": "Task: {{actual_input}}"}
]
```

**Test case configuration:**

```yaml
tests:
  - vars:
      system_prompt: file://prompts/system.md
      # Few-shot examples
      example_input_1: file://data/examples/input1.txt
      example_output_1: file://data/examples/output1.txt
      example_input_2: file://data/examples/input2.txt
      example_output_2: file://data/examples/output2.txt
      # Actual test
      actual_input: file://data/test1.txt
```

**Best practices:**
- Use 1-3 few-shot examples (more may dilute effectiveness)
- Ensure examples match the task format exactly
- Load examples from files for better maintainability
- Use the echo provider first to verify structure (a preview config sketch follows this list)
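
A minimal preview config for verifying the few-shot structure above without API calls; the prompt file name is illustrative:

```yaml
# promptfooconfig-preview.yaml (sketch): render the few-shot prompt without API calls.
prompts:
  - file://prompts/fewshot-chat.json   # illustrative name for the chat prompt above

providers:
  - echo   # returns the rendered prompt as the output

tests: file://tests/cases.yaml
```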

## Long Text Handling

For Chinese or other long-form content evaluations (10,000+ characters):

**Configuration:**

```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    config:
      max_tokens: 8192  # Increase for long outputs

defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:check_length
```

**Python assertion for text metrics** (a matching test-case sketch follows the code):

```python
import re

def strip_tags(text: str) -> str:
    """Remove HTML tags for pure text."""
    return re.sub(r'<[^>]+>', '', text)

def check_length(output: str, context: dict) -> dict:
    """Check output length constraints."""
    raw_input = context['vars'].get('raw_input', '')

    input_len = len(strip_tags(raw_input))
    output_len = len(strip_tags(output))

    reduction_ratio = 1 - (output_len / input_len) if input_len > 0 else 0

    return {
        "pass": 0.7 <= reduction_ratio <= 0.9,
        "score": reduction_ratio,
        "reason": f"Reduction: {reduction_ratio:.1%} (target: 70-90%)",
        "named_scores": {
            "input_length": input_len,
            "output_length": output_len,
            "reduction_ratio": reduction_ratio
        }
    }
```
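
Since `check_length` reads a `raw_input` variable from `context['vars']`, the corresponding test case has to supply it. A minimal sketch; the data file names and the `user_input` variable are illustrative:

```yaml
# Sketch: test case supplying the raw_input variable that check_length reads.
tests:
  - vars:
      raw_input: file://data/transcript1.txt    # original long-form text, for the reduction ratio
      user_input: file://data/transcript1.txt   # illustrative prompt variable
    assert:
      - type: python
        value: file://scripts/metrics.py:check_length
```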

## Real-World Example

**Project:** Chinese short-video content curation from long transcripts

**Structure:**
```
tiaogaoren/
├── promptfooconfig.yaml          # Production config
├── promptfooconfig-preview.yaml  # Preview config (echo provider)
├── prompts/
│   ├── tiaogaoren-prompt.json   # Chat format with few-shot
│   └── v4/system-v4.md          # System prompt
├── tests/cases.yaml              # 3 test samples
├── scripts/metrics.py            # Custom metrics (reduction ratio, etc.)
├── data/                         # 5 samples (2 few-shot, 3 eval)
└── results/
```

**See:** `/Users/tiansheng/Workspace/prompts/tiaogaoren/` for full implementation.

## Resources

For detailed API reference and advanced patterns, see [references/promptfoo_api.md](references/promptfoo_api.md).

Overview

This skill configures and runs LLM evaluation workflows using the Promptfoo framework. It helps you create prompt testing projects, define evaluation configs, and compare model outputs with custom assertions and LLM-as-judge rubrics. Use it to standardize prompt quality checks and automate model comparisons across providers.

How this skill works

The skill scaffolds a Promptfoo project with a promptfooconfig.yaml that lists prompts, providers, tests, and default assertions. It supports text and chat prompt formats, file-based test variables, Python custom assertions, and llm-rubric grading. Run evaluations via the Promptfoo CLI (init, eval, view) and optionally preview rendered prompts with the echo provider to avoid API calls.

When to use it

  • Setting up repeatable prompt tests and regression checks for an LLM-powered feature
  • Comparing output quality across models or provider configurations
  • Creating custom assertion logic using Python for task-specific metrics
  • Using LLM-as-judge rubrics to score subjective qualities like clarity or completeness
  • Validating few-shot example loading and prompt rendering before production runs

Best practices

  • Keep 1–3 few-shot examples that closely match the target task format
  • Use file:// paths relative to the config file for prompts, examples, and tests
  • Start with the echo provider to preview prompt rendering before spending API credits
  • Define clear rubric criteria and thresholds when using llm-rubric to ensure consistent grading
  • Implement concise Python assertions that return a dict with pass/score/reason for richer diagnostics

Example use cases

  • Automated regression tests for prompt changes during CI runs
  • Benchmarking GPT, Claude, and other models on a standardized test set
  • Creating task-specific scoring functions (e.g., length reduction, factuality checks) in Python
  • Using llm-rubric to rate answer clarity and accuracy across student responses
  • Previewing prompt rendering and variable substitution with the echo provider

FAQ

How do I run a quick preview without hitting APIs?

Use the echo provider in a preview config and run promptfoo eval; it returns rendered prompts without API calls.

How should Python assertion functions return results?

Return a bool, a float score, or a dict with pass, score, reason, and optional named_scores for full diagnostics.