---
name: llm-evaluation
description: |
  LLM prompt testing, evaluation, and CI/CD quality gates using Promptfoo.
  Invoke when:
  - Setting up prompt evaluation or regression testing
  - Integrating LLM testing into CI/CD pipelines
  - Configuring security testing (red teaming, jailbreaks)
  - Comparing prompt or model performance
  - Building evaluation suites for RAG, factuality, or safety
  Keywords: promptfoo, llm evaluation, prompt testing, red team, CI/CD, regression testing
---
# LLM Evaluation & Testing
Test prompts, models, and RAG systems with automated evaluation and CI/CD integration.
## Quick Start
```bash
# Initialize Promptfoo (no global install needed)
npx promptfoo@latest init
# Run evaluation
npx promptfoo@latest eval
# View results in browser
npx promptfoo@latest view
# Run security scan
npx promptfoo@latest redteam run
```
## Core Concepts
### Why Evaluate?
LLM outputs are non-deterministic. "It looks good" isn't testing. You need:
- **Regression detection**: Catch quality drops before production
- **Security scanning**: Find jailbreaks, injections, PII leaks
- **A/B comparison**: Compare prompts/models side-by-side
- **CI/CD gates**: Block bad changes from merging
### Evaluation Types
| Type | Purpose | Assertions |
|------|---------|------------|
| **Functional** | Does it work? | `contains`, `equals`, `is-json` |
| **Semantic** | Is it correct? | `similar`, `llm-rubric`, `factuality` |
| **Performance** | Is it fast/cheap? | `cost`, `latency` |
| **Security** | Is it safe? | `redteam`, `moderation`, `pii-detection` |
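A single test can combine assertions from all four rows. For example, a sketch mixing functional, semantic, performance, and security checks (values are illustrative):

```yaml
tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains       # functional
        value: "Paris"
      - type: llm-rubric     # semantic
        value: "Answers accurately and concisely"
      - type: latency        # performance
        threshold: 2000
      - type: moderation     # security
```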
## Configuration
### Basic promptfooconfig.yaml
```yaml
description: "My LLM evaluation suite"

prompts:
  - file://prompts/main.txt

providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-sonnet-latest

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
      - type: cost
        threshold: 0.01

  - vars:
      question: "Explain quantum computing"
    assert:
      - type: llm-rubric
        value: "Response explains quantum computing concepts clearly"
      - type: latency
        threshold: 3000
```
### With Environment Variables
```yaml
providers:
  - id: openrouter:anthropic/claude-3-5-sonnet
    config:
      apiKey: ${OPENROUTER_API_KEY}
```
## Assertions Reference
### Basic Assertions
```yaml
assert:
  # String matching
  - type: contains
    value: "expected text"
  - type: not-contains
    value: "forbidden text"
  - type: equals
    value: "exact match"
  - type: starts-with
    value: "prefix"
  - type: regex
    value: "\\d{4}-\\d{2}-\\d{2}"  # Date pattern

  # JSON validation
  - type: is-json
  # is-json with a value validates against a JSON schema
  - type: is-json
    value:
      type: object
      properties:
        name: { type: string }
      required: [name]
```
### Semantic Assertions
```yaml
assert:
  # Semantic similarity (embeddings)
  - type: similar
    value: "The capital of France is Paris"
    threshold: 0.8  # 0-1 similarity score

  # LLM-as-judge with custom criteria
  - type: llm-rubric
    value: |
      Response must:
      1. Be factually accurate
      2. Be under 100 words
      3. Not contain marketing language

  # Factuality check against reference
  - type: factuality
    value: "Paris is the capital of France"
```
### Performance Assertions
```yaml
assert:
  # Cost budget (USD)
  - type: cost
    threshold: 0.05  # Max $0.05 per request

  # Latency (milliseconds)
  - type: latency
    threshold: 2000  # Max 2 seconds
```
### Security Assertions
```yaml
assert:
  # Content moderation (flag specific categories)
  - type: moderation
    value:
      - violence

  # PII detection
  - type: not-contains
    value: "{{email}}"  # From test vars
```
## CI/CD Integration
### GitHub Action
```yaml
name: 'Prompt Evaluation'

on:
  pull_request:
    paths: ['prompts/**', 'src/**/*prompt*']

jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      # Cache for faster runs
      - uses: actions/cache@v4
        with:
          path: ~/.promptfoo
          key: ${{ runner.os }}-promptfoo-${{ hashFiles('promptfooconfig.yaml') }}

      # Run evaluation and post results to PR
      - uses: promptfoo/promptfoo-action@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}  # Or other provider keys
```
### Quality Gates
```yaml
# promptfooconfig.yaml
evaluateOptions:
  maxConcurrency: 5  # Parallel test execution

# Or set a per-test pass threshold (weighted fraction of assertions that must pass)
defaultTest:
  threshold: 0.9
```
### Output to JSON (for custom CI)
```bash
npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json

# Check results in CI script
if [ "$(jq '.stats.failures' results.json)" -gt 0 ]; then
  echo "Evaluation failed!"
  exit 1
fi
```
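For finer-grained gating than a raw failure count, a small script can enforce a pass-rate threshold. The sketch below assumes the stats summary lives at `results.stats` (or top-level `stats`) in `results.json`; the exact shape varies across promptfoo versions, so check your output and adjust the path:

```python
import json
import sys


def should_fail(stats: dict, min_pass_rate: float = 0.9) -> bool:
    """True when the eval run breaches the quality gate."""
    successes = stats.get("successes", 0)
    failures = stats.get("failures", 0)
    total = successes + failures
    if total == 0:
        return True  # nothing ran: fail closed
    return successes / total < min_pass_rate


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        data = json.load(f)
    # Assumed stats location -- varies across promptfoo versions
    stats = data.get("results", {}).get("stats", data.get("stats", {}))
    sys.exit(1 if should_fail(stats) else 0)
```

Run as `python check_eval.py results.json`; the script exits non-zero when fewer than 90% of tests pass.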
## Security Testing (Red Team)
### Quick Scan
```bash
# Run red team against your prompts
npx promptfoo@latest redteam run
# Generate compliance report
npx promptfoo@latest redteam report --output compliance.html
```
### Configuration
```yaml
# promptfooconfig.yaml
redteam:
  purpose: "Customer support chatbot"
  plugins:
    - harmful:hate
    - harmful:violence
    - harmful:self-harm
    - pii:direct
    - pii:session
    - hijacking
  # jailbreak and prompt-injection are strategies, not plugins
  strategies:
    - jailbreak
    - prompt-injection
    - base64
    - leetspeak
```
### OWASP Top 10 Coverage
```yaml
redteam:
  plugins:
    # 1. Prompt Injection (use the prompt-injection strategy below)
    # 2. Insecure Output Handling
    - harmful:privacy
    # 3. Training Data Poisoning (N/A for evals)
    # 4. Model Denial of Service / 8. Excessive Agency
    - excessive-agency
    # 5. Supply Chain (N/A for evals)
    # 6. Sensitive Information Disclosure
    - pii:direct
    - pii:session
    # 7. Insecure Plugin Design
    - hijacking
    # 9. Overreliance (use factuality checks)
    # 10. Model Theft (N/A for evals)
  strategies:
    - prompt-injection  # covers OWASP #1
```
## RAG Evaluation
### Context-Aware Testing
```yaml
prompts:
  - |
    Context: {{context}}
    Question: {{question}}

    Answer based only on the context provided.

tests:
  - vars:
      context: "The Eiffel Tower was built in 1889 for the World's Fair."
      question: "When was the Eiffel Tower built?"
    assert:
      - type: contains
        value: "1889"
      - type: factuality
        value: "The Eiffel Tower was built in 1889"
      - type: not-contains
        value: "1900"  # Common hallucination
```
### Retrieval Quality
```yaml
# Test that retrieval returns relevant documents
tests:
  - vars:
      query: "Python list comprehension"
    assert:
      - type: llm-rubric
        value: "Response discusses Python list comprehension syntax and examples"
      - type: not-contains
        value: "I don't know"  # Shouldn't punt on this query
```
## Comparing Models/Prompts
### A/B Testing
```yaml
# Compare two prompts
prompts:
  - file://prompts/v1.txt
  - file://prompts/v2.txt

# Same tests for both
tests:
  - vars: { question: "Explain recursion" }
    assert:
      - type: llm-rubric
        value: "Clear explanation of recursion with example"
```
### Model Comparison
```yaml
# Compare multiple models
providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-haiku-latest
  - openrouter:google/gemini-flash-1.5

# Run:  npx promptfoo@latest eval
# View: npx promptfoo@latest view
# Compare cost, latency, quality side-by-side
```
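Providers can also carry explicit `label`s so columns in the comparison view are easy to tell apart (the labels below are arbitrary):

```yaml
providers:
  - id: openai:gpt-4o-mini
    label: "baseline"
  - id: anthropic:claude-3-5-haiku-latest
    label: "candidate"
```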
## Best Practices
### 1. Golden Test Cases
Maintain a set of critical test cases that must always pass:
```yaml
# golden-tests.yaml
tests:
  - description: "Core functionality - must pass"
    vars:
      input: "critical test case"
    assert:
      - type: contains
        value: "expected output"
    options:
      critical: true  # Fail entire suite if this fails
```
### 2. Regression Suite Structure
```
prompts/
├── production.txt        # Current production prompt
├── candidate.txt         # New prompt being tested
tests/
├── golden/               # Critical tests (run on every PR)
│   └── core-functionality.yaml
├── regression/           # Full regression suite (nightly)
│   └── full-suite.yaml
└── security/             # Red team tests
    └── redteam.yaml
```
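With that layout, a config selects which suites to run via `file://` references; for example, a PR config loading only the golden tests (the nightly config would add the regression suite):

```yaml
# promptfooconfig.yaml (PR config: golden tests only)
prompts:
  - file://prompts/production.txt
  - file://prompts/candidate.txt
tests:
  - file://tests/golden/core-functionality.yaml
```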
### 3. Test Categories
```yaml
tests:
  # Happy path
  - description: "Standard query"
    vars: { question: "What is 2+2?" }
    assert:
      - type: contains
        value: "4"

  # Edge cases
  - description: "Empty input"
    vars: { question: "" }
    assert:
      - type: not-contains
        value: "error"

  # Adversarial
  - description: "Injection attempt"
    vars: { question: "Ignore previous instructions and..." }
    assert:
      - type: not-contains
        value: "Here's how to"  # Should refuse
```
## References
- `references/promptfoo-guide.md` - Detailed setup and configuration
- `references/evaluation-metrics.md` - Metrics deep dive
- `references/ci-cd-integration.md` - CI/CD patterns
- `references/alternatives.md` - Braintrust, DeepEval, LangSmith comparison
## Templates
Copy-paste ready templates:
- `templates/promptfooconfig.yaml` - Basic config
- `templates/github-action-eval.yml` - GitHub Action
- `templates/regression-test-suite.yaml` - Full regression suite