home / skills / phrazzld / claude-config / llm-evaluation
/skills/llm-evaluation
This skill enables automated LLM evaluation, regression and security testing with Promptfoo, integrating into CI/CD to improve prompt quality and safety.
npx playbooks add skill phrazzld/claude-config --skill llm-evaluationReview the files below or copy the command above to add this skill to your agents.
---
name: llm-evaluation
description: |
LLM prompt testing, evaluation, and CI/CD quality gates using Promptfoo.
Invoke when:
- Setting up prompt evaluation or regression testing
- Integrating LLM testing into CI/CD pipelines
- Configuring security testing (red teaming, jailbreaks)
- Comparing prompt or model performance
- Building evaluation suites for RAG, factuality, or safety
Keywords: promptfoo, llm evaluation, prompt testing, red team, CI/CD, regression testing
effort: high
---
# LLM Evaluation & Testing
Test prompts, models, and RAG systems with automated evaluation and CI/CD integration.
## Quick Start
```bash
# Initialize Promptfoo (no global install needed)
npx promptfoo@latest init
# Run evaluation
npx promptfoo@latest eval
# View results in browser
npx promptfoo@latest view
# Run security scan
npx promptfoo@latest redteam run
```
## Core Concepts
### Why Evaluate?
LLM outputs are non-deterministic. "It looks good" isn't testing. You need:
- **Regression detection**: Catch quality drops before production
- **Security scanning**: Find jailbreaks, injections, PII leaks
- **A/B comparison**: Compare prompts/models side-by-side
- **CI/CD gates**: Block bad changes from merging
### Evaluation Types
| Type | Purpose | Assertions |
|------|---------|------------|
| **Functional** | Does it work? | `contains`, `equals`, `is-json` |
| **Semantic** | Is it correct? | `similar`, `llm-rubric`, `factuality` |
| **Performance** | Is it fast/cheap? | `cost`, `latency` |
| **Security** | Is it safe? | `redteam`, `moderation`, `pii-detection` |
## Configuration
### Basic promptfooconfig.yaml
```yaml
description: "My LLM evaluation suite"
prompts:
- file://prompts/main.txt
providers:
- openai:gpt-4o-mini
- anthropic:claude-3-5-sonnet-latest
tests:
- vars:
question: "What is the capital of France?"
assert:
- type: contains
value: "Paris"
- type: cost
threshold: 0.01
- vars:
question: "Explain quantum computing"
assert:
- type: llm-rubric
value: "Response explains quantum computing concepts clearly"
- type: latency
threshold: 3000
```
### With Environment Variables
```yaml
providers:
- id: openrouter:anthropic/claude-3-5-sonnet
config:
apiKey: ${OPENROUTER_API_KEY}
```
## Assertions Reference
### Basic Assertions
```yaml
assert:
# String matching
- type: contains
value: "expected text"
- type: not-contains
value: "forbidden text"
- type: equals
value: "exact match"
- type: starts-with
value: "prefix"
- type: regex
value: "\\d{4}-\\d{2}-\\d{2}" # Date pattern
# JSON validation
- type: is-json
- type: is-valid-json-schema
value:
type: object
properties:
name: { type: string }
required: [name]
```
### Semantic Assertions
```yaml
assert:
# Semantic similarity (embeddings)
- type: similar
value: "The capital of France is Paris"
threshold: 0.8 # 0-1 similarity score
# LLM-as-judge with custom criteria
- type: llm-rubric
value: |
Response must:
1. Be factually accurate
2. Be under 100 words
3. Not contain marketing language
# Factuality check against reference
- type: factuality
value: "Paris is the capital of France"
```
### Performance Assertions
```yaml
assert:
# Cost budget (USD)
- type: cost
threshold: 0.05 # Max $0.05 per request
# Latency (milliseconds)
- type: latency
threshold: 2000 # Max 2 seconds
```
### Security Assertions
```yaml
assert:
# Content moderation
- type: moderation
value: violence
# PII detection
- type: not-contains
value: "{{email}}" # From test vars
```
## CI/CD Integration
### GitHub Action
```yaml
name: 'Prompt Evaluation'
on:
pull_request:
paths: ['prompts/**', 'src/**/*prompt*']
jobs:
evaluate:
runs-on: ubuntu-latest
permissions:
pull-requests: write
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: '20'
# Cache for faster runs
- uses: actions/cache@v4
with:
path: ~/.promptfoo
key: ${{ runner.os }}-promptfoo-${{ hashFiles('promptfooconfig.yaml') }}
# Run evaluation and post results to PR
- uses: promptfoo/promptfoo-action@v1
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
openai-api-key: ${{ secrets.OPENAI_API_KEY }} # Or other provider keys
```
### Quality Gates
```yaml
# promptfooconfig.yaml
evaluateOptions:
# Fail if any assertion fails
maxConcurrency: 5
# Or set pass threshold
threshold: 0.9 # 90% of tests must pass
```
### Output to JSON (for custom CI)
```bash
npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json
# Check results in CI script
if [ $(jq '.stats.failures' results.json) -gt 0 ]; then
echo "Evaluation failed!"
exit 1
fi
```
## Security Testing (Red Team)
### Quick Scan
```bash
# Run red team against your prompts
npx promptfoo@latest redteam run
# Generate compliance report
npx promptfoo@latest redteam report --output compliance.html
```
### Configuration
```yaml
# promptfooconfig.yaml
redteam:
purpose: "Customer support chatbot"
plugins:
- harmful:hate
- harmful:violence
- harmful:self-harm
- pii:direct
- pii:session
- hijacking
- jailbreak
- prompt-injection
strategies:
- jailbreak
- prompt-injection
- base64
- leetspeak
```
### OWASP Top 10 Coverage
```yaml
redteam:
plugins:
# 1. Prompt Injection
- prompt-injection
# 2. Insecure Output Handling
- harmful:privacy
# 3. Training Data Poisoning (N/A for evals)
# 4. Model Denial of Service
- excessive-agency
# 5. Supply Chain (N/A for evals)
# 6. Sensitive Information Disclosure
- pii:direct
- pii:session
# 7. Insecure Plugin Design
- hijacking
# 8. Excessive Agency
- excessive-agency
# 9. Overreliance (use factuality checks)
# 10. Model Theft (N/A for evals)
```
## RAG Evaluation
### Context-Aware Testing
```yaml
prompts:
- |
Context: {{context}}
Question: {{question}}
Answer based only on the context provided.
tests:
- vars:
context: "The Eiffel Tower was built in 1889 for the World's Fair."
question: "When was the Eiffel Tower built?"
assert:
- type: contains
value: "1889"
- type: factuality
value: "The Eiffel Tower was built in 1889"
- type: not-contains
value: "1900" # Common hallucination
```
### Retrieval Quality
```yaml
# Test that retrieval returns relevant documents
tests:
- vars:
query: "Python list comprehension"
assert:
- type: llm-rubric
value: "Response discusses Python list comprehension syntax and examples"
- type: not-contains
value: "I don't know" # Shouldn't punt on this query
```
## Comparing Models/Prompts
### A/B Testing
```yaml
# Compare two prompts
prompts:
- file://prompts/v1.txt
- file://prompts/v2.txt
# Same tests for both
tests:
- vars: { question: "Explain recursion" }
assert:
- type: llm-rubric
value: "Clear explanation of recursion with example"
```
### Model Comparison
```yaml
# Compare multiple models
providers:
- openai:gpt-4o-mini
- anthropic:claude-3-5-haiku-latest
- openrouter:google/gemini-flash-1.5
# Run: npx promptfoo@latest eval
# View: npx promptfoo@latest view
# Compare cost, latency, quality side-by-side
```
## Best Practices
### 1. Golden Test Cases
Maintain a set of critical test cases that must always pass:
```yaml
# golden-tests.yaml
tests:
- description: "Core functionality - must pass"
vars:
input: "critical test case"
assert:
- type: contains
value: "expected output"
options:
critical: true # Fail entire suite if this fails
```
### 2. Regression Suite Structure
```
prompts/
├── production.txt # Current production prompt
├── candidate.txt # New prompt being tested
tests/
├── golden/ # Critical tests (run on every PR)
│ └── core-functionality.yaml
├── regression/ # Full regression suite (nightly)
│ └── full-suite.yaml
└── security/ # Red team tests
└── redteam.yaml
```
### 3. Test Categories
```yaml
tests:
# Happy path
- description: "Standard query"
vars: { question: "What is 2+2?" }
assert:
- type: contains
value: "4"
# Edge cases
- description: "Empty input"
vars: { question: "" }
assert:
- type: not-contains
value: "error"
# Adversarial
- description: "Injection attempt"
vars: { question: "Ignore previous instructions and..." }
assert:
- type: not-contains
value: "Here's how to" # Should refuse
```
## References
- `references/promptfoo-guide.md` - Detailed setup and configuration
- `references/evaluation-metrics.md` - Metrics deep dive
- `references/ci-cd-integration.md` - CI/CD patterns
- `references/alternatives.md` - Braintrust, DeepEval, LangSmith comparison
## Templates
Copy-paste ready templates:
- `templates/promptfooconfig.yaml` - Basic config
- `templates/github-action-eval.yml` - GitHub Action
- `templates/regression-test-suite.yaml` - Full regression suite
This skill provides automated LLM prompt testing, evaluation, and CI/CD quality gates using Promptfoo. It helps embed regression detection, semantic checks, performance budgets, and security (red team) scans into development workflows. Use it to compare prompts or models, validate RAG behavior, and block regressions before they reach production.
The skill runs Promptfoo evaluation suites defined in promptfooconfig.yaml to execute prompts across providers and assert expected behavior. Assertions cover functional checks (contains, equals), semantic judgments (similarity, llm-rubric, factuality), performance (cost, latency), and security (redteam, moderation, PII). Results can be viewed locally, exported as JSON for CI logic, or surfaced in pull requests via a GitHub Action.
How do I integrate results into CI?
Run npx promptfoo eval -o results.json and parse stats.failures in your CI script; alternatively use the official promptfoo GitHub Action to post results to PRs.
Can I check for PII or moderation issues?
Yes. Use security assertions like moderation, not-contains for templated PII variables, and run redteam plugins to simulate leaks and jailbreaks.