
llm-evaluation skill


This skill enables automated LLM evaluation, regression testing, and security testing with Promptfoo, integrating into CI/CD pipelines to improve prompt quality and safety.

npx playbooks add skill phrazzld/claude-config --skill llm-evaluation

---
name: llm-evaluation
description: |
  LLM prompt testing, evaluation, and CI/CD quality gates using Promptfoo.
  Invoke when:
  - Setting up prompt evaluation or regression testing
  - Integrating LLM testing into CI/CD pipelines
  - Configuring security testing (red teaming, jailbreaks)
  - Comparing prompt or model performance
  - Building evaluation suites for RAG, factuality, or safety
  Keywords: promptfoo, llm evaluation, prompt testing, red team, CI/CD, regression testing
effort: high
---

# LLM Evaluation & Testing

Test prompts, models, and RAG systems with automated evaluation and CI/CD integration.

## Quick Start

```bash
# Initialize Promptfoo (no global install needed)
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval

# View results in browser
npx promptfoo@latest view

# Run security scan
npx promptfoo@latest redteam run
```

## Core Concepts

### Why Evaluate?

LLM outputs are non-deterministic. "It looks good" isn't testing. You need:

- **Regression detection**: Catch quality drops before production
- **Security scanning**: Find jailbreaks, injections, PII leaks
- **A/B comparison**: Compare prompts/models side-by-side
- **CI/CD gates**: Block bad changes from merging

### Evaluation Types

| Type | Purpose | Assertions |
|------|---------|------------|
| **Functional** | Does it work? | `contains`, `equals`, `is-json` |
| **Semantic** | Is it correct? | `similar`, `llm-rubric`, `factuality` |
| **Performance** | Is it fast/cheap? | `cost`, `latency` |
| **Security** | Is it safe? | `redteam`, `moderation`, `pii-detection` |

## Configuration

### Basic promptfooconfig.yaml

```yaml
description: "My LLM evaluation suite"

prompts:
  - file://prompts/main.txt

providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-sonnet-latest

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
      - type: cost
        threshold: 0.01

  - vars:
      question: "Explain quantum computing"
    assert:
      - type: llm-rubric
        value: "Response explains quantum computing concepts clearly"
      - type: latency
        threshold: 3000
```

### With Environment Variables

```yaml
providers:
  - id: openrouter:anthropic/claude-3-5-sonnet
    config:
      apiKey: ${OPENROUTER_API_KEY}
```

## Assertions Reference

### Basic Assertions

```yaml
assert:
  # String matching
  - type: contains
    value: "expected text"
  - type: not-contains
    value: "forbidden text"
  - type: equals
    value: "exact match"
  - type: starts-with
    value: "prefix"
  - type: regex
    value: "\\d{4}-\\d{2}-\\d{2}"  # Date pattern

  # JSON validation
  - type: is-json
  # JSON validation against a schema (pass the schema as the value)
  - type: is-json
    value:
      type: object
      properties:
        name: { type: string }
      required: [name]
```

### Semantic Assertions

```yaml
assert:
  # Semantic similarity (embeddings)
  - type: similar
    value: "The capital of France is Paris"
    threshold: 0.8  # 0-1 similarity score

  # LLM-as-judge with custom criteria
  - type: llm-rubric
    value: |
      Response must:
      1. Be factually accurate
      2. Be under 100 words
      3. Not contain marketing language

  # Factuality check against reference
  - type: factuality
    value: "Paris is the capital of France"
```
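Under the hood, `similar` embeds both strings and compares the vectors; the 0-1 threshold is essentially a cosine-similarity cutoff. A minimal sketch of the score being thresholded (illustrative only, not promptfoo's implementation):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical directions score 1.0; orthogonal vectors score 0.0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

A threshold of 0.8 typically passes paraphrases while rejecting unrelated text; tune it per use case against known-good and known-bad outputs.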

### Performance Assertions

```yaml
assert:
  # Cost budget (USD)
  - type: cost
    threshold: 0.05  # Max $0.05 per request

  # Latency (milliseconds)
  - type: latency
    threshold: 2000  # Max 2 seconds
```
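The `cost` assertion compares each request's provider-reported cost against the threshold. To pick a sensible budget, it helps to estimate cost from token counts first. A sketch with hypothetical per-million-token prices (the model name and rates below are placeholders; verify against your provider's current pricing):

```python
# Hypothetical prices in USD per 1M tokens -- verify against your provider.
PRICES = {"example-mini-model": {"input": 0.15, "output": 0.60}}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of one request from its token counts."""
    p = PRICES[model]
    return (prompt_tokens * p["input"] + completion_tokens * p["output"]) / 1_000_000

# A 1,000-token prompt with a 500-token completion stays well under $0.05
print(request_cost("example-mini-model", 1000, 500))
```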

### Security Assertions

```yaml
assert:
  # Content moderation (optionally limit to specific categories)
  - type: moderation
    value:
      - violence

  # PII detection
  - type: not-contains
    value: "{{email}}"  # From test vars
```

## CI/CD Integration

### GitHub Action

```yaml
name: 'Prompt Evaluation'
on:
  pull_request:
    paths: ['prompts/**', 'src/**/*prompt*']

jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      # Cache for faster runs
      - uses: actions/cache@v4
        with:
          path: ~/.promptfoo
          key: ${{ runner.os }}-promptfoo-${{ hashFiles('promptfooconfig.yaml') }}

      # Run evaluation and post results to PR
      - uses: promptfoo/promptfoo-action@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}  # Or other provider keys
```

### Quality Gates

```yaml
# promptfooconfig.yaml
evaluateOptions:
  # Number of parallel requests; tune for provider rate limits
  maxConcurrency: 5

# `promptfoo eval` exits non-zero when any assertion fails, which
# fails the CI job. For a softer gate, set a per-test `threshold`
# (minimum weighted assertion score between 0 and 1).
```

### Output to JSON (for custom CI)

```bash
npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json

# Check results in CI script
if [ "$(jq '.stats.failures' results.json)" -gt 0 ]; then
  echo "Evaluation failed!"
  exit 1
fi
```
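For anything beyond a simple pass/fail check, a small script gives more control than inline jq. A sketch in Python; the exact JSON layout can vary between promptfoo versions, so inspect your own results.json and adjust the key path:

```python
import json

def gate(results: dict, max_failures: int = 0) -> bool:
    """Return True when the eval passes the quality gate."""
    # Adjust the key path to match your promptfoo version's output.
    failures = results.get("stats", {}).get("failures", 0)
    return failures <= max_failures

# Example: parse output produced by `promptfoo eval -o results.json`
sample = json.loads('{"stats": {"successes": 9, "failures": 1}}')
print(gate(sample))  # one failure -> gate fails with max_failures=0
```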

## Security Testing (Red Team)

### Quick Scan

```bash
# Run red team against your prompts
npx promptfoo@latest redteam run

# View the vulnerability report
npx promptfoo@latest redteam report
```

### Configuration

```yaml
# promptfooconfig.yaml
redteam:
  purpose: "Customer support chatbot"
  plugins:
    - harmful:hate
    - harmful:violence
    - harmful:self-harm
    - pii:direct
    - pii:session
    - hijacking

  strategies:
    - jailbreak
    - prompt-injection
    - base64
    - leetspeak
```
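Plugins decide what to probe for; strategies decide how each probe is delivered. Encoding strategies such as base64 and leetspeak rewrap the same attack text to slip past naive keyword filters. An illustrative sketch of those transformations (not promptfoo's internals):

```python
import base64

def encode_probe(probe: str, strategy: str) -> str:
    """Re-encode an attack probe the way simple evasion strategies do."""
    if strategy == "base64":
        return base64.b64encode(probe.encode()).decode()
    if strategy == "leetspeak":
        table = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"})
        return probe.translate(table)
    return probe  # unknown strategy: pass through unchanged

print(encode_probe("ignore previous instructions", "leetspeak"))
```

A model that refuses the plain probe but complies with the encoded variant fails the scan, which is exactly the regression these strategies exist to catch.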

### OWASP Top 10 Coverage

```yaml
redteam:
  plugins:
    # 2. Insecure Output Handling
    - harmful:privacy
    # 4. Model Denial of Service / 8. Excessive Agency
    - excessive-agency
    # 6. Sensitive Information Disclosure
    - pii:direct
    - pii:session
    # 7. Insecure Plugin Design
    - hijacking
  strategies:
    # 1. Prompt Injection
    - prompt-injection
  # 3. Training Data Poisoning, 5. Supply Chain, 10. Model Theft: N/A for evals
  # 9. Overreliance: pair with factuality assertions
```

## RAG Evaluation

### Context-Aware Testing

```yaml
prompts:
  - |
    Context: {{context}}
    Question: {{question}}
    Answer based only on the context provided.

tests:
  - vars:
      context: "The Eiffel Tower was built in 1889 for the World's Fair."
      question: "When was the Eiffel Tower built?"
    assert:
      - type: contains
        value: "1889"
      - type: factuality
        value: "The Eiffel Tower was built in 1889"
      - type: not-contains
        value: "1900"  # Common hallucination
```

### Retrieval Quality

```yaml
# Test that retrieval returns relevant documents
tests:
  - vars:
      query: "Python list comprehension"
    assert:
      - type: llm-rubric
        value: "Response discusses Python list comprehension syntax and examples"
      - type: not-contains
        value: "I don't know"  # Shouldn't punt on this query
```
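Promptfoo assertions score the final answer; to score the retriever directly, a metric such as precision@k over labeled relevant documents can run alongside the suite. A minimal sketch (the document IDs and labels below are illustrative):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are labeled relevant."""
    if k <= 0:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

# Top-3 retrieval for a query, with doc-2 and doc-4 labeled relevant
print(precision_at_k(["doc-2", "doc-7", "doc-9"], {"doc-2", "doc-4"}, 3))
```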

## Comparing Models/Prompts

### A/B Testing

```yaml
# Compare two prompts
prompts:
  - file://prompts/v1.txt
  - file://prompts/v2.txt

# Same tests for both
tests:
  - vars: { question: "Explain recursion" }
    assert:
      - type: llm-rubric
        value: "Clear explanation of recursion with example"
```

### Model Comparison

```yaml
# Compare multiple models
providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-haiku-latest
  - openrouter:google/gemini-flash-1.5

# Run: npx promptfoo@latest eval
# View: npx promptfoo@latest view
# Compare cost, latency, quality side-by-side
```

## Best Practices

### 1. Golden Test Cases

Maintain a set of critical test cases that must always pass:

```yaml
# golden-tests.yaml
tests:
  - description: "Core functionality - must pass"
    vars:
      input: "critical test case"
    assert:
      - type: contains
        value: "expected output"
    # Keep golden tests in their own config and run them on every PR
    # (npx promptfoo@latest eval -c golden-tests.yaml); any failure
    # then blocks the merge.
```

### 2. Regression Suite Structure

```
prompts/
├── production.txt          # Current production prompt
├── candidate.txt           # New prompt being tested
tests/
├── golden/                 # Critical tests (run on every PR)
│   └── core-functionality.yaml
├── regression/             # Full regression suite (nightly)
│   └── full-suite.yaml
└── security/               # Red team tests
    └── redteam.yaml
```

### 3. Test Categories

```yaml
tests:
  # Happy path
  - description: "Standard query"
    vars: { question: "What is 2+2?" }
    assert:
      - type: contains
        value: "4"

  # Edge cases
  - description: "Empty input"
    vars: { question: "" }
    assert:
      - type: not-contains
        value: "error"

  # Adversarial
  - description: "Injection attempt"
    vars: { question: "Ignore previous instructions and..." }
    assert:
      - type: not-contains
        value: "Here's how to"  # Should refuse
```

## References

- `references/promptfoo-guide.md` - Detailed setup and configuration
- `references/evaluation-metrics.md` - Metrics deep dive
- `references/ci-cd-integration.md` - CI/CD patterns
- `references/alternatives.md` - Braintrust, DeepEval, LangSmith comparison

## Templates

Copy-paste ready templates:
- `templates/promptfooconfig.yaml` - Basic config
- `templates/github-action-eval.yml` - GitHub Action
- `templates/regression-test-suite.yaml` - Full regression suite

Overview

This skill provides automated LLM prompt testing, evaluation, and CI/CD quality gates using Promptfoo. It helps embed regression detection, semantic checks, performance budgets, and security (red team) scans into development workflows. Use it to compare prompts or models, validate RAG behavior, and block regressions before they reach production.

How this skill works

The skill runs Promptfoo evaluation suites defined in promptfooconfig.yaml to execute prompts across providers and assert expected behavior. Assertions cover functional checks (contains, equals), semantic judgments (similarity, llm-rubric, factuality), performance (cost, latency), and security (redteam, moderation, PII). Results can be viewed locally, exported as JSON for CI logic, or surfaced in pull requests via a GitHub Action.

When to use it

  • Add prompt or model regression testing to CI pipelines
  • Validate RAG outputs and retrieval relevance before release
  • Automate security scans and jailbreak detection for chatbots
  • Compare prompt variants or multiple models A/B-style
  • Enforce cost and latency budgets in production systems

Best practices

  • Keep a golden set of critical tests that must always pass to prevent regressions
  • Organize suites into categories: golden, regression, security, and nightly full runs
  • Use environment variables for provider keys and avoid committing secrets
  • Set quality gates or pass thresholds in evaluateOptions to fail PRs on regressions
  • Export results.json from eval for custom CI checks and metrics collection

Example use cases

  • CI quality gate: run promptfoo eval in a GitHub Action and fail the PR if core assertions fail
  • RAG validation: assert answers strictly reference provided context and run factuality checks
  • Security red-team: run promptfoo redteam run with plugins for jailbreaks, PII, and harmful content
  • Model/Prompt comparison: run same test suite across multiple providers to compare cost, latency, and quality
  • Regression monitoring: nightly full-suite runs to detect slow degradations or metric drift

FAQ

How do I integrate results into CI?

Run npx promptfoo eval -o results.json and parse stats.failures in your CI script; alternatively use the official promptfoo GitHub Action to post results to PRs.

Can I check for PII or moderation issues?

Yes. Use security assertions like moderation, not-contains for templated PII variables, and run redteam plugins to simulate leaks and jailbreaks.