
deepeval skill


This skill helps you evaluate LLM applications using DeepEval's pytest-based metrics and tracing to ensure quality and safety.

npx playbooks add skill sammcj/agentic-coding --skill deepeval

Review the files below or copy the command above to add this skill to your agents.

Files (5)
SKILL.md
13.8 KB
---
name: deepeval
description: Use when discussing or working with DeepEval (the Python AI evaluation framework)
---

# DeepEval

## Overview

DeepEval is a pytest-based framework for testing LLM applications. It provides 50+ evaluation metrics covering RAG pipelines, conversational AI, agents, safety, and custom criteria. DeepEval integrates into development workflows through pytest, supports multiple LLM providers, and includes component-level tracing with the `@observe` decorator.

**Repository:** https://github.com/confident-ai/deepeval
**Documentation:** https://deepeval.com

## Installation

```bash
pip install -U deepeval
```

Requires Python 3.9+.

## Quick Start

### Basic pytest test

```python
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_chatbot():
    metric = AnswerRelevancyMetric(threshold=0.7, model="anthropic-claude-sonnet-4-5")
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="You have 30 days for full refund"
    )
    assert_test(test_case, [metric])
```

Run with: `deepeval test run test_chatbot.py`
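Because `assert_test` is built on standard pytest assertions, the same file should also run under plain pytest (without DeepEval's richer run reporting); a hedged alternative:

```bash
pytest test_chatbot.py
```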

### Environment setup

DeepEval automatically loads `.env.local` then `.env`:

```bash
# .env
OPENAI_API_KEY="sk-..."
```
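If you evaluate with Anthropic models, as this skill prefers, the key is normally picked up from the standard Anthropic SDK environment variable — a sketch, assuming your chosen model class respects that variable:

```bash
# .env
ANTHROPIC_API_KEY="sk-ant-..."
```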

## Core Workflows

### RAG Evaluation

Evaluate both retrieval and generation phases:

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric
)

# Retrieval metrics
contextual_precision = ContextualPrecisionMetric(threshold=0.7)
contextual_recall = ContextualRecallMetric(threshold=0.7)
contextual_relevancy = ContextualRelevancyMetric(threshold=0.7)

# Generation metrics
answer_relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.8)

test_case = LLMTestCase(
    input="What are the side effects of aspirin?",
    actual_output="Common side effects include stomach upset and nausea.",
    expected_output="Aspirin side effects include gastrointestinal issues.",
    retrieval_context=[
        "Aspirin common side effects: stomach upset, nausea, vomiting.",
        "Serious aspirin side effects: gastrointestinal bleeding.",
    ]
)

evaluate(test_cases=[test_case], metrics=[
    contextual_precision, contextual_recall, contextual_relevancy,
    answer_relevancy, faithfulness
])
```

**Component-level tracing:**

```python
from deepeval.tracing import observe, update_current_span

@observe(metrics=[contextual_relevancy])
def retriever(query: str):
    chunks = your_vector_db.search(query)
    update_current_span(
        test_case=LLMTestCase(input=query, retrieval_context=chunks)
    )
    return chunks

@observe(metrics=[answer_relevancy, faithfulness])
def generator(query: str, chunks: list):
    response = your_llm.generate(query, chunks)
    update_current_span(
        test_case=LLMTestCase(
            input=query,
            actual_output=response,
            retrieval_context=chunks
        )
    )
    return response

@observe
def rag_pipeline(query: str):
    chunks = retriever(query)
    return generator(query, chunks)
```

### Conversational AI Evaluation

Test multi-turn dialogues:

```python
from deepeval import evaluate
from deepeval.test_case import Turn, ConversationalTestCase
from deepeval.metrics import (
    RoleAdherenceMetric,
    KnowledgeRetentionMetric,
    ConversationCompletenessMetric,
    TurnRelevancyMetric
)

convo_test_case = ConversationalTestCase(
    chatbot_role="professional, empathetic medical assistant",
    turns=[
        Turn(role="user", content="I have a persistent cough"),
        Turn(role="assistant", content="How long have you had this cough?"),
        Turn(role="user", content="About a week now"),
        Turn(role="assistant", content="A week-long cough should be evaluated.")
    ]
)

metrics = [
    RoleAdherenceMetric(threshold=0.7),
    KnowledgeRetentionMetric(threshold=0.7),
    ConversationCompletenessMetric(threshold=0.6),
    TurnRelevancyMetric(threshold=0.7)
]

evaluate(test_cases=[convo_test_case], metrics=metrics)
```

### Agent Evaluation

Test tool usage and task completion:

```python
from deepeval import evaluate
from deepeval.test_case import Turn, ToolCall, ConversationalTestCase
from deepeval.metrics import (
    TaskCompletionMetric,
    ToolUseMetric,
    ArgumentCorrectnessMetric
)

agent_test_case = ConversationalTestCase(
    turns=[
        Turn(role="user", content="When did Trump first raise tariffs?"),
        Turn(
            role="assistant",
            content="Let me search for that information.",
            tools_called=[
                ToolCall(
                    name="WebSearch",
                    arguments={"query": "Trump first raised tariffs year"}
                )
            ]
        ),
        Turn(role="assistant", content="Trump first raised tariffs in 2018.")
    ]
)

evaluate(
    test_cases=[agent_test_case],
    metrics=[
        TaskCompletionMetric(threshold=0.7),
        ToolUseMetric(threshold=0.7),
        ArgumentCorrectnessMetric(threshold=0.7)
    ]
)
```

### Safety Evaluation

Check for harmful content:

```python
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ToxicityMetric,
    BiasMetric,
    PIILeakageMetric,
    HallucinationMetric
)

def safety_gate(output: str, input: str) -> tuple[bool, list]:
    """Returns (passed, reasons) tuple"""
    test_case = LLMTestCase(input=input, actual_output=output)

    safety_metrics = [
        ToxicityMetric(threshold=0.5),
        BiasMetric(threshold=0.5),
        PIILeakageMetric(threshold=0.5)
    ]

    failures = []
    for metric in safety_metrics:
        metric.measure(test_case)
        if not metric.is_successful():
            failures.append(f"{metric.__name__}: {metric.reason}")

    return len(failures) == 0, failures
```
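A call site for the gate above might look like this (the strings are purely illustrative):

```python
passed, reasons = safety_gate(
    output="Your order ships tomorrow.",
    input="When will my order arrive?",
)
if not passed:
    # Block or regenerate the response before it reaches the user
    print("Safety gate failed:", reasons)
```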

## Metric Selection Guide

### RAG Metrics

**Retrieval Phase:**
- `ContextualPrecisionMetric` - Relevant chunks ranked higher than irrelevant ones
- `ContextualRecallMetric` - All necessary information retrieved
- `ContextualRelevancyMetric` - Retrieved chunks relevant to input

**Generation Phase:**
- `AnswerRelevancyMetric` - Output addresses the input query
- `FaithfulnessMetric` - Output grounded in retrieval context

### Conversational Metrics

- `TurnRelevancyMetric` - Each turn relevant to conversation
- `KnowledgeRetentionMetric` - Information retained across turns
- `ConversationCompletenessMetric` - All aspects addressed
- `RoleAdherenceMetric` - Chatbot maintains assigned role
- `TopicAdherenceMetric` - Conversation stays on topic

### Agent Metrics

- `TaskCompletionMetric` - Task successfully completed
- `ToolUseMetric` - Correct tools selected
- `ArgumentCorrectnessMetric` - Tool arguments correct
- `MCPUseMetric` - MCP correctly used

### Safety Metrics

- `ToxicityMetric` - Harmful content detection
- `BiasMetric` - Biased outputs identification
- `HallucinationMetric` - Fabricated information (requires `context` on the test case; see the sketch after this list)
- `PIILeakageMetric` - Personal information leakage
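Unlike the other safety metrics in the gate above, `HallucinationMetric` also needs the source `context` it scores contradictions against; a minimal sketch, assuming you already hold the relevant document chunks:

```python
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric

context = ["The return window is 30 days from delivery."]

test_case = LLMTestCase(
    input="How long do I have to return shoes?",
    actual_output="You can return them within 90 days.",  # contradicts the context
    context=context,
)

# Lower is better here: the score reflects contradictions against the context
metric = HallucinationMetric(threshold=0.5)
metric.measure(test_case)
print(metric.score, metric.reason)
```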

### Custom Metrics

**G-Eval (LLM-based):**

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

custom_metric = GEval(
    name="Professional Tone",
    criteria="Determine if response maintains professional, empathetic tone",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
    model="anthropic-claude-sonnet-4-5"
)
```

**BaseMetric subclass:**

See `references/custom_metrics.md` for a complete guide to creating custom metrics by subclassing `BaseMetric`, including deterministic scorers (ROUGE, BLEU, BERTScore).
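As a rough illustration of the shape such a subclass takes (the metric name and scoring logic below are hypothetical, not taken from the reference guide), a deterministic keyword-coverage metric might look like this:

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class KeywordCoverageMetric(BaseMetric):
    """Hypothetical metric: fraction of required keywords present in the output."""

    def __init__(self, keywords: list[str], threshold: float = 0.5):
        self.keywords = keywords
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        output = (test_case.actual_output or "").lower()
        hits = sum(1 for kw in self.keywords if kw.lower() in output)
        self.score = hits / len(self.keywords) if self.keywords else 0.0
        self.reason = f"{hits}/{len(self.keywords)} required keywords present"
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Keyword Coverage"
```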

## Configuration

### LLM Provider Setup

DeepEval supports OpenAI, Anthropic Claude, Google Gemini, AWS Bedrock, and 100+ providers via LiteLLM. This skill prefers Anthropic models for evaluation.

**CLI configuration (global):**

```bash
deepeval set-azure-openai --openai-endpoint=... --openai-api-key=... --deployment-name=...
deepeval set-ollama deepseek-r1:1.5b
```

**Python configuration (per-metric):**

```python
from deepeval.models import AnthropicModel, OllamaModel

anthropic_model = AnthropicModel(
    model_id=settings.anthropic_model_id,
    client_args={"api_key": settings.anthropic_api_key},
    temperature=settings.agent_temperature
)

metric = AnswerRelevancyMetric(model=anthropic_model)
```

See `references/model_providers.md` for the complete provider configuration guide.

### Performance Optimisation

Async mode is enabled by default. Configure with `AsyncConfig` and `CacheConfig`:

```python
from deepeval import evaluate, AsyncConfig, CacheConfig

evaluate(
    test_cases=[...],
    metrics=[...],
    async_config=AsyncConfig(
        run_async=True,
        max_concurrent=20,    # Reduce if rate limited
        throttle_value=0      # Delay between test cases (seconds)
    ),
    cache_config=CacheConfig(
        use_cache=True,       # Read from cache
        write_cache=True      # Write to cache
    )
)
```

**CLI parallelisation:**

```bash
deepeval test run -n 4 -c -i  # 4 processes, cached, ignore errors
```

**Best practices:**
- Limit to 5 metrics maximum (2-3 generic + 1-2 custom)
- Use the latest available Anthropic Claude Sonnet or Haiku models
- Reduce `max_concurrent` to 5 if hitting rate limits
- Use `evaluate()` function over individual `measure()` calls

See `references/async_performance.md` for the detailed performance optimisation guide.

## Dataset Management

### Loading datasets

```python
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset()

# From CSV
dataset.add_goldens_from_csv_file(
    file_path="./test_data.csv",
    input_col_name="question",
    expected_output_col_name="answer",
    context_col_name="context",
    context_col_delimiter="|"
)

# From JSON
dataset.add_goldens_from_json_file(
    file_path="./test_data.json",
    input_key_name="query",
    expected_output_key_name="response"
)
```
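Loaded goldens carry inputs and expected outputs but no `actual_output`; generate those with your application before evaluating. A minimal sketch, assuming a hypothetical `my_app(question: str) -> str` callable:

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

test_cases = []
for golden in dataset.goldens:
    test_cases.append(
        LLMTestCase(
            input=golden.input,
            actual_output=my_app(golden.input),  # hypothetical app under test
            expected_output=golden.expected_output,
            context=golden.context,
        )
    )

evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric(threshold=0.7)])
```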

### Synthetic generation

```python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()

# From documents
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["./docs/knowledge_base.pdf"],
    max_goldens_per_document=10,
    evolution_types=["REASONING", "MULTICONTEXT", "COMPARATIVE"]
)

# From scratch
goldens = synthesizer.generate_goldens_from_scratch(
    subject="customer support for SaaS product",
    task="answer user questions about billing",
    max_goldens=20
)
```

**Evolution types:** REASONING, MULTICONTEXT, CONCRETISING, CONSTRAINED, COMPARATIVE, HYPOTHETICAL, IN_BREADTH

See `references/dataset_management.md` for the complete dataset guide, including versioning and cloud integration.

## Test Case Types

### Single-turn (LLMTestCase)

```python
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="You have 30 days for full refund",
    expected_output="We offer 30-day full refund",
    retrieval_context=["All customers eligible for 30 day refund"],
    tools_called=[ToolCall(name="...", arguments={"...": "..."})]
)
```

### Multi-turn (ConversationalTestCase)

```python
from deepeval.test_case import Turn, ConversationalTestCase

convo_test_case = ConversationalTestCase(
    chatbot_role="helpful customer service agent",
    turns=[
        Turn(role="user", content="I need help with my order"),
        Turn(role="assistant", content="I'd be happy to help"),
        Turn(role="user", content="It hasn't arrived yet")
    ]
)
```

### Multimodal (MLLMTestCase)

```python
from deepeval.test_case import MLLMTestCase, MLLMImage

m_test_case = MLLMTestCase(
    input=["Describe this image", MLLMImage(url="./photo.png", local=True)],
    actual_output=["A red bicycle leaning against a wall"]
)
```

## CI/CD Integration

```yaml
# .github/workflows/test.yml
name: LLM Tests
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - name: Install dependencies
        run: pip install deepeval
      - name: Run evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: deepeval test run tests/
```

## References

Detailed implementation guides:

- **references/model_providers.md** - Complete guide for configuring OpenAI, Anthropic, Gemini, Bedrock, and local models. Includes provider-specific considerations, cost analysis, and troubleshooting.

- **references/custom_metrics.md** - Complete guide for creating custom metrics by subclassing BaseMetric. Includes deterministic scorers (ROUGE, BLEU, BERTScore) and LLM-based evaluation patterns.

- **references/async_performance.md** - Complete guide for optimising evaluation performance with async mode, caching, concurrency tuning, and rate limit handling.

- **references/dataset_management.md** - Complete guide for dataset loading, saving, synthetic generation, versioning, and cloud integration with Confident AI.

## Best Practices

### Metric Selection
- Match metrics to use case (RAG systems need retrieval + generation metrics)
- Start with 2-3 essential metrics, expand as needed
- Use appropriate thresholds (0.7-0.8 for production, 0.5-0.6 for development)
- Combine complementary metrics (answer relevancy + faithfulness)

### Test Case Design
- Create representative examples covering common queries and edge cases
- Include context when needed (`retrieval_context` for RAG, `expected_output` for G-Eval)
- Use datasets for scale testing
- Version test cases over time

### Evaluation Workflow
- Component-level first - Use `@observe` for individual parts
- End-to-end validation before deployment
- Automate in CI/CD with `deepeval test run`
- Track results over time with Confident AI cloud

### Testing Anti-Patterns

**Avoid:**
- Testing only happy paths
- Using unrealistic inputs
- Ignoring metric reasons
- Setting thresholds too high initially
- Running full test suite on every change

**Do:**
- Test edge cases and failure modes
- Use real user queries as test inputs
- Read and analyse metric reasons
- Adjust thresholds based on empirical results
- Use component-level tests during development
- Separate config and eval content from code

Overview

This skill helps you evaluate LLM applications using DeepEval, a pytest-based evaluation framework focused on RAG, conversational AI, agents, safety, and custom metrics. It streamlines metric-driven testing, component tracing, and CI/CD integration to validate model behavior and guardrails. Use it to run reproducible tests, measure faithfulness and relevancy, and automate evaluations in pipelines.

How this skill works

The skill uses DeepEval test case types (single-turn, conversational, multimodal) and 50+ built-in metrics to score outputs against expected behavior. You can instrument components with @observe for span-level tracing, run asynchronous or cached evaluations, and configure providers per-metric or globally. Results integrate with pytest and can be automated in CI/CD workflows.

When to use it

  • Validating RAG pipelines (retrieval + generation)
  • Testing multi-turn conversational assistants and role adherence
  • Measuring agent tool usage and task completion
  • Running safety gates for toxicity, bias, or PII leakage
  • Automating model checks in CI/CD before deployment

Best practices

  • Start with 2–3 core metrics and add 1–2 custom metrics only as needed
  • Use component-level tests with @observe before end-to-end runs
  • Prefer Anthropic Sonnet/Haiku models when available for evaluation
  • Keep thresholds moderate in development (0.5–0.6) and tighten for production (0.7–0.8)
  • Use async and cache configs to balance throughput and rate limits; reduce concurrency if throttled

Example use cases

  • RAG evaluation: measure contextual precision, recall, relevancy, answer relevancy and faithfulness on retrieved chunks
  • Conversational evaluation: validate role adherence, knowledge retention, turn relevancy across multi-turn dialogues
  • Agent evaluation: confirm correct tool selection, argument correctness, and task completion for tool-using agents
  • Safety gate: run toxicity, bias, and PII checks and return pass/fail with failure reasons
  • CI automation: run deepeval test run in GitHub Actions to fail PRs on regressions

FAQ

How do I configure a specific LLM provider for a metric?

You can pass a provider model object (e.g., AnthropicModel, OllamaModel) when constructing a metric, or set a global provider via the CLI. Provider-specific client args and model_id are supported per metric.

Can I create custom metrics?

Yes. Create a custom metric by subclassing BaseMetric or use GEval for LLM-based evaluators. Deterministic scorers like ROUGE, BLEU, and BERTScore are supported for custom metrics.