home / skills / openclaw / skills / azure-ai-evaluation-py

azure-ai-evaluation-py skill

/skills/thegovind/azure-ai-evaluation-py

This skill helps evaluate AI applications with built-in and custom evaluators from Azure AI Evaluation SDK, ensuring quality, safety, and measurable

npx playbooks add skill openclaw/skills --skill azure-ai-evaluation-py

Review the files below or copy the command above to add this skill to your agents.

Files (5)
SKILL.md
6.9 KB
---
name: azure-ai-evaluation-py
description: |
  Azure AI Evaluation SDK for Python. Use for evaluating generative AI applications with quality, safety, and custom evaluators.
  Triggers: "azure-ai-evaluation", "evaluators", "GroundednessEvaluator", "evaluate", "AI quality metrics".
package: azure-ai-evaluation
---

# Azure AI Evaluation SDK for Python

Assess generative AI application performance with built-in and custom evaluators.

## Installation

```bash
pip install azure-ai-evaluation

# With remote evaluation support
pip install azure-ai-evaluation[remote]
```

## Environment Variables

```bash
# For AI-assisted evaluators
AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com
AZURE_OPENAI_API_KEY=<your-api-key>
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini

# For Foundry project integration
AIPROJECT_CONNECTION_STRING=<your-connection-string>
```

## Built-in Evaluators

### Quality Evaluators (AI-Assisted)

```python
from azure.ai.evaluation import (
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
    RetrievalEvaluator
)

# Initialize with Azure OpenAI model config
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"]
}

groundedness = GroundednessEvaluator(model_config)
relevance = RelevanceEvaluator(model_config)
coherence = CoherenceEvaluator(model_config)
```

### Quality Evaluators (NLP-based)

```python
from azure.ai.evaluation import (
    F1ScoreEvaluator,
    RougeScoreEvaluator,
    BleuScoreEvaluator,
    GleuScoreEvaluator,
    MeteorScoreEvaluator
)

f1 = F1ScoreEvaluator()
rouge = RougeScoreEvaluator()
bleu = BleuScoreEvaluator()
```

### Safety Evaluators

```python
from azure.ai.evaluation import (
    ViolenceEvaluator,
    SexualEvaluator,
    SelfHarmEvaluator,
    HateUnfairnessEvaluator,
    IndirectAttackEvaluator,
    ProtectedMaterialEvaluator
)

violence = ViolenceEvaluator(azure_ai_project=project_scope)
sexual = SexualEvaluator(azure_ai_project=project_scope)
```

## Single Row Evaluation

```python
from azure.ai.evaluation import GroundednessEvaluator

groundedness = GroundednessEvaluator(model_config)

result = groundedness(
    query="What is Azure AI?",
    context="Azure AI is Microsoft's AI platform...",
    response="Azure AI provides AI services and tools."
)

print(f"Groundedness score: {result['groundedness']}")
print(f"Reason: {result['groundedness_reason']}")
```

## Batch Evaluation with evaluate()

```python
from azure.ai.evaluation import evaluate

result = evaluate(
    data="test_data.jsonl",
    evaluators={
        "groundedness": groundedness,
        "relevance": relevance,
        "coherence": coherence
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}"
            }
        }
    }
)

print(result["metrics"])
```

## Composite Evaluators

```python
from azure.ai.evaluation import QAEvaluator, ContentSafetyEvaluator

# All quality metrics in one
qa_evaluator = QAEvaluator(model_config)

# All safety metrics in one
safety_evaluator = ContentSafetyEvaluator(azure_ai_project=project_scope)

result = evaluate(
    data="data.jsonl",
    evaluators={
        "qa": qa_evaluator,
        "content_safety": safety_evaluator
    }
)
```

## Evaluate Application Target

```python
from azure.ai.evaluation import evaluate
from my_app import chat_app  # Your application

result = evaluate(
    data="queries.jsonl",
    target=chat_app,  # Callable that takes query, returns response
    evaluators={
        "groundedness": groundedness
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${outputs.context}",
                "response": "${outputs.response}"
            }
        }
    }
)
```

## Custom Evaluators

### Code-Based

```python
from azure.ai.evaluation import evaluator

@evaluator
def word_count_evaluator(response: str) -> dict:
    return {"word_count": len(response.split())}

# Use in evaluate()
result = evaluate(
    data="data.jsonl",
    evaluators={"word_count": word_count_evaluator}
)
```

### Prompt-Based

```python
from azure.ai.evaluation import PromptChatTarget

class CustomEvaluator:
    def __init__(self, model_config):
        self.model = PromptChatTarget(model_config)
    
    def __call__(self, query: str, response: str) -> dict:
        prompt = f"Rate this response 1-5: Query: {query}, Response: {response}"
        result = self.model.send_prompt(prompt)
        return {"custom_score": int(result)}
```

## Log to Foundry Project

```python
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

project = AIProjectClient.from_connection_string(
    conn_str=os.environ["AIPROJECT_CONNECTION_STRING"],
    credential=DefaultAzureCredential()
)

result = evaluate(
    data="data.jsonl",
    evaluators={"groundedness": groundedness},
    azure_ai_project=project.scope  # Logs results to Foundry
)

print(f"View results: {result['studio_url']}")
```

## Evaluator Reference

| Evaluator | Type | Metrics |
|-----------|------|---------|
| `GroundednessEvaluator` | AI | groundedness (1-5) |
| `RelevanceEvaluator` | AI | relevance (1-5) |
| `CoherenceEvaluator` | AI | coherence (1-5) |
| `FluencyEvaluator` | AI | fluency (1-5) |
| `SimilarityEvaluator` | AI | similarity (1-5) |
| `RetrievalEvaluator` | AI | retrieval (1-5) |
| `F1ScoreEvaluator` | NLP | f1_score (0-1) |
| `RougeScoreEvaluator` | NLP | rouge scores |
| `ViolenceEvaluator` | Safety | violence (0-7) |
| `SexualEvaluator` | Safety | sexual (0-7) |
| `SelfHarmEvaluator` | Safety | self_harm (0-7) |
| `HateUnfairnessEvaluator` | Safety | hate_unfairness (0-7) |
| `QAEvaluator` | Composite | All quality metrics |
| `ContentSafetyEvaluator` | Composite | All safety metrics |

## Best Practices

1. **Use composite evaluators** for comprehensive assessment
2. **Map columns correctly** — mismatched columns cause silent failures
3. **Log to Foundry** for tracking and comparison across runs
4. **Create custom evaluators** for domain-specific metrics
5. **Use NLP evaluators** when you have ground truth answers
6. **Safety evaluators require** Azure AI project scope
7. **Batch evaluation** is more efficient than single-row loops

## Reference Files

| File | Contents |
|------|----------|
| [references/built-in-evaluators.md](references/built-in-evaluators.md) | Detailed patterns for AI-assisted, NLP-based, and Safety evaluators with configuration tables |
| [references/custom-evaluators.md](references/custom-evaluators.md) | Creating code-based and prompt-based custom evaluators, testing patterns |
| [scripts/run_batch_evaluation.py](scripts/run_batch_evaluation.py) | CLI tool for running batch evaluations with quality, safety, and custom evaluators |

Overview

This skill provides a Python SDK for evaluating generative AI applications using built-in AI-assisted, NLP-based, safety, and custom evaluators. It helps measure quality metrics like groundedness, relevance, coherence, and fluency, plus safety dimensions such as violence and hate. Use it to run single-row checks, batch evaluations, composite assessments, and to log results to Azure Foundry projects.

How this skill works

The SDK exposes evaluator classes (e.g., GroundednessEvaluator, RelevanceEvaluator, F1ScoreEvaluator, ViolenceEvaluator) and an evaluate() function that runs evaluators against data or a callable application target. Evaluators can be AI-assisted (require Azure OpenAI model settings), NLP-based (compare to ground truth), or safety-focused (require an Azure AI project scope). You can also register code-based or prompt-based custom evaluators and output metrics to Foundry for tracking.

When to use it

  • Validating model responses against context or ground truth at scale using batch evaluation.
  • Measuring dialogue quality metrics (groundedness, relevance, coherence, fluency) for releases or A/B tests.
  • Scanning outputs for safety risks like violence, sexual content, or self-harm before deployment.
  • Comparing system versions or logging evaluation runs to Azure Foundry for historical analysis.
  • Adding domain-specific checks via custom code-based or prompt-based evaluators.

Best practices

  • Use composite evaluators (QAEvaluator, ContentSafetyEvaluator) for broad coverage with a single call.
  • Ensure correct column mapping in evaluator_config to avoid silent mismatches between input and evaluator fields.
  • Prefer batch evaluation over per-row loops for efficiency and consistent metrics.
  • Provide Azure OpenAI model config for AI-assisted evaluators and set the Azure AI project scope for safety evaluators.
  • Create and test custom evaluators locally before including them in large-scale runs.

Example use cases

  • Run groundedness and relevance checks on a customer support assistant using a JSONL test set.
  • Automate safety scans for model releases and log results to Foundry for compliance audits.
  • Evaluate a chatbot target callable directly by passing it as the evaluate() target to measure live behavior.
  • Add a custom word-count or rubric-based prompt evaluator to track domain-specific quality metrics.
  • Compare NLP metrics (F1, ROUGE, BLEU) against human-labeled answers for information extraction models.

FAQ

Do AI-assisted evaluators require additional configuration?

Yes. AI-assisted evaluators need Azure OpenAI endpoint, API key, and deployment set in model_config or environment variables.

How do I log results to Foundry?

Create an AIProjectClient with your connection string and pass its project scope to evaluate() using azure_ai_project to enable logging and Studio URLs.