
This skill helps you evaluate end-to-end AI system performance, guiding model selection, benchmarking, cost and latency analysis, and architecture decisions.

npx playbooks add skill doanchienthangdev/omgkit --skill ai-system-evaluation

---
name: ai-system-evaluation
description: End-to-end AI system evaluation - model selection, benchmarks, cost/latency analysis, build vs buy decisions. Use when selecting models, designing eval pipelines, or making architecture decisions.
---

# AI System Evaluation

A framework for evaluating AI systems end-to-end: domain capability, generation quality, cost, and latency.

## Evaluation Criteria

### 1. Domain-Specific Capability

| Domain | Benchmarks |
|--------|------------|
| Math & Reasoning | GSM-8K, MATH |
| Code | HumanEval, MBPP |
| Knowledge | MMLU, ARC |
| Multi-turn Chat | MT-Bench |

### 2. Generation Quality

| Criterion | Measurement |
|-----------|-------------|
| Factual Consistency | NLI, SAFE, SelfCheckGPT |
| Coherence | AI judge rubric |
| Relevance | Semantic similarity |
| Fluency | Perplexity |
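The relevance criterion above (semantic similarity) can be sketched as cosine similarity between embedding vectors. The `embed` callable here is a placeholder for whatever embedding model you use; the similarity math itself is standard:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def relevance_score(embed, question: str, answer: str) -> float:
    """How semantically close is the answer to the question?
    `embed` is any function mapping text to a vector."""
    return cosine_similarity(embed(question), embed(answer))
```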

### 3. Cost & Latency

```python
from dataclasses import dataclass

@dataclass
class PerformanceMetrics:
    ttft: float        # Time to First Token (seconds)
    tpot: float        # Time Per Output Token (seconds)
    throughput: float  # Tokens per second

    def cost(self, input_tokens: int, output_tokens: int, prices: dict) -> float:
        """Total request cost given per-token prices for input and output."""
        return input_tokens * prices["input"] + output_tokens * prices["output"]
```
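A quick worked example of the same per-token arithmetic. The prices below are illustrative placeholders, not any provider's real rates:

```python
# Hypothetical per-token prices in USD; substitute your provider's real rates.
prices = {"input": 3e-06, "output": 15e-06}

input_tokens, output_tokens = 2_000, 500
request_cost = input_tokens * prices["input"] + output_tokens * prices["output"]
# 2000 * 3e-06 + 500 * 15e-06 = 0.006 + 0.0075 = 0.0135 USD

# Rough end-to-end latency: TTFT plus one TPOT per generated token.
ttft, tpot = 0.4, 0.02
latency = ttft + output_tokens * tpot  # 0.4 + 500 * 0.02 = 10.4 seconds
```

Per-request numbers like these, multiplied by expected traffic, are what feed the cost budget in the workflow below.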

## Model Selection Workflow

```
1. Define Requirements
   ├── Task type
   ├── Quality threshold
   ├── Latency requirements (<2s TTFT)
   ├── Cost budget
   └── Deployment constraints

2. Filter Options
   ├── API vs Self-hosted
   ├── Open source vs Proprietary
   └── Size constraints

3. Benchmark on Your Data
   ├── Create eval dataset (100+ examples)
   ├── Run experiments
   └── Analyze results

4. Make Decision
   └── Balance quality, cost, latency
```
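Step 3 ("Benchmark on Your Data") can be sketched as a simple loop that runs each candidate over your eval set and aggregates quality and latency. `generate` and `score` are placeholders for your model call and your quality metric:

```python
import time

def benchmark(model_name, generate, score, eval_set):
    """Run one candidate model over an eval set; return mean quality and latency.

    generate(model_name, input_text) -> model output (your client call)
    score(output, expected) -> quality in [0, 1] (your metric)
    eval_set: list of {"input": ..., "expected": ...} dicts
    """
    results = []
    for example in eval_set:
        start = time.perf_counter()
        output = generate(model_name, example["input"])
        latency = time.perf_counter() - start
        results.append({"quality": score(output, example["expected"]),
                        "latency": latency})
    n = len(results)
    return {"model": model_name,
            "mean_quality": sum(r["quality"] for r in results) / n,
            "mean_latency": sum(r["latency"] for r in results) / n}
```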

## Build vs Buy

| Factor | API | Self-Host |
|--------|-----|-----------|
| Data Privacy | Less control | Full control |
| Performance | Best models | Slightly behind |
| Cost at Scale | Expensive | Amortized |
| Customization | Limited | Full control |
| Maintenance | Zero | Significant |
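The "Cost at Scale" row comes down to a break-even calculation: per-token API spend grows with volume, while self-hosting is roughly a fixed monthly cost. A back-of-envelope sketch, with all numbers purely illustrative:

```python
# Illustrative figures only -- substitute your real quotes.
api_price_per_1m_tokens = 10.0     # USD per million tokens (blended in/out)
self_host_fixed_monthly = 4_000.0  # GPUs, hosting, ops time (USD per month)

# Monthly token volume at which self-hosting starts to win.
break_even_tokens = (self_host_fixed_monthly / api_price_per_1m_tokens) * 1_000_000
# 4000 / 10 * 1e6 = 400 million tokens per month
```

Below the break-even volume the API is cheaper; above it, self-hosting amortizes, assuming quality is acceptable and you can absorb the maintenance burden.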

## Public Benchmarks

| Benchmark | Focus |
|-----------|-------|
| MMLU | Knowledge (57 subjects) |
| HumanEval | Code generation |
| GSM-8K | Math reasoning |
| TruthfulQA | Factuality |
| MT-Bench | Multi-turn chat |

**Caution**: Benchmarks can be gamed. Data contamination is common. Always evaluate on YOUR data.

## Best Practices

1. Test on domain-specific data
2. Measure both quality and cost
3. Consider latency requirements
4. Plan for fallback models
5. Re-evaluate periodically
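Practice 4 (fallback models) can be as simple as wrapping the primary call and degrading to a cheaper model on error. `call_primary` and `call_fallback` stand in for your actual client calls:

```python
def generate_with_fallback(prompt, call_primary, call_fallback):
    """Try the primary model; on any failure, fall back to a cheaper one.
    Both callables take a prompt string and return generated text."""
    try:
        return call_primary(prompt)
    except Exception:
        # In production you would log the failure and alert on its rate.
        return call_fallback(prompt)
```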

Overview

This skill provides an end-to-end framework for evaluating AI systems, covering model selection, benchmarking, cost and latency analysis, and build-vs-buy trade-offs. It guides teams to define requirements, run domain-specific benchmarks, and balance quality, cost, and deployment constraints. The goal is to make evaluation repeatable and decision-focused for production use.

How this skill works

The skill inspects task requirements, filters candidate models (API vs self-hosted, open vs proprietary), and runs benchmarks on your data using public and custom datasets. It measures generation quality (factuality, coherence, relevance, fluency) and performance metrics (time-to-first-token, throughput, cost per token). Results feed a decision matrix that weighs quality, latency, cost, and operational constraints to produce a recommended path.
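The decision matrix mentioned above can be sketched as a weighted score per candidate. The weights and per-criterion scores below are made up for illustration; in practice each score comes from your benchmarks, normalized so that higher is better (e.g. inverted latency and cost):

```python
# Illustrative weights and normalized scores (higher = better).
weights = {"quality": 0.5, "latency": 0.3, "cost": 0.2}
candidates = {
    "model_a": {"quality": 0.9, "latency": 0.6, "cost": 0.4},
    "model_b": {"quality": 0.7, "latency": 0.9, "cost": 0.9},
}

scores = {name: sum(weights[k] * v for k, v in feats.items())
          for name, feats in candidates.items()}
best = max(scores, key=scores.get)
# model_a: 0.45 + 0.18 + 0.08 = 0.71
# model_b: 0.35 + 0.27 + 0.18 = 0.80  -> recommended
```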

When to use it

  • Selecting a model for a new product or feature
  • Designing an evaluation pipeline for production validation
  • Deciding between API-based and self-hosted deployments
  • Comparing multiple models on domain-specific datasets
  • Planning cost and latency constraints for real-time systems

Best practices

  • Always evaluate on your own domain-specific dataset (100+ examples recommended)
  • Measure both qualitative metrics (factuality, coherence) and operational metrics (TTFT, throughput, cost)
  • Set clear acceptance thresholds up front: quality, latency (<2s TTFT), and budget
  • Include fallback or lighter models for degraded conditions and graceful degradation
  • Re-run evaluations periodically and after model updates to detect drift or contamination

Example use cases

  • Benchmarking code-generation models with HumanEval or MBPP plus your internal tests
  • Evaluating math and reasoning models using GSM-8K and a proprietary problem set
  • Comparing API vs self-hosted options for a privacy-sensitive recommendation engine
  • Estimating production cost and latency trade-offs using TTFT, tokens/sec, and per-token pricing
  • Designing a multi-tier model stack: high-quality slow model + fast low-cost fallback

FAQ

How many samples should I use for evaluation?

Aim for 100+ representative examples per use case. More examples improve statistical confidence, but prioritize diversity and realism over sheer volume.
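The 100+ figure has a statistical basis: the confidence interval around a measured pass rate shrinks with the square root of the sample size. A sketch using the normal approximation to the binomial (illustrative, not a substitute for a proper power analysis):

```python
import math

def ci_half_width(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of the ~95% normal-approximation confidence interval
    for an observed pass rate over n examples."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)

# At an observed pass rate of 0.8:
#   n = 30  -> roughly +/- 0.14  (too wide to separate close models)
#   n = 100 -> roughly +/- 0.08
```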

Which benchmarks are most useful?

Use public benchmarks as reference (MMLU, HumanEval, GSM-8K, MT-Bench) but prioritize evaluations on your domain data to avoid contamination and overfitting.