---
name: ai-system-evaluation
description: End-to-end AI system evaluation - model selection, benchmarks, cost/latency analysis, build vs buy decisions. Use when selecting models, designing eval pipelines, or making architecture decisions.
---
# AI System Evaluation
A framework for evaluating AI systems end-to-end, from capability benchmarks to cost, latency, and deployment trade-offs.
## Evaluation Criteria
### 1. Domain-Specific Capability
| Domain | Benchmarks |
|--------|------------|
| Math & Reasoning | GSM-8K, MATH |
| Code | HumanEval, MBPP |
| Knowledge | MMLU, ARC |
| Multi-turn Chat | MT-Bench |
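Code benchmarks such as HumanEval are usually reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. The sketch below uses the standard unbiased estimator, computed from n samples per problem of which c passed. The function names are illustrative and not part of this skill.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples generated, c of them passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a passing one.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    if not results:
        raise ValueError("need at least one problem")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Generating more samples per problem than k (n > k) reduces the variance of the estimate.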
### 2. Generation Quality
| Criterion | Measurement |
|-----------|-------------|
| Factual Consistency | NLI, SAFE, SelfCheckGPT |
| Coherence | AI judge rubric |
| Relevance | Semantic similarity |
| Fluency | Perplexity |
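Two of the measurements above reduce to short formulas. The sketch below assumes you already have per-token natural-log probabilities from a language model (for perplexity) and embedding vectors from some embedding model (for semantic similarity); how you obtain them is outside the scope of this example.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the mean negative log-probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm
```

Lower perplexity indicates text the scoring model finds more predictable. It is a proxy for fluency, not a measure of correctness.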
### 3. Cost & Latency
```python
from dataclasses import dataclass


@dataclass
class PerformanceMetrics:
    ttft: float        # Time to First Token (seconds)
    tpot: float        # Time Per Output Token (seconds)
    throughput: float  # Tokens/second

    def cost(self, input_tokens, output_tokens, prices):
        # prices maps "input" and "output" to a per-token price
        return input_tokens * prices["input"] + output_tokens * prices["output"]
```
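As a rough illustration of how these numbers might be obtained, the sketch below derives TTFT, TPOT, and throughput from token arrival timestamps recorded while streaming a response. It also includes a cost helper for prices quoted per million tokens, which is an assumed convention here. All names and prices are hypothetical.

```python
def summarize_latency(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Derive TTFT, TPOT, and throughput from token arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    n = len(token_times)
    ttft = token_times[0] - request_start
    # TPOT averages the gaps between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    total = token_times[-1] - request_start
    throughput = n / total if total > 0 else float("inf")
    return {"ttft": ttft, "tpot": tpot, "throughput": throughput}


def request_cost(input_tokens: int, output_tokens: int,
                 price_per_million: dict[str, float]) -> float:
    """Cost of one request when prices are quoted per million tokens (assumption)."""
    total = (input_tokens * price_per_million["input"]
             + output_tokens * price_per_million["output"])
    return total / 1_000_000
```

In practice the timestamps would come from a monotonic clock such as `time.perf_counter()`, sampled as each streamed chunk arrives.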
## Model Selection Workflow
```
1. Define Requirements
   ├── Task type
   ├── Quality threshold
   ├── Latency requirements (<2s TTFT)
   ├── Cost budget
   └── Deployment constraints
2. Filter Options
   ├── API vs Self-hosted
   ├── Open source vs Proprietary
   └── Size constraints
3. Benchmark on Your Data
   ├── Create eval dataset (100+ examples)
   ├── Run experiments
   └── Analyze results
4. Make Decision
   └── Balance quality, cost, latency
```
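The final decision step could be sketched as follows, assuming each candidate has already been benchmarked on your own eval set. The policy shown (treat the requirements as hard limits, then pick the cheapest viable model) is only one way to balance the three factors. The `Candidate` fields and thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    quality: float           # score on your eval set, 0..1
    ttft_s: float            # measured time to first token, seconds
    cost_per_request: float  # measured average cost, USD


def select_model(candidates: list[Candidate], min_quality: float,
                 max_ttft_s: float, max_cost: float) -> Optional[Candidate]:
    """Return the cheapest candidate that meets every hard requirement, or None."""
    viable = [
        c for c in candidates
        if c.quality >= min_quality
        and c.ttft_s <= max_ttft_s
        and c.cost_per_request <= max_cost
    ]
    if not viable:
        return None
    # Among equally cheap candidates, prefer the one with higher quality.
    return min(viable, key=lambda c: (c.cost_per_request, -c.quality))
```

A `None` result means no candidate satisfies the requirements, which is a signal to relax a constraint or widen the candidate pool.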
## Build vs Buy
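| Factor | API (Buy) | Self-Host (Build) |
|--------|-----------|-------------------|
| Data Privacy | Data leaves your infrastructure; less control | Full control |
| Model Quality | Access to the strongest proprietary models | Open models often slightly behind |
| Cost at Scale | Grows linearly with usage | Fixed infrastructure cost amortized over volume |
| Customization | Limited to what the provider exposes | Full control |
| Maintenance | Minimal (provider-managed) | Significant |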
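The "Cost at Scale" row can be made concrete with a break-even estimate. This is a simplified sketch that assumes a fixed monthly self-hosting cost (hardware plus operations) and a constant per-request cost on each side. The function name and all figures are hypothetical.

```python
from typing import Optional


def break_even_requests(fixed_monthly_cost: float,
                        api_cost_per_request: float,
                        selfhost_cost_per_request: float) -> Optional[float]:
    """Monthly request volume above which self-hosting is cheaper than the API.

    Returns None when self-hosting never pays off, i.e. when its marginal
    cost per request is not lower than the API's.
    """
    saving = api_cost_per_request - selfhost_cost_per_request
    if saving <= 0:
        return None
    return fixed_monthly_cost / saving
```

For example, with a hypothetical fixed cost of 5,000 per month and a saving of 0.008 per request, self-hosting breaks even at 625,000 requests per month.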
## Public Benchmarks
| Benchmark | Focus |
|-----------|-------|
| MMLU | Knowledge (57 subjects) |
| HumanEval | Code generation |
| GSM-8K | Math reasoning |
| TruthfulQA | Factuality |
| MT-Bench | Multi-turn chat |
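**Caution**: Benchmarks can be gamed, and data contamination (benchmark items leaking into training data) is common, so public scores can overstate real-world performance. Always evaluate on your own data.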
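One simple heuristic for spotting contamination is n-gram overlap between an eval example and a reference corpus. The sketch below only applies when you have access to a sample of the suspected training text, which is often not the case for API models. The function names and the default n-gram size are assumptions.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(example: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the example's n-grams that also occur in the reference corpus."""
    example_grams = ngrams(example, n)
    if not example_grams:
        # Example is shorter than n tokens, so there is nothing to compare.
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(example_grams & corpus_grams) / len(example_grams)
```

A ratio near 1.0 suggests the example appears almost verbatim in the corpus. A low ratio does not prove the example is clean, since paraphrased leaks go undetected.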
## Best Practices
1. Test on domain-specific data
2. Measure both quality and cost
3. Consider latency requirements
4. Plan for fallback models
5. Re-evaluate periodically
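A fallback chain (practice 4) can be sketched as trying models in priority order and moving on when a call fails. Here each model is represented by a plain callable, which is a stand-in for whatever client you actually use. The names are hypothetical.

```python
from typing import Callable, Sequence


def generate_with_fallback(
    prompt: str,
    models: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (name, output) from the first success."""
    errors = []
    for name, call in models:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad catch is deliberate in this sketch
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Returning the model name alongside the output makes it possible to track how often the fallback path is taken, which feeds into periodic re-evaluation.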
## Overview
This skill provides an end-to-end framework for evaluating AI systems, covering model selection, benchmarking, cost and latency analysis, and build-vs-buy trade-offs. It guides teams to define requirements, run domain-specific benchmarks, and balance quality, cost, and deployment constraints. The goal is to make evaluation repeatable and decision-focused for production use.
The skill inspects task requirements, filters candidate models (API vs self-hosted, open vs proprietary), and runs benchmarks on your data using public and custom datasets. It measures generation quality (factuality, coherence, relevance, fluency) and performance metrics (time-to-first-token, throughput, cost per token). Results feed a decision matrix that weighs quality, latency, cost, and operational constraints to produce a recommended path.
## FAQ
### How many samples should I use for evaluation?
Aim for 100+ representative examples per use case. More examples improve statistical confidence, but prioritize diversity and realism over sheer volume.
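To see what a given sample size buys in statistical confidence, a confidence interval on the measured pass rate is a useful check. The sketch below uses the Wilson score interval. The function name is illustrative, and the default z = 1.96 corresponds to roughly 95% coverage.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval (low, high) for a pass rate of successes / n."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width
```

With 80 of 100 examples passing, the interval is roughly 0.71 to 0.87, so a two-point gap between models at that sample size is well within noise.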
### Which benchmarks are most useful?
Use public benchmarks as a reference (MMLU, HumanEval, GSM-8K, MT-Bench), but prioritize evaluations on your own domain data to avoid contamination and overfitting.