
performance-monitor-skill skill

/performance-monitor-skill

This skill helps you monitor, benchmark, and optimize AI agent performance, tracking token usage, latency, costs, and evaluation quality.

npx playbooks add skill 404kidwiz/claude-supercode-skills --skill performance-monitor-skill

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
3.4 KB
---
name: performance-monitor
description: Expert in observing, benchmarking, and optimizing AI agents. Specializes in token usage tracking, latency analysis, and quality evaluation metrics. Use when optimizing agent costs, measuring performance, or implementing evals. Triggers include "agent performance", "token usage", "latency optimization", "eval", "agent metrics", "cost optimization", "agent benchmarking".
---

# Performance Monitor

## Purpose
Provides expertise in monitoring, benchmarking, and optimizing AI agent performance. Specializes in token usage tracking, latency analysis, cost optimization, and implementing quality evaluation metrics (evals) for AI systems.

## When to Use
- Tracking token usage and costs for AI agents
- Measuring and optimizing agent latency
- Implementing evaluation metrics (evals)
- Benchmarking agent quality and accuracy
- Optimizing agent cost efficiency
- Building observability for AI pipelines
- Analyzing agent conversation patterns
- Setting up A/B testing for agents

## Quick Start
**Invoke this skill when:**
- Optimizing AI agent costs and token usage
- Measuring agent latency and performance
- Implementing evaluation frameworks
- Building observability for AI systems
- Benchmarking agent quality

**Do NOT invoke when:**
- General application performance → use `/performance-engineer`
- Infrastructure monitoring → use `/sre-engineer`
- ML model training optimization → use `/ml-engineer`
- Prompt design → use `/prompt-engineer`

## Decision Framework
```
Optimization Goal?
├── Cost Reduction
│   ├── Token usage → Prompt optimization
│   └── API calls → Caching, batching
├── Latency
│   ├── Time to first token → Streaming
│   └── Total response time → Model selection
├── Quality
│   ├── Accuracy → Evals with ground truth
│   └── Consistency → Multiple run analysis
└── Reliability
    └── Error rates, retry patterns
```

## Core Workflows

### 1. Token Usage Tracking
1. Instrument API calls to capture usage
2. Track input vs output tokens separately
3. Aggregate by agent, task, user
4. Calculate costs per operation
5. Build dashboards for visibility
6. Set alerts for anomalous usage
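
A minimal sketch of the tracking steps above, assuming hypothetical per-million-token prices and usage numbers read from each API response (model name, field names, and rates are illustrative, not any specific provider's):

```python
from dataclasses import dataclass, field
from collections import defaultdict

# Hypothetical prices in USD per million tokens; substitute your model's real rates.
PRICE_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}

@dataclass
class TokenTracker:
    """Aggregates token usage and estimated cost per (agent, task) pair."""
    totals: dict = field(default_factory=lambda: defaultdict(
        lambda: {"input": 0, "output": 0, "cost_usd": 0.0}))

    def record(self, agent: str, task: str, model: str,
               input_tokens: int, output_tokens: int) -> float:
        """Record one API call and return its estimated cost in USD."""
        prices = PRICE_PER_MTOK[model]
        cost = (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000
        bucket = self.totals[(agent, task)]
        bucket["input"] += input_tokens
        bucket["output"] += output_tokens
        bucket["cost_usd"] += cost
        return cost

tracker = TokenTracker()
# In real instrumentation these counts come from the response's usage fields.
tracker.record(agent="support-bot", task="triage", model="example-model",
               input_tokens=1200, output_tokens=350)
```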

### 2. Eval Framework Setup
1. Define evaluation criteria
2. Create test dataset with expected outputs
3. Implement scoring functions
4. Run automated eval pipeline
5. Track scores over time
6. Use for regression testing
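
A minimal eval harness along these lines, where `agent` is whatever callable invokes the system under test and exact-match scoring stands in for a real scorer:

```python
from statistics import mean
from typing import Callable

# Tiny ground-truth dataset; real evals need a larger, representative set.
TEST_CASES = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "What is the capital of France?", "expected": "Paris"},
]

def exact_match(output: str, expected: str) -> float:
    """Simplest possible scorer; swap in fuzzy or model-graded scoring as needed."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def run_evals(agent: Callable[[str], str], scorer=exact_match) -> float:
    """Run every test case through the agent and return the mean score."""
    scores = [scorer(agent(case["input"]), case["expected"]) for case in TEST_CASES]
    return mean(scores)

# Usage: pass the function that calls your agent, then store the score
# per prompt version or commit so regressions show up over time, e.g.
#   score = run_evals(lambda prompt: my_agent.respond(prompt))
```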

### 3. Latency Optimization
1. Measure baseline latency
2. Identify bottlenecks (model, network, parsing)
3. Implement streaming where applicable
4. Optimize prompt length
5. Consider model size tradeoffs
6. Add caching for repeated queries
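
A sketch of measuring time-to-first-token and total latency around a streaming call; `stream_agent` below is a stand-in generator, not a real SDK call:

```python
import time
from typing import Iterator

def stream_agent(prompt: str) -> Iterator[str]:
    """Stand-in for a streaming agent call that yields text chunks."""
    yield from ("Hello", ", ", "world")

def timed_stream(prompt: str) -> dict:
    """Measure time-to-first-token and total latency for one streamed response."""
    start = time.perf_counter()
    ttft = None
    chunks = []
    for chunk in stream_agent(prompt):
        if ttft is None:
            ttft = time.perf_counter() - start  # first chunk arrived
        chunks.append(chunk)
    return {
        "ttft_s": ttft,
        "total_s": time.perf_counter() - start,
        "response": "".join(chunks),
    }
```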

## Best Practices
- Track tokens separately from API call counts
- Implement evals before optimizing
- Use percentiles (p50, p95, p99) rather than averages for latency
- Log prompts and responses for debugging
- Set cost budgets and alerts
- Version prompts and track performance per version
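
For the percentile point above, the standard library is enough; a sketch assuming per-request latencies have already been collected in seconds:

```python
from statistics import quantiles

# Per-request latencies in seconds, collected from instrumented calls.
latencies_s = [0.42, 0.38, 0.51, 0.47, 2.90, 0.44, 0.40, 0.55, 0.39, 0.48]

pct = quantiles(latencies_s, n=100)  # returns the 1st..99th percentile cut points
p50, p95, p99 = pct[49], pct[94], pct[98]
mean_s = sum(latencies_s) / len(latencies_s)

# The mean looks fine while p95/p99 expose the 2.9 s outlier.
print(f"mean={mean_s:.2f}s p50={p50:.2f}s p95={p95:.2f}s p99={p99:.2f}s")
```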

## Anti-Patterns
| Anti-Pattern | Problem | Correct Approach |
|--------------|---------|------------------|
| No token tracking | Surprise costs | Instrument all calls |
| Optimizing without evals | Quality regression | Measure before optimizing |
| Average-only latency | Hides tail latency | Use percentiles |
| No prompt versioning | Can't correlate changes | Version and track |
| Ignoring caching | Repeated costs | Cache stable responses |

Overview

This skill provides expert monitoring, benchmarking, and optimization for AI agents, focusing on token usage, latency, cost, and quality metrics. It helps teams instrument pipelines, run evals, and implement targeted optimizations to improve cost-efficiency and user experience.

How this skill works

The skill inspects API call telemetry to separate input/output token counts, measures latency percentiles, and collects quality signals via automated evals against ground truth. It aggregates metrics by agent, task, and version, surfaces anomalies and cost hotspots, and recommends fixes such as prompt changes, caching, streaming, or model selection tradeoffs.
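
As a concrete illustration of the aggregation and alerting described above, a sketch that rolls up per-call costs by agent and flags spend against a hypothetical daily budget:

```python
from collections import defaultdict

DAILY_BUDGET_USD = 25.0  # hypothetical per-agent daily budget

# Each record would come from the instrumented call logs.
call_log = [
    {"agent": "support-bot", "task": "triage", "cost_usd": 0.012},
    {"agent": "support-bot", "task": "summarize", "cost_usd": 0.030},
    {"agent": "research-bot", "task": "search", "cost_usd": 0.051},
]

spend_by_agent = defaultdict(float)
for record in call_log:
    spend_by_agent[record["agent"]] += record["cost_usd"]

for agent, spend in sorted(spend_by_agent.items(), key=lambda kv: -kv[1]):
    status = "ALERT" if spend > DAILY_BUDGET_USD else "ok"
    print(f"{status}: {agent} spent {spend:.3f} USD today")
```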

When to use it

  • Tracking token usage and assigning costs per operation
  • Measuring agent latency with p50/p95/p99 percentiles
  • Setting up automated evaluation pipelines and regression tests
  • Benchmarking agent quality, accuracy, and consistency
  • Designing observability and alerts for AI pipelines
  • Planning A/B tests and comparing agent versions

Best practices

  • Instrument all API calls and record input vs output tokens separately
  • Use percentiles (p50/p95/p99) for latency analysis rather than averages
  • Implement evals before and after optimizations to avoid quality regressions
  • Version prompts and agent code to correlate changes with metrics
  • Set cost budgets, alerts, and anomaly detection for token spikes
  • Cache stable responses and batch similar requests when possible

Example use cases

  • Reduce monthly API spend by identifying high-token workflows and optimizing prompts
  • Improve user experience by cutting p95 latency with streaming and model selection
  • Detect regressions by running automated evals after deployments
  • Compare two agent versions with A/B testing and metric dashboards
  • Build dashboards that break down cost by agent, task, and user segment

FAQ

What telemetry should I collect first?

Start with timestamped API call logs, input/output token counts, the model identifier, response time, and a request identifier so you can correlate calls across a single workflow.
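
A single telemetry record along those lines might look like the following (field names and values are illustrative):

```python
# One illustrative telemetry record per API call.
telemetry_record = {
    "timestamp": "2025-01-15T10:32:07Z",
    "request_id": "req_abc123",   # correlates calls within one workflow
    "agent": "support-bot",
    "task": "triage",
    "model": "example-model",
    "prompt_version": "v3",
    "input_tokens": 1200,
    "output_tokens": 350,
    "latency_ms": 840,
}
```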

How do I avoid reducing accuracy when optimizing cost?

Run evals on a representative test set before and after each change, and gate rollouts with explicit regression thresholds (the maximum acceptable score drop) so quality stays within bounds.
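
A sketch of that before/after gate, assuming scores come from an eval harness and using a hypothetical tolerance of two percentage points:

```python
MAX_SCORE_DROP = 0.02  # hypothetical tolerance: allow at most a 2-point quality drop

def within_bounds(baseline_score: float, candidate_score: float) -> bool:
    """Return True when the optimized version stays within the acceptable bound."""
    return (baseline_score - candidate_score) <= MAX_SCORE_DROP

# Usage: run the eval suite on both versions, then gate the rollout, e.g.
#   ok = within_bounds(baseline_score=0.91, candidate_score=0.90)
```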