performance-benchmark-suite skill

/plugins/babysitter/skills/babysit/process/specializations/sdk-platform-development/skills/performance-benchmark-suite

This skill benchmarks SDK performance across versions, detects regressions, and generates visual reports to guide optimization and release decisions.

npx playbooks add skill a5c-ai/babysitter --skill performance-benchmark-suite

SKILL.md

---
name: performance-benchmark-suite
description: SDK performance benchmarking and regression detection
allowed-tools:
  - Read
  - Write
  - Edit
  - Glob
  - Grep
  - Bash
---

# Performance Benchmark Suite Skill

## Overview

This skill implements comprehensive SDK performance benchmarking: it tracks latency, throughput, and memory usage, and detects performance regressions across versions.

## Capabilities

- Measure latency percentiles (p50, p95, p99)
- Track memory usage and allocation patterns
- Detect performance regressions automatically
- Generate visual benchmark reports
- Compare performance across SDK versions
- Implement microbenchmarks for critical paths
- Configure continuous benchmarking in CI
- Support load testing scenarios
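
As an illustration of the first capability, here is a minimal sketch of measuring latency percentiles with the Python standard library only. The function names and `operation_under_test` are hypothetical stand-ins, not part of the skill; a real suite would more likely delegate this to one of the tools listed under Integration Points.

```python
import statistics
import time


def measure_latencies(fn, iterations=200, warmup=20):
    """Call fn repeatedly and return per-call latencies in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are discarded to reduce cold-start noise
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def latency_percentiles(samples_ms):
    """Return the p50, p95 and p99 of a list of latency samples."""
    # quantiles(n=100) yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def operation_under_test():
    # Hypothetical stand-in for an SDK call on a critical path.
    sum(i * i for i in range(1000))


if __name__ == "__main__":
    samples = measure_latencies(operation_under_test)
    print(latency_percentiles(samples))
```

The warm-up loop and the use of `time.perf_counter` are design choices of this sketch, made to keep the measurements comparable between runs.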

## Target Processes

- Performance Benchmarking
- SDK Testing Strategy
- SDK Versioning and Release Management

## Integration Points

- k6 for load testing
- Artillery for HTTP benchmarking
- hyperfine for CLI benchmarking
- Benchmark.js for JavaScript
- pytest-benchmark for Python
- Continuous benchmark systems (Bencher)

## Input Requirements

- Performance requirements (SLOs)
- Benchmark scenarios
- Baseline versions for comparison
- Environment specifications
- Reporting requirements

## Output Artifacts

- Benchmark test suite
- Performance baseline data
- Regression detection rules
- Visual benchmark reports
- CI benchmark configuration
- Historical trend analysis

## Usage Example

```yaml
skill:
  name: performance-benchmark-suite
  context:
    tool: k6
    scenarios:
      - name: basic-crud
        operations: ["create", "read", "update", "delete"]
        vus: 10
        duration: "30s"
      - name: high-load
        vus: 100
        duration: "5m"
    slos:
      p95_latency: "100ms"
      p99_latency: "500ms"
      error_rate: "0.1%"
    compareWith: "v1.0.0"
    regressionThreshold: "10%"
```
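
The `slos` block above uses string values such as `"100ms"` and `"0.1%"`. How the skill interprets them is not specified here; the sketch below shows one plausible way to parse such values and check measured results against them. All function names and the shape of the `measured` dictionary are assumptions made for illustration.

```python
def parse_duration_ms(value):
    """Convert strings such as "100ms" or "1.5s" to milliseconds."""
    if value.endswith("ms"):
        return float(value[:-2])
    if value.endswith("s"):
        return float(value[:-1]) * 1000.0
    raise ValueError(f"unsupported duration: {value!r}")


def parse_percent(value):
    """Convert a string such as "0.1%" to a fraction (0.001)."""
    if not value.endswith("%"):
        raise ValueError(f"unsupported percentage: {value!r}")
    return float(value[:-1]) / 100.0


def check_slos(slos, measured):
    """Return a list of violation messages; an empty list means all SLOs hold.

    Assumed input shape: latencies in milliseconds, error_rate as a fraction.
    """
    violations = []
    for key in ("p95_latency", "p99_latency"):
        limit_ms = parse_duration_ms(slos[key])
        if measured[key] > limit_ms:
            violations.append(f"{key}: {measured[key]}ms > {limit_ms}ms")
    max_error_rate = parse_percent(slos["error_rate"])
    if measured["error_rate"] > max_error_rate:
        violations.append(
            f"error_rate: {measured['error_rate']} > {max_error_rate}"
        )
    return violations


if __name__ == "__main__":
    slos = {"p95_latency": "100ms", "p99_latency": "500ms", "error_rate": "0.1%"}
    measured = {"p95_latency": 120.0, "p99_latency": 300.0, "error_rate": 0.0005}
    print(check_slos(slos, measured))
```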

## Best Practices

1. Establish baselines before optimization
2. Track percentiles, not just averages
3. Run benchmarks in consistent environments
4. Automate regression detection in CI
5. Monitor memory alongside latency
6. Document benchmark methodology
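
Practice 5 can be sketched with the standard library's `tracemalloc` module. This is an illustrative assumption about how memory might be sampled in a Python benchmark, not the skill's own method. Timing and memory are measured in separate runs because tracing allocations slows the code under test.

```python
import time
import tracemalloc


def measure_time_ms(fn):
    """Run fn once without tracing and return the elapsed time in milliseconds."""
    start = time.perf_counter()
    fn()
    return (time.perf_counter() - start) * 1000.0


def measure_peak_bytes(fn):
    """Run fn once under tracemalloc and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def build_payload():
    # Hypothetical workload standing in for an SDK operation that allocates memory.
    return [0] * 100_000


if __name__ == "__main__":
    print("elapsed ms:", measure_time_ms(build_payload))
    print("peak bytes:", measure_peak_bytes(build_payload))
```

Note that `tracemalloc` only sees allocations made through Python's memory allocator, so memory held by native extensions may not be counted.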

Overview

This skill implements an SDK performance benchmarking and regression detection suite. It measures latency, throughput, and memory usage, and generates visual reports that compare SDK versions. The suite is designed to run locally or in CI and to alert on regressions against established baselines.

How this skill works

The skill runs microbenchmarks and load tests using tools like Benchmark.js, k6, Artillery, and hyperfine, collecting latency percentiles (p50, p95, p99), throughput, and memory metrics. It stores baseline data, applies configurable regression rules, and produces trend reports and CI-friendly pass/fail outputs. Visual reports and historical analysis help pinpoint regressions and performance hotspots.
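
The baseline storage step could look like the following minimal sketch, which keeps one JSON file keyed by version. The file layout, the function names, and the metric names are assumptions for illustration; the skill does not prescribe a storage format.

```python
import json
from pathlib import Path


def save_baseline(path, version, metrics):
    """Store metrics for a version, preserving entries for other versions."""
    path = Path(path)
    data = json.loads(path.read_text()) if path.exists() else {}
    data[version] = metrics
    path.write_text(json.dumps(data, indent=2, sort_keys=True))


def load_baseline(path, version):
    """Return the stored metrics for a version; raises KeyError if absent."""
    data = json.loads(Path(path).read_text())
    return data[version]


if __name__ == "__main__":
    # Hypothetical file name and metric values.
    save_baseline("baselines.json", "v1.0.0", {"p95_ms": 92.0, "p99_ms": 410.0})
    print(load_baseline("baselines.json", "v1.0.0"))
```

Keeping baselines in a plain, versioned file makes it easy to commit them alongside the benchmark code, which helps reproducibility.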

When to use it

- Before and after SDK releases to detect regressions
- When validating performance impact of code changes or refactors
- To establish and track performance baselines and SLO compliance
- During CI runs for automated guardrails against slowdowns
- For load testing critical API paths and client workflows

Best practices

- Define clear SLOs and regression thresholds before benchmarking
- Record percentiles (p50/p95/p99) and memory metrics, not just averages
- Run benchmarks in consistent, isolated environments to reduce noise
- Automate benchmark execution and regression checks in CI pipelines
- Keep baseline versions and methodology documented for reproducibility

Example use cases

- Compare p95 latency between v1.0.0 and the current branch to detect regressions
- Run k6 scenarios for CRUD operations to validate behavior under load
- Use hyperfine to benchmark CLI command performance after optimization
- Integrate Benchmark.js microbenchmarks to measure hot-path functions
- Fail CI when p99 latency regresses by more than a configured percentage

FAQ

What inputs are needed to run the suite?

Provide benchmark scenarios, performance requirements (SLOs), baseline versions for comparison, environment specifications, and reporting requirements.

How are regressions detected?

The suite compares current results to baseline data and flags a failure when a metric degrades by more than the configured regression threshold (for example, `regressionThreshold: "10%"` in the usage example).
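
A threshold comparison of this kind can be sketched as follows. This is an assumed, simplified rule rather than the suite's actual implementation: it treats every metric as "lower is better" (latency, memory), so a throughput metric would need the sign reversed. The function and metric names are hypothetical.

```python
def detect_regressions(baseline, current, threshold="10%"):
    """Return {metric: relative_change} for metrics that grew by more than threshold.

    Assumes lower values are better for every metric in baseline.
    """
    limit = float(threshold.rstrip("%")) / 100.0
    regressions = {}
    for name, base in baseline.items():
        if name not in current or base <= 0:
            continue  # skip metrics that cannot be compared
        change = (current[name] - base) / base
        if change > limit:
            regressions[name] = change
    return regressions


if __name__ == "__main__":
    # Hypothetical metric values for a baseline and a candidate build.
    baseline = {"p95_ms": 100.0, "p99_ms": 400.0}
    current = {"p95_ms": 105.0, "p99_ms": 480.0}
    print(detect_regressions(baseline, current, "10%"))
```

In a CI job, a non-empty result would typically be turned into a non-zero exit code so that the pipeline fails.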