home / skills / ruvnet / ruflo / agent-benchmark-suite

agent-benchmark-suite skill

safe

This skill analyzes and benchmarks multi-agent swarm performance, detects regressions, and generates actionable optimization recommendations to improve

npx playbooks add skill ruvnet/ruflo --skill agent-benchmark-suite

Review the files below or copy the command above to add this skill to your agents.

Files (1)

SKILL.md

19.3 KB

---
name: agent-benchmark-suite
description: Agent skill for benchmark-suite - invoke with $agent-benchmark-suite
---

---
name: Benchmark Suite
type: agent
category: optimization
description: Comprehensive performance benchmarking, regression detection and performance validation
---

# Benchmark Suite Agent

## Agent Profile
- **Name**: Benchmark Suite
- **Type**: Performance Optimization Agent
- **Specialization**: Comprehensive performance benchmarking and testing
- **Performance Focus**: Automated benchmarking, regression detection, and performance validation

## Core Capabilities

### 1. Comprehensive Benchmarking Framework
```javascript
// Advanced benchmarking system
class ComprehensiveBenchmarkSuite {
  constructor() {
    this.benchmarks = {
      // Core performance benchmarks
      throughput: new ThroughputBenchmark(),
      latency: new LatencyBenchmark(),
      scalability: new ScalabilityBenchmark(),
      resource_usage: new ResourceUsageBenchmark(),
      
      // Swarm-specific benchmarks
      coordination: new CoordinationBenchmark(),
      load_balancing: new LoadBalancingBenchmark(),
      topology: new TopologyBenchmark(),
      fault_tolerance: new FaultToleranceBenchmark(),
      
      // Custom benchmarks
      custom: new CustomBenchmarkManager()
    };
    
    this.reporter = new BenchmarkReporter();
    this.comparator = new PerformanceComparator();
    this.analyzer = new BenchmarkAnalyzer();
  }
  
  // Execute comprehensive benchmark suite
  async runBenchmarkSuite(config = {}) {
    const suiteConfig = {
      duration: config.duration || 300000, // 5 minutes default
      iterations: config.iterations || 10,
      warmupTime: config.warmupTime || 30000, // 30 seconds
      cooldownTime: config.cooldownTime || 10000, // 10 seconds
      parallel: config.parallel || false,
      baseline: config.baseline || null
    };
    
    const results = {
      summary: {},
      detailed: new Map(),
      baseline_comparison: null,
      recommendations: []
    };
    
    // Warmup phase
    await this.warmup(suiteConfig.warmupTime);
    
    // Execute benchmarks
    if (suiteConfig.parallel) {
      results.detailed = await this.runBenchmarksParallel(suiteConfig);
    } else {
      results.detailed = await this.runBenchmarksSequential(suiteConfig);
    }
    
    // Generate summary
    results.summary = this.generateSummary(results.detailed);
    
    // Compare with baseline if provided
    if (suiteConfig.baseline) {
      results.baseline_comparison = await this.compareWithBaseline(
        results.detailed, 
        suiteConfig.baseline
      );
    }
    
    // Generate recommendations
    results.recommendations = await this.generateRecommendations(results);
    
    // Cooldown phase
    await this.cooldown(suiteConfig.cooldownTime);
    
    return results;
  }
  
  // Parallel benchmark execution
  async runBenchmarksParallel(config) {
    const benchmarkPromises = Object.entries(this.benchmarks).map(
      async ([name, benchmark]) => {
        const result = await this.executeBenchmark(benchmark, name, config);
        return [name, result];
      }
    );
    
    const results = await Promise.all(benchmarkPromises);
    return new Map(results);
  }
  
  // Sequential benchmark execution
  async runBenchmarksSequential(config) {
    const results = new Map();
    
    for (const [name, benchmark] of Object.entries(this.benchmarks)) {
      const result = await this.executeBenchmark(benchmark, name, config);
      results.set(name, result);
      
      // Brief pause between benchmarks
      await this.sleep(1000);
    }
    
    return results;
  }
}
```

### 2. Performance Regression Detection
```javascript
// Advanced regression detection system
class RegressionDetector {
  constructor() {
    this.detectors = {
      statistical: new StatisticalRegressionDetector(),
      machine_learning: new MLRegressionDetector(),
      threshold: new ThresholdRegressionDetector(),
      trend: new TrendRegressionDetector()
    };
    
    this.analyzer = new RegressionAnalyzer();
    this.alerting = new RegressionAlerting();
  }
  
  // Detect performance regressions
  async detectRegressions(currentResults, historicalData, config = {}) {
    const regressions = {
      detected: [],
      severity: 'none',
      confidence: 0,
      analysis: {}
    };
    
    // Run multiple detection algorithms
    const detectionPromises = Object.entries(this.detectors).map(
      async ([method, detector]) => {
        const detection = await detector.detect(currentResults, historicalData, config);
        return [method, detection];
      }
    );
    
    const detectionResults = await Promise.all(detectionPromises);
    
    // Aggregate detection results
    for (const [method, detection] of detectionResults) {
      if (detection.regression_detected) {
        regressions.detected.push({
          method,
          ...detection
        });
      }
    }
    
    // Calculate overall confidence and severity
    if (regressions.detected.length > 0) {
      regressions.confidence = this.calculateAggregateConfidence(regressions.detected);
      regressions.severity = this.calculateSeverity(regressions.detected);
      regressions.analysis = await this.analyzer.analyze(regressions.detected);
    }
    
    return regressions;
  }
  
  // Statistical regression detection using change point analysis
  async detectStatisticalRegression(metric, historicalData, sensitivity = 0.95) {
    // Use CUSUM (Cumulative Sum) algorithm for change point detection
    const cusum = this.calculateCUSUM(metric, historicalData);
    
    // Detect change points
    const changePoints = this.detectChangePoints(cusum, sensitivity);
    
    // Analyze significance of changes
    const analysis = changePoints.map(point => ({
      timestamp: point.timestamp,
      magnitude: point.magnitude,
      direction: point.direction,
      significance: point.significance,
      confidence: point.confidence
    }));
    
    return {
      regression_detected: changePoints.length > 0,
      change_points: analysis,
      cusum_statistics: cusum.statistics,
      sensitivity: sensitivity
    };
  }
  
  // Machine learning-based regression detection
  async detectMLRegression(metrics, historicalData) {
    // Train anomaly detection model on historical data
    const model = await this.trainAnomalyModel(historicalData);
    
    // Predict anomaly scores for current metrics
    const anomalyScores = await model.predict(metrics);
    
    // Identify regressions based on anomaly scores
    const threshold = this.calculateDynamicThreshold(anomalyScores);
    const regressions = anomalyScores.filter(score => score.anomaly > threshold);
    
    return {
      regression_detected: regressions.length > 0,
      anomaly_scores: anomalyScores,
      threshold: threshold,
      regressions: regressions,
      model_confidence: model.confidence
    };
  }
}
```

### 3. Automated Performance Testing
```javascript
// Comprehensive automated performance testing
class AutomatedPerformanceTester {
  constructor() {
    this.testSuites = {
      load: new LoadTestSuite(),
      stress: new StressTestSuite(),
      volume: new VolumeTestSuite(),
      endurance: new EnduranceTestSuite(),
      spike: new SpikeTestSuite(),
      configuration: new ConfigurationTestSuite()
    };
    
    this.scheduler = new TestScheduler();
    this.orchestrator = new TestOrchestrator();
    this.validator = new ResultValidator();
  }
  
  // Execute automated performance test campaign
  async runTestCampaign(config) {
    const campaign = {
      id: this.generateCampaignId(),
      config,
      startTime: Date.now(),
      tests: [],
      results: new Map(),
      summary: null
    };
    
    // Schedule test execution
    const schedule = await this.scheduler.schedule(config.tests, config.constraints);
    
    // Execute tests according to schedule
    for (const scheduledTest of schedule) {
      const testResult = await this.executeScheduledTest(scheduledTest);
      campaign.tests.push(scheduledTest);
      campaign.results.set(scheduledTest.id, testResult);
      
      // Validate results in real-time
      const validation = await this.validator.validate(testResult);
      if (!validation.valid) {
        campaign.summary = {
          status: 'failed',
          reason: validation.reason,
          failedAt: scheduledTest.name
        };
        break;
      }
    }
    
    // Generate campaign summary
    if (!campaign.summary) {
      campaign.summary = await this.generateCampaignSummary(campaign);
    }
    
    campaign.endTime = Date.now();
    campaign.duration = campaign.endTime - campaign.startTime;
    
    return campaign;
  }
  
  // Load testing with gradual ramp-up
  async executeLoadTest(config) {
    const loadTest = {
      type: 'load',
      config,
      phases: [],
      metrics: new Map(),
      results: {}
    };
    
    // Ramp-up phase
    const rampUpResult = await this.executeRampUp(config.rampUp);
    loadTest.phases.push({ phase: 'ramp-up', result: rampUpResult });
    
    // Sustained load phase
    const sustainedResult = await this.executeSustainedLoad(config.sustained);
    loadTest.phases.push({ phase: 'sustained', result: sustainedResult });
    
    // Ramp-down phase
    const rampDownResult = await this.executeRampDown(config.rampDown);
    loadTest.phases.push({ phase: 'ramp-down', result: rampDownResult });
    
    // Analyze results
    loadTest.results = await this.analyzeLoadTestResults(loadTest.phases);
    
    return loadTest;
  }
  
  // Stress testing to find breaking points
  async executeStressTest(config) {
    const stressTest = {
      type: 'stress',
      config,
      breakingPoint: null,
      degradationCurve: [],
      results: {}
    };
    
    let currentLoad = config.startLoad;
    let systemBroken = false;
    
    while (!systemBroken && currentLoad <= config.maxLoad) {
      const testResult = await this.applyLoad(currentLoad, config.duration);
      
      stressTest.degradationCurve.push({
        load: currentLoad,
        performance: testResult.performance,
        stability: testResult.stability,
        errors: testResult.errors
      });
      
      // Check if system is breaking
      if (this.isSystemBreaking(testResult, config.breakingCriteria)) {
        stressTest.breakingPoint = {
          load: currentLoad,
          performance: testResult.performance,
          reason: this.identifyBreakingReason(testResult)
        };
        systemBroken = true;
      }
      
      currentLoad += config.loadIncrement;
    }
    
    stressTest.results = await this.analyzeStressTestResults(stressTest);
    
    return stressTest;
  }
}
```

### 4. Performance Validation Framework
```javascript
// Comprehensive performance validation
class PerformanceValidator {
  constructor() {
    this.validators = {
      sla: new SLAValidator(),
      regression: new RegressionValidator(),
      scalability: new ScalabilityValidator(),
      reliability: new ReliabilityValidator(),
      efficiency: new EfficiencyValidator()
    };
    
    this.thresholds = new ThresholdManager();
    this.rules = new ValidationRuleEngine();
  }
  
  // Validate performance against defined criteria
  async validatePerformance(results, criteria) {
    const validation = {
      overall: {
        passed: true,
        score: 0,
        violations: []
      },
      detailed: new Map(),
      recommendations: []
    };
    
    // Run all validators
    const validationPromises = Object.entries(this.validators).map(
      async ([type, validator]) => {
        const result = await validator.validate(results, criteria[type]);
        return [type, result];
      }
    );
    
    const validationResults = await Promise.all(validationPromises);
    
    // Aggregate validation results
    for (const [type, result] of validationResults) {
      validation.detailed.set(type, result);
      
      if (!result.passed) {
        validation.overall.passed = false;
        validation.overall.violations.push(...result.violations);
      }
      
      validation.overall.score += result.score * (criteria[type]?.weight || 1);
    }
    
    // Normalize overall score
    const totalWeight = Object.values(criteria).reduce((sum, c) => sum + (c.weight || 1), 0);
    validation.overall.score /= totalWeight;
    
    // Generate recommendations
    validation.recommendations = await this.generateValidationRecommendations(validation);
    
    return validation;
  }
  
  // SLA validation
  async validateSLA(results, slaConfig) {
    const slaValidation = {
      passed: true,
      violations: [],
      score: 1.0,
      metrics: {}
    };
    
    // Validate each SLA metric
    for (const [metric, threshold] of Object.entries(slaConfig.thresholds)) {
      const actualValue = this.extractMetricValue(results, metric);
      const validation = this.validateThreshold(actualValue, threshold);
      
      slaValidation.metrics[metric] = {
        actual: actualValue,
        threshold: threshold.value,
        operator: threshold.operator,
        passed: validation.passed,
        deviation: validation.deviation
      };
      
      if (!validation.passed) {
        slaValidation.passed = false;
        slaValidation.violations.push({
          metric,
          actual: actualValue,
          expected: threshold.value,
          severity: threshold.severity || 'medium'
        });
        
        // Reduce score based on violation severity
        const severityMultiplier = this.getSeverityMultiplier(threshold.severity);
        slaValidation.score -= (validation.deviation * severityMultiplier);
      }
    }
    
    slaValidation.score = Math.max(0, slaValidation.score);
    
    return slaValidation;
  }
  
  // Scalability validation
  async validateScalability(results, scalabilityConfig) {
    const scalabilityValidation = {
      passed: true,
      violations: [],
      score: 1.0,
      analysis: {}
    };
    
    // Linear scalability analysis
    if (scalabilityConfig.linear) {
      const linearityAnalysis = this.analyzeLinearScalability(results);
      scalabilityValidation.analysis.linearity = linearityAnalysis;
      
      if (linearityAnalysis.coefficient < scalabilityConfig.linear.minCoefficient) {
        scalabilityValidation.passed = false;
        scalabilityValidation.violations.push({
          type: 'linearity',
          actual: linearityAnalysis.coefficient,
          expected: scalabilityConfig.linear.minCoefficient
        });
      }
    }
    
    // Efficiency retention analysis
    if (scalabilityConfig.efficiency) {
      const efficiencyAnalysis = this.analyzeEfficiencyRetention(results);
      scalabilityValidation.analysis.efficiency = efficiencyAnalysis;
      
      if (efficiencyAnalysis.retention < scalabilityConfig.efficiency.minRetention) {
        scalabilityValidation.passed = false;
        scalabilityValidation.violations.push({
          type: 'efficiency_retention',
          actual: efficiencyAnalysis.retention,
          expected: scalabilityConfig.efficiency.minRetention
        });
      }
    }
    
    return scalabilityValidation;
  }
}
```

## MCP Integration Hooks

### Benchmark Execution Integration
```javascript
// Comprehensive MCP benchmark integration
const benchmarkIntegration = {
  // Execute performance benchmarks
  async runBenchmarks(config = {}) {
    // Run benchmark suite
    const benchmarkResult = await mcp.benchmark_run({
      suite: config.suite || 'comprehensive'
    });
    
    // Collect detailed metrics during benchmarking
    const metrics = await mcp.metrics_collect({
      components: ['system', 'agents', 'coordination', 'memory']
    });
    
    // Analyze performance trends
    const trends = await mcp.trend_analysis({
      metric: 'performance',
      period: '24h'
    });
    
    // Cost analysis
    const costAnalysis = await mcp.cost_analysis({
      timeframe: '24h'
    });
    
    return {
      benchmark: benchmarkResult,
      metrics,
      trends,
      costAnalysis,
      timestamp: Date.now()
    };
  },
  
  // Quality assessment
  async assessQuality(criteria) {
    const qualityAssessment = await mcp.quality_assess({
      target: 'swarm-performance',
      criteria: criteria || [
        'throughput',
        'latency',
        'reliability',
        'scalability',
        'efficiency'
      ]
    });
    
    return qualityAssessment;
  },
  
  // Error pattern analysis
  async analyzeErrorPatterns() {
    // Collect system logs
    const logs = await this.collectSystemLogs();
    
    // Analyze error patterns
    const errorAnalysis = await mcp.error_analysis({
      logs: logs
    });
    
    return errorAnalysis;
  }
};
```

## Operational Commands

### Benchmarking Commands
```bash
# Run comprehensive benchmark suite
npx claude-flow benchmark-run --suite comprehensive --duration 300

# Execute specific benchmark
npx claude-flow benchmark-run --suite throughput --iterations 10

# Compare with baseline
npx claude-flow benchmark-compare --current <results> --baseline <baseline>

# Quality assessment
npx claude-flow quality-assess --target swarm-performance --criteria throughput,latency

# Performance validation
npx claude-flow validate-performance --results <file> --criteria <file>
```

### Regression Detection Commands
```bash
# Detect performance regressions
npx claude-flow detect-regression --current <results> --historical <data>

# Set up automated regression monitoring
npx claude-flow regression-monitor --enable --sensitivity 0.95

# Analyze error patterns
npx claude-flow error-analysis --logs <log-files>
```

## Integration Points

### With Other Optimization Agents
- **Performance Monitor**: Provides continuous monitoring data for benchmarking
- **Load Balancer**: Validates load balancing effectiveness through benchmarks
- **Topology Optimizer**: Tests topology configurations for optimal performance

### With CI/CD Pipeline
- **Automated Testing**: Integrates with CI/CD for continuous performance validation
- **Quality Gates**: Provides pass$fail criteria for deployment decisions
- **Regression Prevention**: Catches performance regressions before production

## Performance Benchmarks

### Standard Benchmark Suite
```javascript
// Comprehensive benchmark definitions
const standardBenchmarks = {
  // Throughput benchmarks
  throughput: {
    name: 'Throughput Benchmark',
    metrics: ['requests_per_second', 'tasks_per_second', 'messages_per_second'],
    duration: 300000, // 5 minutes
    warmup: 30000,    // 30 seconds
    targets: {
      requests_per_second: { min: 1000, optimal: 5000 },
      tasks_per_second: { min: 100, optimal: 500 },
      messages_per_second: { min: 10000, optimal: 50000 }
    }
  },
  
  // Latency benchmarks
  latency: {
    name: 'Latency Benchmark',
    metrics: ['p50', 'p90', 'p95', 'p99', 'max'],
    duration: 300000,
    targets: {
      p50: { max: 100 },   // 100ms
      p90: { max: 200 },   // 200ms
      p95: { max: 500 },   // 500ms
      p99: { max: 1000 },  // 1s
      max: { max: 5000 }   // 5s
    }
  },
  
  // Scalability benchmarks
  scalability: {
    name: 'Scalability Benchmark',
    metrics: ['linear_coefficient', 'efficiency_retention'],
    load_points: [1, 2, 4, 8, 16, 32, 64],
    targets: {
      linear_coefficient: { min: 0.8 },
      efficiency_retention: { min: 0.7 }
    }
  }
};
```

This Benchmark Suite agent provides comprehensive automated performance testing, regression detection, and validation capabilities to ensure optimal swarm performance and prevent performance degradation.

Overview

This skill provides a full-featured benchmark suite agent for performance optimization, regression detection, and validation. It automates end-to-end benchmarking campaigns, compares results to baselines, and produces actionable recommendations. The agent is designed for multi-agent swarm environments and distributed systems where throughput, latency, scalability, and fault tolerance matter.

How this skill works

The agent runs configured benchmark suites (throughput, latency, scalability, resource usage, swarm coordination, fault tolerance, and custom tests) either sequentially or in parallel, with warmup and cooldown phases. It aggregates detailed results, generates summaries, and compares outcomes against baselines. Built-in regression detectors use statistical methods and ML anomaly models to flag regressions and estimate severity. Automated test campaigns, validators, and a recommendation engine produce remediation guidance and validation reports.

When to use it

Before or after deploying swarm-based services to validate performance against SLAs.
When introducing code changes, infrastructure updates, or model upgrades to detect regressions.
To run scheduled or continuous performance campaigns for capacity planning.
When testing fault tolerance, load balancing, and coordination behavior in multi-agent systems.
For validating scalability and discovering breaking points with stress and endurance tests.

Best practices

Provide a representative baseline dataset and historical metrics for accurate regression detection.
Use warmup and cooldown windows to remove transient noise from measurements.
Run repeatable campaigns with fixed seeds and configuration snapshots for reliable comparisons.
Start with sequential runs for stability, then enable parallel mode to stress orchestration layers.
Combine statistical and ML detectors to balance sensitivity and false-positive control.

Example use cases

Validate that a new Claude Code agent release preserves throughput and latency compared to the last stable baseline.
Execute a stress campaign to identify the breaking point and generate a degradation curve for capacity planning.
Run nightly automated performance campaigns that detect regressions and open alerts when severity thresholds are exceeded.
Benchmark coordination and load-balancing strategies across a distributed swarm to choose the best topology.
Validate SLA compliance for response time and error rates before a production rollout.

FAQ

Can I compare current runs to historical baselines?

Yes. The suite supports baseline comparisons and aggregates differences with confidence and severity scoring.

How are regressions detected?

It uses multiple detectors—statistical change point (CUSUM), ML-based anomaly models, threshold checks, and trend analysis—then aggregates results for confidence and severity.