sre-expert skill

/stdlib/devops/sre-expert

This skill provides expert guidance on SLOs, incident management, and reliability practices to help teams improve site reliability and operational excellence.

npx playbooks add skill personamanagmentlayer/pcl --skill sre-expert

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
13.4 KB
---
name: sre-expert
version: 1.0.0
description: Expert-level site reliability engineering, SLOs, incident management, and operational excellence
category: devops
tags: [sre, reliability, monitoring, incident-management, slo, observability]
allowed-tools:
  - Read
  - Write
  - Edit
  - Bash(*)
---

# Site Reliability Engineering Expert

Expert guidance for SRE practices, reliability engineering, SLOs/SLIs, incident management, and operational excellence.

## Core Concepts

### SRE Fundamentals
- Service Level Objectives (SLOs)
- Service Level Indicators (SLIs)
- Error budgets (see the quick arithmetic sketch after this list)
- Toil reduction
- Monitoring and alerting
- Capacity planning
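
A quick back-of-the-envelope sketch of the error-budget arithmetic (the numbers are illustrative):

```python
def error_budget_minutes(slo_target_pct: float, window_days: int) -> float:
    """Allowed downtime, in minutes, for a percentage availability SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_target_pct) / 100

# A 99.9% SLO over a 30-day window allows ~43.2 minutes of downtime
print(error_budget_minutes(99.9, 30))  # 43.2
```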

### Reliability Practices
- Incident management
- Post-incident reviews (PIRs)
- On-call rotations
- Chaos engineering
- Disaster recovery
- Change management

### Automation
- Infrastructure as Code
- Configuration management
- Deployment automation
- Self-healing systems
- Runbook automation
- Automated remediation (see the sketch after this list)
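
A minimal, hypothetical sketch of automated remediation (the disk-pressure scenario and the journalctl vacuum step are illustrative, not a prescribed runbook):

```python
import shutil
import subprocess

DISK_USAGE_THRESHOLD = 0.90  # illustrative threshold

def remediate_disk_pressure(path: str = "/var/log") -> bool:
    """Toy self-healing step: reclaim space when a volume fills up.

    Real runbook automation would log every action it takes and
    escalate to a human when remediation fails.
    """
    usage = shutil.disk_usage(path)
    if usage.used / usage.total < DISK_USAGE_THRESHOLD:
        return False  # nothing to do

    # Hypothetical remediation: drop journald logs older than two days
    subprocess.run(["journalctl", "--vacuum-time=2d"], check=False)
    return True
```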

## SLO/SLI Management

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Dict
import numpy as np

@dataclass
class SLI:
    """Service Level Indicator"""
    name: str
    description: str
    query: str
    unit: str  # 'percentage', 'milliseconds', etc.

@dataclass
class SLO:
    """Service Level Objective"""
    name: str
    sli: SLI
    target: float
    window_days: int
    higher_is_better: bool = True  # False for latency-style SLIs, where lower is better

class SLOTracker:
    """Track and manage SLOs"""

    def __init__(self):
        self.slos: Dict[str, SLO] = {}
        self.measurements: Dict[str, List[Dict]] = {}

    def define_slo(self, slo: SLO):
        """Define a new SLO"""
        self.slos[slo.name] = slo
        self.measurements[slo.name] = []

    def record_measurement(self, slo_name: str, value: float, timestamp: datetime):
        """Record SLI measurement"""
        if slo_name in self.slos:
            self.measurements[slo_name].append({
                'value': value,
                'timestamp': timestamp
            })

    def calculate_slo_compliance(self, slo_name: str) -> Dict:
        """Calculate SLO compliance"""
        slo = self.slos.get(slo_name)
        if not slo:
            return {}

        measurements = self.measurements.get(slo_name, [])
        window_start = datetime.now() - timedelta(days=slo.window_days)

        recent_measurements = [
            m for m in measurements
            if m['timestamp'] > window_start
        ]

        if not recent_measurements:
            return {'status': 'no_data'}

        values = [m['value'] for m in recent_measurements]
        actual = np.mean(values)

        # Compare in the right direction: availability-style SLIs must stay
        # above target; latency-style SLIs must stay below it.
        if slo.higher_is_better:
            compliant = actual >= slo.target
        else:
            compliant = actual <= slo.target

        return {
            'slo_name': slo_name,
            'target': slo.target,
            'actual': actual,
            'compliant': compliant,
            'window_days': slo.window_days,
            'sample_count': len(recent_measurements)
        }

    def calculate_error_budget(self, slo_name: str) -> Dict:
        """Calculate remaining error budget.

        Assumes a percentage-based SLI (e.g. availability) where target
        and actual are both on a 0-100 scale.
        """
        compliance = self.calculate_slo_compliance(slo_name)

        if not compliance or compliance.get('status') == 'no_data':
            return {'status': 'no_data'}

        target = compliance['target']
        actual = compliance['actual']

        error_budget_target = 100 - target
        errors_actual = 100 - actual

        remaining = error_budget_target - errors_actual
        remaining_pct = (remaining / error_budget_target) * 100 if error_budget_target > 0 else 100

        return {
            'slo_name': slo_name,
            'error_budget_target': error_budget_target,
            'errors_actual': errors_actual,
            'remaining': remaining,
            'remaining_percentage': remaining_pct,
            'exhausted': remaining < 0
        }

# Example SLOs
def define_standard_slos() -> List[SLO]:
    """Define standard SLOs for a web service"""
    return [
        SLO(
            name="api_availability",
            sli=SLI(
                name="availability",
                description="Percentage of successful requests",
                query="sum(rate(http_requests_total{code!~'5..'}[5m])) / sum(rate(http_requests_total[5m])) * 100",
                unit="percentage"
            ),
            target=99.9,
            window_days=30
        ),
        SLO(
            name="api_latency",
            sli=SLI(
                name="latency_p95",
                description="95th percentile latency",
                query="histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
                unit="seconds"
            ),
            target=0.5,  # 500ms
            window_days=30,
            higher_is_better=False  # lower latency is better
        )
    ]
```
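
A short usage sketch for the tracker above (measurement values are synthetic):

```python
tracker = SLOTracker()
for slo in define_standard_slos():
    tracker.define_slo(slo)

# Record a few synthetic availability measurements (percentages)
for value in [99.95, 99.80, 99.99]:
    tracker.record_measurement("api_availability", value, datetime.now())

print(tracker.calculate_slo_compliance("api_availability"))
print(tracker.calculate_error_budget("api_availability"))
```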

## Incident Management

```python
from dataclasses import dataclass
from enum import Enum
from datetime import datetime
from typing import Dict, List, Optional

class Severity(Enum):
    SEV1 = "sev1"  # Critical
    SEV2 = "sev2"  # High
    SEV3 = "sev3"  # Medium
    SEV4 = "sev4"  # Low

class IncidentStatus(Enum):
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    MONITORING = "monitoring"
    RESOLVED = "resolved"

@dataclass
class Incident:
    incident_id: str
    title: str
    severity: Severity
    status: IncidentStatus
    started_at: datetime
    detected_at: datetime
    resolved_at: Optional[datetime]
    incident_commander: str
    responders: List[str]
    affected_services: List[str]
    timeline: List[Dict]
    root_cause: Optional[str] = None

class IncidentManager:
    """Manage incidents following SRE best practices"""

    def __init__(self):
        self.incidents: Dict[str, Incident] = {}

    def create_incident(self, incident: Incident) -> str:
        """Create new incident"""
        self.incidents[incident.incident_id] = incident

        # Notify on-call
        self.notify_oncall(incident)

        # Start incident timeline
        self.add_timeline_event(
            incident.incident_id,
            "Incident created",
            datetime.now()
        )

        return incident.incident_id

    def update_status(self, incident_id: str, new_status: IncidentStatus,
                     note: str):
        """Update incident status"""
        if incident_id in self.incidents:
            incident = self.incidents[incident_id]
            incident.status = new_status

            self.add_timeline_event(
                incident_id,
                f"Status changed to {new_status.value}: {note}",
                datetime.now()
            )

            if new_status == IncidentStatus.RESOLVED:
                incident.resolved_at = datetime.now()

    def add_timeline_event(self, incident_id: str, event: str,
                          timestamp: datetime):
        """Add event to incident timeline"""
        if incident_id in self.incidents:
            self.incidents[incident_id].timeline.append({
                'timestamp': timestamp,
                'event': event
            })

    def calculate_mttr(self, incident_id: str) -> Optional[float]:
        """Calculate time to resolution for a single incident (an input to MTTR)"""
        incident = self.incidents.get(incident_id)

        if incident and incident.resolved_at:
            duration = incident.resolved_at - incident.detected_at
            return duration.total_seconds() / 60  # minutes

        return None

    def generate_incident_report(self, incident_id: str) -> Dict:
        """Generate incident report"""
        incident = self.incidents.get(incident_id)

        if not incident:
            return {}

        return {
            'incident_id': incident.incident_id,
            'title': incident.title,
            'severity': incident.severity.value,
            'status': incident.status.value,
            'duration_minutes': self.calculate_mttr(incident_id),
            'affected_services': incident.affected_services,
            'incident_commander': incident.incident_commander,
            'responders': incident.responders,
            'timeline': incident.timeline,
            'root_cause': incident.root_cause
        }

    def notify_oncall(self, incident: Incident):
        """Notify on-call engineer (integrate with PagerDuty, etc.)"""
        # Implementation would integrate with alerting system
        pass
```
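
A minimal walkthrough of the incident lifecycle above (IDs, names, and services are illustrative):

```python
manager = IncidentManager()

incident = Incident(
    incident_id="INC-001",
    title="Elevated 5xx rate on checkout API",
    severity=Severity.SEV2,
    status=IncidentStatus.INVESTIGATING,
    started_at=datetime.now(),
    detected_at=datetime.now(),
    resolved_at=None,
    incident_commander="alice",
    responders=["alice", "bob"],
    affected_services=["checkout-api"],
    timeline=[]
)

manager.create_incident(incident)
manager.update_status("INC-001", IncidentStatus.IDENTIFIED, "Bad deploy identified")
manager.update_status("INC-001", IncidentStatus.RESOLVED, "Deploy rolled back")
print(manager.generate_incident_report("INC-001"))
```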

## Monitoring and Alerting

```python
from typing import Dict, List

import numpy as np
from prometheus_client import Counter, Histogram, Gauge

# Metrics
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
request_duration = Histogram('http_request_duration_seconds', 'HTTP request duration')
active_connections = Gauge('active_connections', 'Number of active connections')

class MonitoringSystem:
    """Implement monitoring best practices"""

    def __init__(self):
        self.alerts = []

    def record_request(self, method: str, endpoint: str, status: int, duration: float):
        """Record HTTP request metrics"""
        request_count.labels(method=method, endpoint=endpoint, status=status).inc()
        request_duration.observe(duration)

    def define_alert(self, name: str, expression: str, threshold: float,
                    duration: str, severity: str) -> Dict:
        """Define alerting rule"""
        alert = {
            'name': name,
            'expression': expression,
            'threshold': threshold,
            'duration': duration,
            'severity': severity,
            'annotations': {
                'summary': f'{name} alert triggered',
                'runbook_url': f'https://runbooks.example.com/{name}'
            }
        }

        self.alerts.append(alert)
        return alert

    def check_golden_signals(self, metrics: Dict) -> Dict:
        """Check the four golden signals"""
        return {
            'latency': self._check_latency(metrics.get('latency', [])),
            'traffic': self._check_traffic(metrics.get('traffic', 0)),
            'errors': self._check_errors(metrics.get('error_rate', 0)),
            'saturation': self._check_saturation(metrics.get('cpu_usage', 0))
        }

    def _check_latency(self, latencies: List[float]) -> Dict:
        if not latencies:
            return {'status': 'unknown'}

        p95 = np.percentile(latencies, 95)
        return {
            'status': 'critical' if p95 > 1000 else 'ok',
            'p95_ms': p95
        }

    def _check_traffic(self, requests_per_second: float) -> Dict:
        return {
            'status': 'ok',
            'rps': requests_per_second
        }

    def _check_errors(self, error_rate: float) -> Dict:
        return {
            'status': 'critical' if error_rate > 1.0 else 'ok',
            'error_rate': error_rate
        }

    def _check_saturation(self, cpu_usage: float) -> Dict:
        return {
            'status': 'warning' if cpu_usage > 80 else 'ok',
            'cpu_usage': cpu_usage
        }
```
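
A quick usage sketch with synthetic metric values:

```python
monitoring = MonitoringSystem()

# Record one request, then evaluate the four golden signals
monitoring.record_request("GET", "/checkout", 200, 0.123)

signals = monitoring.check_golden_signals({
    'latency': [120.0, 250.0, 980.0],  # milliseconds
    'traffic': 42.0,                   # requests per second
    'error_rate': 0.4,                 # percent of requests
    'cpu_usage': 65.0                  # percent
})
print(signals)  # each signal reports a status such as 'ok', 'warning', or 'critical'
```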

## Chaos Engineering

```python
import random
import time
from datetime import datetime
from typing import Callable, Dict, List

class ChaosExperiment:
    """Run chaos engineering experiments"""

    def __init__(self, name: str, hypothesis: str):
        self.name = name
        self.hypothesis = hypothesis
        self.results = []

    def inject_latency(self, service_call: Callable, delay_ms: int):
        """Inject latency into service call"""
        time.sleep(delay_ms / 1000)
        return service_call()

    def inject_failure(self, service_call: Callable, failure_rate: float):
        """Randomly fail service calls"""
        if random.random() < failure_rate:
            raise Exception("Chaos: Simulated failure")
        return service_call()

    def kill_random_instance(self, instances: List[str]) -> str:
        """Kill random instance"""
        victim = random.choice(instances)
        # Implementation would actually kill the instance
        return victim

    def run_experiment(self, experiment_func: Callable) -> Dict:
        """Run chaos experiment"""
        start_time = datetime.now()

        try:
            result = experiment_func()
            status = "success"
            error = None
        except Exception as e:
            result = None
            status = "failed"
            error = str(e)

        end_time = datetime.now()

        experiment_result = {
            'name': self.name,
            'hypothesis': self.hypothesis,
            'status': status,
            'result': result,
            'error': error,
            'duration': (end_time - start_time).total_seconds(),
            'timestamp': start_time
        }

        self.results.append(experiment_result)
        return experiment_result
```
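
A small usage sketch (flaky_dependency is a stand-in for a real downstream call):

```python
def flaky_dependency() -> str:
    """Stand-in for a call to a downstream service."""
    return "ok"

experiment = ChaosExperiment(
    name="checkout_dependency_failure",
    hypothesis="Checkout degrades gracefully when a dependency fails 30% of the time"
)

result = experiment.run_experiment(
    lambda: experiment.inject_failure(flaky_dependency, failure_rate=0.3)
)
print(result['status'], result['error'])
```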

## Best Practices

### SRE Principles
- Embrace risk management
- Set SLOs based on user experience
- Use error budgets for decision making (see the release-gate sketch after this list)
- Automate toil away
- Monitor the four golden signals
- Practice blameless post-mortems
- Gradual rollouts and canary deployments
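
One way to make the error-budget principle concrete is a release gate: pause feature rollouts while the budget is exhausted. A minimal sketch, assuming the SLOTracker class from the SLO/SLI section above:

```python
def can_deploy(tracker: SLOTracker, slo_name: str) -> bool:
    """Gate feature rollouts on remaining error budget."""
    budget = tracker.calculate_error_budget(slo_name)
    if budget.get('status') == 'no_data':
        return True  # policy choice: allow deploys when there is no signal yet
    return not budget['exhausted']
```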

### Incident Management
- Clear incident severity definitions
- Defined incident commander role
- Communicate proactively
- Document timeline during incident
- Conduct post-incident reviews
- Track action items to completion
- Share learnings across teams

### On-Call
- Reasonable on-call rotations
- Comprehensive runbooks
- Alert on symptoms, not causes (see the example after this list)
- Actionable alerts only
- Escalation policies
- Support on-call engineers
- Measure and reduce alert fatigue
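
For example, a symptom-based alert watches what users experience (the failure ratio) rather than an internal cause (high CPU, a restarting pod). A sketch reusing the MonitoringSystem.define_alert helper from the monitoring section; the PromQL expression and thresholds are illustrative:

```python
monitoring = MonitoringSystem()

# Symptom: users are seeing errors. A cause-based alert (CPU high, one pod
# restarting) can fire even when no user is affected.
monitoring.define_alert(
    name="APIHighErrorRatio",
    expression=(
        "sum(rate(http_requests_total{code=~'5..'}[5m])) / "
        "sum(rate(http_requests_total[5m]))"
    ),
    threshold=0.02,   # page when more than 2% of requests fail
    duration="10m",   # sustained, so brief blips do not page anyone
    severity="page"
)
```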

## Anti-Patterns

❌ No SLOs defined
❌ Alerts without runbooks
❌ Blame culture for incidents
❌ No post-incident reviews
❌ 100% uptime expectations
❌ Toil not tracked or reduced
❌ Manual processes for common tasks

## Resources

- Google SRE Book: https://sre.google/sre-book/table-of-contents/
- Site Reliability Engineering: https://sre.google/
- SLO Workshop: https://github.com/google/slo-workshop
- Chaos Engineering: https://principlesofchaos.org/
- Prometheus: https://prometheus.io/

Overview

This skill provides expert-level Site Reliability Engineering guidance for defining and operating reliable production services. It covers SLO/SLI design, error budget management, incident response, monitoring and alerting, automation, chaos experiments, and operational best practices. The content is practical and focused on outcomes: measurable reliability, reduced toil, and faster incident resolution.

How this skill works

The skill codifies core SRE concepts: SLOs and SLIs, error-budget calculations, incident lifecycles, monitoring metrics, and chaos experiments. It provides concrete patterns for measuring compliance, calculating remaining error budget, managing incidents (including MTTR and post-incident reports), defining alerts and runbooks, and automating routine operational tasks. Outputs include compliance reports, incident reports, alert definitions, and experiment results that inform follow-up action.

When to use it

  • When establishing or revising SLOs and SLIs for customer-facing services
  • During on-call playbook creation, runbook authoring, or alert tuning
  • When defining incident management processes, severity levels, and MTTR targets
  • To prioritize automation opportunities and reduce toil through IaC and remediation
  • When running chaos experiments to validate resilience and recovery processes
  • When building monitoring rules and defining golden-signal checks

Best practices

  • Define SLOs that map to user experience and measure meaningful SLIs
  • Use error budgets to balance feature rollout and operational risk
  • Alert on symptoms with clear runbooks; avoid noisy, non-actionable alerts
  • Practice blameless post-incident reviews and track action items to completion
  • Automate repetitive tasks (IaC, deployment, remediation) to reduce toil
  • Adopt gradual rollouts, canaries, and capacity planning tied to metrics

Example use cases

  • Create a 30-day availability SLO for APIs and calculate remaining error budget
  • Implement an incident workflow with roles, timeline events, MTTR calculation, and post-incident report
  • Define Prometheus alerting rules with runbook links and severity labels
  • Automate deployment and rollback using infrastructure-as-code and runbook automation
  • Run chaos experiments to validate failover, latency injection, and recovery procedures

FAQ

How do I choose an SLO target?

Choose targets based on critical user journeys and acceptable impact; use telemetry to map user experience to SLI values, and set targets that balance reliability with product velocity. For example, a 99.9% availability target over a 30-day window allows roughly 43 minutes of downtime; if users would not notice that much, a tighter target only slows feature work.

What should an actionable alert include?

An actionable alert includes the symptom, threshold, severity, a short remediation runbook or link, and suggested next steps for the on-call engineer.