operating-production-services skill

/.claude/skills/operating-production-services

This skill helps you implement reliability targets and incident workflows by applying SLOs, error budgets, and postmortems across services.

npx playbooks add skill mjunaidca/mjs-agent-skills --skill operating-production-services

SKILL.md

---
name: operating-production-services
description: |
  SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response.
  Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing
  on-call practices. NOT for initial service development (use scaffolding skills instead).
---

# Operating Production Services

Production reliability patterns: measure what matters, learn from failures, improve systematically.

## Quick Reference

| Need | Go To |
|------|-------|
| Define reliability targets | [SLOs & Error Budgets](#slos--error-budgets) |
| Write incident report | [Postmortem Templates](#postmortem-templates) |
| Set up SLO alerting | [references/slo-alerting.md](references/slo-alerting.md) |

---

## SLOs & Error Budgets

### The Hierarchy

```
SLA (Contract) → SLO (Target) → SLI (Measurement)
```

### Common SLIs

```promql
# Availability: successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))

# Latency: requests below threshold / total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```
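Both queries are "good events / total events" ratios. As a rough illustration of the same arithmetic outside Prometheus, here is a minimal Python sketch. The `(status, duration)` record shape and the function names are assumptions made for this example, not part of the skill.

```python
from typing import List, Optional, Tuple

# Hypothetical request record: (HTTP status code, duration in seconds).
Request = Tuple[int, float]


def availability_sli(requests: List[Request]) -> Optional[float]:
    """Share of requests that did not return a 5xx status."""
    if not requests:
        return None  # no traffic, the ratio is undefined
    good = sum(1 for status, _ in requests if not 500 <= status <= 599)
    return good / len(requests)


def latency_sli(requests: List[Request], threshold_seconds: float = 0.5) -> Optional[float]:
    """Share of requests served within the latency threshold (inclusive, like `le`)."""
    if not requests:
        return None
    fast = sum(1 for _, duration in requests if duration <= threshold_seconds)
    return fast / len(requests)
```

As in the PromQL above, the latency ratio counts every request regardless of status code.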

### SLO Targets Reality Check

| SLO % | Downtime/Month | Downtime/Year |
|-------|----------------|---------------|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43 minutes | 8.76 hours |
| 99.95% | 22 minutes | 4.38 hours |
| 99.99% | 4.3 minutes | 52 minutes |

**Don't aim for 100%.** Each additional nine cuts the allowed downtime tenfold, and the cost of achieving it typically rises steeply.
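The table values follow directly from the SLO percentage. A minimal Python sketch, assuming a 30-day month and a 365-day year (the function name `allowed_downtime_minutes` is illustrative):

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float = 30.0) -> float:
    """Minutes of full downtime an SLO permits over the given period."""
    if not 0.0 < slo_percent < 100.0:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    error_budget = 1.0 - slo_percent / 100.0
    return error_budget * period_days * 24 * 60


if __name__ == "__main__":
    # Recompute the rows of the table above.
    for slo in (99.0, 99.9, 99.95, 99.99):
        per_month = allowed_downtime_minutes(slo, 30)
        per_year = allowed_downtime_minutes(slo, 365)
        print(f"{slo}%: {per_month:.1f} min/month, {per_year / 60:.2f} h/year")
```

The figures describe full outages. A partial outage spends budget in proportion to the share of requests that fail.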

### Error Budget

```
Error Budget = 1 - SLO Target
```

**Example:** 99.9% SLO = 0.1% error budget ≈ 43 minutes of full downtime in a 30-day month

**Policy:**

| Budget Remaining | Action |
|------------------|--------|
| > 50% | Normal velocity |
| 10-50% | Postpone risky changes |
| < 10% | Freeze non-critical changes |
| 0% | Feature freeze, fix reliability |
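The policy table can be sketched as a lookup on the fraction of budget left. This is a minimal Python illustration, assuming request-based accounting and treating an overspent budget the same as 0%. The function names are hypothetical.

```python
def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if total_events <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


def policy_action(remaining: float) -> str:
    """Map the remaining budget fraction to the action in the policy table."""
    if remaining <= 0.0:
        return "Feature freeze, fix reliability"
    if remaining < 0.10:
        return "Freeze non-critical changes"
    if remaining <= 0.50:
        return "Postpone risky changes"
    return "Normal velocity"
```

For example, a 99.9% SLO with 600 failed requests out of 1,000,000 leaves about 40% of the budget, which maps to "Postpone risky changes".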

See [references/slo-alerting.md](references/slo-alerting.md) for Prometheus recording rules and multi-window burn rate alerts.
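For orientation, burn rate is the observed error ratio divided by the error budget: a burn rate of 1 spends the budget exactly over the SLO window. The Python sketch below is illustrative only. The default threshold of 14.4 is a commonly cited paging value for a 1-hour window paired with a 5-minute window (it spends 2% of a 30-day budget in one hour) and is an assumption here; the referenced file may use different values.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed value, not taken from this skill
) -> bool:
    """Multi-window check: both windows must exceed the burn-rate threshold."""
    return (
        burn_rate(long_window_error_ratio, slo_target) > threshold
        and burn_rate(short_window_error_ratio, slo_target) > threshold
    )
```

Requiring the short window as well keeps the alert from continuing to fire long after the errors have stopped.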

---

## Postmortem Templates

### The Blameless Principle

| Blame-Focused | Blameless |
|---------------|-----------|
| "Who caused this?" | "What conditions allowed this?" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |

### When to Write Postmortems

- SEV1/SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
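
The triggers above can be encoded as a simple check, for example in incident tooling. This is a minimal Python sketch; the `Incident` fields and the function name are hypothetical and only mirror the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    severity: int  # 1 for SEV1, 2 for SEV2, and so on
    customer_outage_minutes: float = 0.0
    data_loss: bool = False
    security_incident: bool = False
    severe_near_miss: bool = False
    novel_failure_mode: bool = False


def postmortem_reasons(incident: Incident) -> List[str]:
    """Return the triggers that apply; an empty list means none do."""
    reasons = []
    if incident.severity in (1, 2):
        reasons.append(f"SEV{incident.severity} incident")
    if incident.customer_outage_minutes > 15:
        reasons.append("Customer-facing outage > 15 minutes")
    if incident.data_loss or incident.security_incident:
        reasons.append("Data loss or security incident")
    if incident.severe_near_miss:
        reasons.append("Near-miss that could have been severe")
    if incident.novel_failure_mode:
        reasons.append("Novel failure mode")
    return reasons
```

Returning the matching reasons rather than a bare boolean lets the postmortem record why it was opened.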

### Standard Template

```markdown
# Postmortem: [Incident Title]

**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEVX

## Executive Summary
One paragraph: what happened, impact, root cause, resolution.

## Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | First alert fired |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service recovered |

## Root Cause Analysis

### 5 Whys
1. Why did the service fail? → [Answer]
2. Why did [1] happen? → [Answer]
3. Why did [2] happen? → [Answer]
4. Why did [3] happen? → [Answer]
5. Why did [4] happen? → [Root cause]

## Impact
- Customers affected: X
- Duration: X minutes
- Revenue impact: $X
- Support tickets: X

## Action Items
| Priority | Action | Owner | Due | Ticket |
|----------|--------|-------|-----|--------|
| P0 | [Immediate fix] | @name | Date | XXX-123 |
| P1 | [Prevent recurrence] | @name | Date | XXX-124 |
| P2 | [Improve detection] | @name | Date | XXX-125 |
```
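
The timeline rows in the template are enough to derive response metrics such as time to acknowledge and time to recover. A minimal Python sketch, assuming `HH:MM` UTC timestamps, an incident shorter than 24 hours, and the event labels used in the template (the function names are illustrative):

```python
from datetime import datetime, timedelta
from typing import Dict


def minutes_between(start_hhmm: str, end_hhmm: str) -> int:
    """Whole minutes from start to end, both given as HH:MM."""
    fmt = "%H:%M"
    start = datetime.strptime(start_hhmm, fmt)
    end = datetime.strptime(end_hhmm, fmt)
    if end < start:
        end += timedelta(days=1)  # assume the incident crossed midnight once
    return int((end - start).total_seconds() // 60)


def timeline_metrics(timeline: Dict[str, str]) -> Dict[str, int]:
    """Derive response metrics, all measured from the first alert."""
    first_alert = timeline["First alert fired"]
    return {
        "time_to_acknowledge_min": minutes_between(first_alert, timeline["On-call acknowledged"]),
        "time_to_root_cause_min": minutes_between(first_alert, timeline["Root cause identified"]),
        "time_to_recover_min": minutes_between(first_alert, timeline["Service recovered"]),
    }
```

Measuring from the first alert leaves out the gap between the actual trigger and detection. Track that separately if the trigger time is known.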

### Quick Template (Minor Incidents)

```markdown
# Quick Postmortem: [Title]

**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEV3

## What Happened
One sentence description.

## Timeline
- HH:MM - Trigger
- HH:MM - Detection
- HH:MM - Resolution

## Root Cause
One sentence.

## Fix
- Immediate: [What was done]
- Long-term: [Ticket XXX-123]
```

---

## Postmortem Meeting Guide

### Structure (60 min)

1. **Opening (5 min)** - Remind: "We're here to learn, not blame"
2. **Timeline (15 min)** - Walk through events chronologically
3. **Analysis (20 min)** - What failed? Why? What allowed it?
4. **Action Items (15 min)** - Prioritize, assign owners, set dates
5. **Closing (5 min)** - Summarize learnings, confirm owners

### Facilitation Tips

- Redirect blame to systems: "What made this mistake possible?"
- Time-box tangents
- Document dissenting views
- Encourage quiet participants

---

## Anti-Patterns

| Don't | Do Instead |
|-------|------------|
| Aim for 100% SLO | Accept that an error budget exists |
| Skip small incidents | Small incidents reveal patterns |
| Orphan action items | Every item needs owner + date + ticket |
| Blame individuals | Ask "what conditions allowed this?" |
| Create busywork actions | Actions should prevent recurrence |
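
The "owner + date + ticket" rule is easy to check mechanically before a postmortem is closed. A minimal Python sketch, assuming a hypothetical `ActionItem` record that mirrors the columns of the action items table:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ActionItem:
    priority: str
    action: str
    owner: Optional[str] = None
    due: Optional[str] = None
    ticket: Optional[str] = None


def missing_fields(item: ActionItem) -> List[str]:
    """Names of the required fields that are empty or unset."""
    return [name for name in ("owner", "due", "ticket") if not getattr(item, name)]


def orphaned_items(items: List[ActionItem]) -> List[ActionItem]:
    """Action items lacking an owner, a due date, or a ticket."""
    return [item for item in items if missing_fields(item)]
```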

---

## Verification

Run: `python scripts/verify.py`

## References

- [references/slo-alerting.md](references/slo-alerting.md) - Prometheus rules, burn rate alerts, Grafana dashboards

Overview

This skill provides Site Reliability Engineering (SRE) patterns for operating production services, focusing on SLOs, error budgets, postmortems, and incident response. It helps teams set measurable reliability targets, run blameless postmortems, and implement alerting and on-call practices to reduce downtime and improve learning from failures.

How this skill works

The skill explains the hierarchy of SLA → SLO → SLI and offers common SLI queries and realistic SLO targets with their downtime implications. It defines error budget policies and alerting burn rates, supplies standard and quick postmortem templates, and outlines a structured postmortem meeting and facilitation tips. Practical examples, anti-patterns, and verification steps round out the guidance.

When to use it

- Defining or reviewing service reliability targets and SLIs
- Implementing SLO-based alerting and error budget burn rate rules
- Writing blameless postmortems after SEV1/SEV2 incidents or customer-facing outages
- Establishing on-call processes and incident response runbooks
- Deciding change velocity based on remaining error budget

Best practices

- Measure what matters: choose SLIs that reflect customer experience (availability, latency, error rate)
- Set realistic SLOs (avoid aiming for 100%) and translate them into monthly/annual downtime
- Use error budget policy to drive deployment and risk decisions (normal velocity → freeze)
- Run blameless postmortems with a timeline, 5 Whys, clear impact, and assigned action items
- Ensure every action has an owner, due date, and ticket to avoid orphaned tasks

Example use cases

- Create a 99.9% availability SLO, calculate the monthly error budget, and set burn-rate alerts in Prometheus
- Execute a blameless postmortem after a 30-minute customer outage using the standard template and meeting guide
- Define on-call escalation steps and measure mean time to acknowledge and resolve via SLIs
- Use quick postmortems for minor incidents to capture root cause and track short-term fixes
- Adopt an anti-pattern checklist to prevent overly costly targets and avoid creating busywork action items

FAQ

How do I pick an SLI that reflects user experience?

Choose metrics that map directly to customer-facing outcomes: request success rate for availability, request latency percentiles for responsiveness, and error rates for correctness. Validate by comparing metric changes to real user impact.

When should I write a full postmortem versus a quick one?

Write a full postmortem for SEV1/SEV2 incidents, outages longer than 15 minutes, data loss or security events, and novel failure modes. Use the quick template for short, low-severity incidents to capture essential facts and fixes.