
sre-oncall skill


This skill assists on-call engineers by handling incident response, escalation, recovery, and postmortems to shorten downtime.

npx playbooks add skill shaul1991/shaul-agents-plugin --skill sre-oncall

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
471 B
---
name: sre-oncall
description: SRE Oncall Agent. Handles incident response, escalation, and postmortems.
allowed-tools: Read, Write, Edit, Bash, Grep, Glob, WebSearch
---

# SRE Oncall Agent

## Role
Handles incident response and outage management.

## Responsibilities
- Incident response
- Escalation
- Recovery execution
- Postmortem writing

## Output locations
- Incident logs: `incidents/`
- Postmortems: `postmortems/`

Overview

This skill is the SRE Oncall Agent, which handles incident response, escalation, recovery, and postmortem documentation. It acts as an operational companion for on-call engineers, coordinating actions and producing incident logs and postmortems. The agent focuses on clear handoffs, traceable incident logs, and actionable postmortems to reduce recurrence.

How this skill works

The agent inspects incoming alerts and incident metadata, categorizes severity, and triggers escalation paths when needed. It coordinates recovery steps, records timeline events in the incident log, and compiles a structured postmortem after resolution. Incident logs go to incidents/ and postmortems to postmortems/, so every artifact is easy to find for review and follow-up.
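The record-keeping part of this flow could look roughly like the sketch below. It is not taken from the skill itself: the `Incident` class, its field names, the `SEV2`-style severity label, and the one-JSON-file-per-incident layout are all assumptions made for illustration. Only the `incidents/` directory comes from SKILL.md.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class Incident:
    """Hypothetical incident record; the field names are assumptions."""
    incident_id: str
    severity: str  # illustrative label such as "SEV1" or "SEV2"
    timeline: list = field(default_factory=list)

    def record(self, message: str) -> dict:
        """Append a UTC-timestamped event to the timeline and return it."""
        event = {
            "at": datetime.now(timezone.utc).isoformat(),
            "message": message,
        }
        self.timeline.append(event)
        return event

    def save(self, root: Path = Path("incidents")) -> Path:
        """Write the incident as JSON under the given directory."""
        root.mkdir(parents=True, exist_ok=True)
        path = root / f"{self.incident_id}.json"
        payload = {
            "incident_id": self.incident_id,
            "severity": self.severity,
            "timeline": self.timeline,
        }
        path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
        return path
```

Calling `record()` for each action and `save()` at handoff or resolution keeps the timeline in one file that a later postmortem can draw on.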

When to use it

  • When an alert indicates a production service outage or degradation
  • During on-call shifts to coordinate triage and escalation
  • To document recovery actions and create a reproducible timeline
  • After incident resolution to create a postmortem and lessons learned
  • When enforcing escalation policy and tracking accountability

Best practices

  • Triage quickly: assign severity and initial owner within the first 5–10 minutes
  • Log every action and decision in the incident timeline for traceability
  • Follow predefined escalation policies and update them when gaps appear
  • Keep postmortems factual, blameless, and focused on systemic fixes
  • Store incident logs and postmortems in the designated directories for easy retrieval
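
To make the postmortem guidance above concrete, here is a minimal sketch of a template renderer. The function name `render_postmortem`, the section headings, and the argument shapes are hypothetical; the skill does not prescribe a postmortem format.

```python
def render_postmortem(incident_id, summary, timeline, action_items):
    """Render a blameless postmortem skeleton as Markdown text.

    timeline is a list of (time, description) pairs and action_items a
    list of strings. Both shapes are assumptions for this sketch.
    """
    lines = [
        f"# Postmortem: {incident_id}",
        "",
        "## Summary",
        summary,
        "",
        "## Timeline",
    ]
    lines += [f"- {at}: {what}" for at, what in timeline]
    lines += [
        "",
        "## Contributing factors",
        "(describe systemic causes, not individuals)",
        "",
        "## Action items",
    ]
    lines += [f"- [ ] {item}" for item in action_items]
    return "\n".join(lines) + "\n"
```

The returned text could then be written to a file under `postmortems/`, the location named in SKILL.md.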

Example use cases

  • A sudden database failure where the agent triggers escalation and documents recovery steps
  • A latency spike in an API where the agent coordinates mitigation and records timeline events
  • After deploying a fix, the agent compiles a postmortem outlining root cause and preventative tasks
  • On-call rotation handoffs where the agent provides a concise incident summary and status
  • Periodic audits to review incident logs and identify recurring failure modes
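
The periodic-audit use case could be approached with a simple frequency count, as in the sketch below. It assumes each incident record carries a `failure_mode` field. That field name is hypothetical and does not come from the skill.

```python
from collections import Counter


def recurring_failure_modes(incidents, min_count=2):
    """Return (failure_mode, count) pairs seen at least min_count times.

    incidents is an iterable of dicts with a "failure_mode" key, which is
    an assumed field for this sketch. Results are ordered by count,
    highest first.
    """
    counts = Counter(incident["failure_mode"] for incident in incidents)
    return [(mode, n) for mode, n in counts.most_common() if n >= min_count]
```

Failure modes that show up repeatedly are natural candidates for the systemic fixes a postmortem should propose.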

FAQ

Where are incident logs and postmortems stored?

Incident logs are kept in the incidents/ directory and postmortems in postmortems/ for consistent access.

How does escalation work?

The agent uses severity classification to follow configured escalation paths, notifying on-call personnel and managers as required.
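
As an illustration only, a severity-driven escalation path might be modeled as below. The severity labels, role names, and 10-minute step are assumptions, not the skill's actual configuration.

```python
# Hypothetical escalation paths keyed by severity; roles are listed in
# the order they would be notified.
ESCALATION_PATHS = {
    "SEV1": ["primary-oncall", "secondary-oncall", "engineering-manager"],
    "SEV2": ["primary-oncall", "secondary-oncall"],
    "SEV3": ["primary-oncall"],
}


def escalation_targets(severity, minutes_unacknowledged, step_minutes=10):
    """Return the roles that should have been notified so far.

    One more role on the path is added for every step_minutes the
    incident stays unacknowledged, capped at the end of the path.
    """
    path = ESCALATION_PATHS[severity]
    steps = 1 + minutes_unacknowledged // step_minutes
    return path[:min(steps, len(path))]
```

With these assumed values, a `SEV1` incident left unacknowledged for 25 minutes would reach all three roles, while a `SEV3` incident would never go beyond the primary on-call.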