home / skills / openclaw / skills / pager-triage

pager-triage skill

/skills/tkuehnl/pager-triage

This skill helps you triage PagerDuty incidents by listing active alerts, diving into timelines, and coordinating acknowledgments or resolutions.

npx playbooks add skill openclaw/skills --skill pager-triage

Review the files below or copy the command above to add this skill to your agents.

Files (10)
SKILL.md
11.4 KB
---
name: pager-triage
version: 0.1.1
displayName: PagerDuty Incident Triage
description: >
  AI-powered incident triage for PagerDuty. List active incidents, deep-dive
  with timeline and alert correlation, check on-call schedules, acknowledge,
  resolve, and annotate — all from your agent. Read-only by default;
  write operations require explicit --confirm.
author: Anvil AI
tags:
  - pagerduty
  - opsgenie
  - incident
  - sre
  - oncall
  - triage
  - discord
  - discord-v2
tools:
  - name: pd_incidents
    description: List active PagerDuty incidents (triggered + acknowledged)
  - name: pd_incident_detail
    description: Get detailed incident info including timeline, alerts, and notes
  - name: pd_oncall
    description: Show current on-call schedules and escalation policies
  - name: pd_incident_ack
    description: Acknowledge an incident (requires --confirm)
  - name: pd_incident_resolve
    description: Resolve an incident (requires --confirm)
  - name: pd_incident_note
    description: Add a note to an incident (requires --confirm)
  - name: pd_services
    description: List PagerDuty services and their current status
  - name: pd_recent
    description: Show recent incidents for a service (last 24h/7d/30d)
---

# PagerDuty Incident Triage

AI-powered incident triage for PagerDuty. Read-only by default. Write operations require explicit confirmation.

## When to Activate

Use this skill when the user:

- Asks **"what's firing?"**, **"any incidents?"**, **"what's on fire?"**, or anything about active alerts
- Mentions **PagerDuty**, **pager**, **on-call**, or **incidents**
- Asks **"who's on call?"** or **"who's oncall right now?"**
- Wants to **acknowledge**, **resolve**, or **add a note** to an incident
- Says **"triage"**, **"incident response"**, or **"what happened last night?"**
- Asks about **service health** or **recent incidents** for a service

## Quick Setup

### 1. Create a PagerDuty API Key

1. Go to **PagerDuty → Settings → API Access Keys**
2. Click **"Create New API Key"**
3. Name it `OpenClaw Agent`
4. Select **Read-only** (recommended for triage; choose Full Access only if you need ack/resolve)
5. Copy the key

### 2. Set Environment Variables

```bash
# Required — your PagerDuty REST API v2 token
export PAGERDUTY_API_KEY="u+your_key_here"

# Optional — required only for write operations (ack, resolve, note)
export PAGERDUTY_EMAIL="[email protected]"
```

### 3. Verify

Ask your agent: **"What's firing on PagerDuty?"**

## Subcommands Reference

### Read Operations (always safe, no confirmation needed)

---

#### `incidents` — List Active Incidents

Lists all triggered and acknowledged incidents, sorted by urgency.

```
pager-triage incidents
```

**Example output:**
```json
{
  "tool": "pd_incidents",
  "provider": "pagerduty",
  "timestamp": "2026-02-16T03:45:00Z",
  "total_incidents": 3,
  "incidents": [
    {
      "id": "P123ABC",
      "incident_number": 4521,
      "title": "High CPU on prod-web-03",
      "status": "triggered",
      "urgency": "high",
      "service": { "id": "PSVC123", "name": "Production Web" },
      "created_at": "2026-02-16T03:00:00Z",
      "assignments": [{ "name": "Jane Doe", "email": "..." }],
      "alert_count": 3,
      "escalation_level": 1,
      "last_status_change": "2026-02-16T03:05:00Z"
    }
  ],
  "summary": "3 active incident(s): high (triggered) x1, high (acknowledged) x1, low (triggered) x1"
}
```

---

#### `detail <incident_id>` — Incident Deep Dive

Full incident details including timeline (log entries), related alerts, notes, and automated analysis.

```
pager-triage detail P123ABC
```

**Example output (abbreviated):**
```json
{
  "tool": "pd_incident_detail",
  "incident": {
    "id": "P123ABC",
    "title": "High CPU on prod-web-03",
    "status": "triggered",
    "urgency": "high",
    "service": { "id": "PSVC123", "name": "Production Web" },
    "escalation_policy": { "name": "Production Escalation" },
    "assignments": [{ "name": "Jane Doe", "escalation_level": 1 }]
  },
  "timeline": [
    { "type": "trigger_log_entry", "created_at": "...", "summary": "Incident triggered via Prometheus Alertmanager" },
    { "type": "escalate_log_entry", "created_at": "...", "summary": "Escalated to Jane Doe (Level 1)" }
  ],
  "alerts": [
    { "id": "A456DEF", "severity": "critical", "summary": "CPU > 95% on prod-web-03", "source": "Prometheus Alertmanager" }
  ],
  "notes": [],
  "analysis": {
    "alert_count": 3,
    "escalation_count": 1,
    "acknowledged": false,
    "trigger_source": "Prometheus Alertmanager"
  }
}
```

---

#### `oncall` — On-Call Schedules

Shows who's currently on call across all schedules and escalation policies.

```
pager-triage oncall
```

**Example output:**
```json
{
  "tool": "pd_oncall",
  "oncalls": [
    {
      "user": { "name": "Jane Doe", "email": "[email protected]" },
      "schedule": { "name": "Primary SRE", "id": "PSCHED1" },
      "escalation_policy": { "name": "Production Escalation" },
      "escalation_level": 1,
      "start": "2026-02-15T17:00:00Z",
      "end": "2026-02-16T17:00:00Z"
    }
  ],
  "summary": "2 on-call assignment(s). Primary SRE: Jane Doe. Secondary SRE: Bob Smith."
}
```

---

#### `services` — List Services

Lists all PagerDuty services with their current operational status.

```
pager-triage services
```

**Example output:**
```json
{
  "tool": "pd_services",
  "services": [
    {
      "id": "PSVC123",
      "name": "Production Web",
      "status": "critical",
      "description": "Production web application servers",
      "escalation_policy": "Production Escalation",
      "integrations": ["Prometheus Alertmanager", "CloudWatch"]
    }
  ],
  "summary": "12 services: 1 critical, 1 warning, 10 active, 0 disabled"
}
```

---

#### `recent` — Recent Incident History

Shows recent incidents for a service or across all services with summary statistics.

```
pager-triage recent                          # All services, last 24h
pager-triage recent --service PSVC123        # Specific service
pager-triage recent --since 7d               # Last 7 days
pager-triage recent --service PSVC123 --since 30d
```

**Flags:**
| Flag | Default | Description |
|------|---------|-------------|
| `--service <id>` | all | Filter to a specific PagerDuty service ID |
| `--since <window>` | `24h` | Time window: `24h`, `7d`, or `30d` |

**Example output:**
```json
{
  "tool": "pd_recent",
  "period": "last 24 hours",
  "service": "PSVC123",
  "incidents": [ ... ],
  "stats": {
    "total": 5,
    "by_urgency": { "high": 2, "low": 3 },
    "by_status": { "resolved": 4, "triggered": 1 },
    "mean_time_to_resolve_minutes": 42
  }
}
```

---

### Write Operations (⚠️ require `--confirm`)

These operations modify state in PagerDuty. They **require** the `--confirm` flag AND `PAGERDUTY_EMAIL` to be set. Without `--confirm`, the tool displays what it *would* do and exits.

---

#### `ack <incident_id> --confirm` — Acknowledge Incident

Acknowledges a triggered incident, stopping further escalation.

```
pager-triage ack P123ABC --confirm
```

Without `--confirm`, displays:
```json
{
  "error": "confirmation_required",
  "message": "⚠️ ACKNOWLEDGE INCIDENT — --confirm flag is required to proceed.",
  "incident": { "id": "P123ABC", "title": "High CPU on prod-web-03", "urgency": "high" },
  "hint": "Re-run with --confirm to acknowledge this incident."
}
```

With `--confirm`:
```json
{
  "tool": "pd_incident_ack",
  "incident_id": "P123ABC",
  "status": "acknowledged",
  "acknowledged_at": "2026-02-16T03:46:00Z",
  "acknowledged_by": "[email protected]"
}
```

---

#### `resolve <incident_id> --confirm` — Resolve Incident

Resolves an incident, marking it as fixed.

```
pager-triage resolve P123ABC --confirm
```

Same confirmation pattern as `ack`. Without `--confirm`, shows incident details and exits. With `--confirm`, resolves and returns confirmation JSON.

---

#### `note <incident_id> --content "text" --confirm` — Add Incident Note

Adds a permanent note to an incident's timeline.

```
pager-triage note P123ABC --content "Root cause: memory leak in auth-service v2.14.3. Rolling back." --confirm
```

**Flags:**
| Flag | Required | Description |
|------|----------|-------------|
| `--content <text>` | Yes | The note text to add |
| `--confirm` | Yes | Confirmation gate |

**Example output:**
```json
{
  "tool": "pd_incident_note",
  "incident_id": "P123ABC",
  "note_id": "PNOTE456",
  "content": "Root cause: memory leak in auth-service v2.14.3. Rolling back.",
  "created_at": "2026-02-16T04:00:00Z",
  "user": "Jane Doe"
}
```

---

## Incident Triage Workflow

When the user needs help triaging an incident, follow this workflow:

### Step 1: Assess the Situation
```
→ pager-triage incidents
```
List all active incidents. Prioritize by urgency (high first) and duration (oldest first).

### Step 2: Deep-Dive the Critical One
```
→ pager-triage detail <incident_id>
```
Get full timeline, alerts, and notes. Identify the trigger source and escalation history.

### Step 3: Correlate with Other Skills
If available, use companion skills to investigate root cause:
- **prom-query** → Query Prometheus for the underlying metrics (CPU, memory, latency, error rate)
- **kube-medic** → Check pod health, restarts, OOMKills, node status in Kubernetes
- **log-dive** → Search application logs for errors around the incident timeframe

### Step 4: Act
```
→ pager-triage ack <incident_id> --confirm        # Stop escalation while investigating
→ pager-triage note <incident_id> --content "..." --confirm   # Document findings
→ pager-triage resolve <incident_id> --confirm     # Mark as fixed
```

### Agent Guidance
- When the user says "what's wrong?" → start with `incidents`
- When they mention a specific incident → use `detail`
- When triaging → show incidents first, then detail on the most urgent
- When correlating → suggest prom-query / kube-medic if installed
- **ALWAYS** show the confirmation preview before executing write operations
- **NEVER** ack/resolve without the user explicitly asking to do so

### Discord v2 Delivery Mode (OpenClaw v2026.2.14+)

When the conversation is happening in a Discord channel:

- Send a compact first response (active incident count, highest urgency incident, recommended next step), then ask if the user wants full detail.
- Keep the first response under ~1200 characters and avoid full timeline dumps in the first message.
- If Discord components are available, include quick actions:
  - `Deep Dive Incident`
  - `Acknowledge Incident`
  - `Add Incident Note`
- If components are not available, provide the same follow-ups as a numbered list.
- Prefer short follow-up chunks (<=15 lines per message) for incident timelines and alert lists.

## Security Notes

- **API keys** are read from environment variables only — never logged, displayed, or included in output
- **Read-only by default** — 5 read commands work with any API key; 3 write commands require `--confirm`
- **Confirmation gates** — Write operations show full incident context and refuse to proceed without `--confirm`
- We recommend starting with a **read-only PagerDuty API key** for triage workflows
- See [SECURITY.md](./SECURITY.md) for the full threat model and RBAC recommendations

## OpsGenie Support (Planned)

OpsGenie integration is planned for a future release. When `OPSGENIE_API_KEY` is set, the same subcommands will map to OpsGenie's REST API with normalized output schemas.

---

<sub>Powered by [Anvil AI](https://anvil-ai.io) · Built for the engineer who gets paged at 3am</sub>

Overview

This skill provides AI-powered incident triage for PagerDuty, letting you list active incidents, deep-dive into timelines and correlated alerts, check on-call schedules, and perform acknowledge/resolve/note actions. It is read-only by default; any write operation requires an explicit --confirm and an operator email. The skill is optimized for quick situational awareness and safe, auditable actions during incident response.

How this skill works

The skill queries the PagerDuty REST API to enumerate incidents, services, on-call schedules, and recent incident history. For a selected incident it returns full details: timeline entries, related alerts, notes, and automated analysis like alert counts and trigger source. Write operations present a preview and refuse to run unless --confirm is provided and PAGERDUTY_EMAIL is set.

When to use it

  • Ask "what's firing?", "any incidents?", or "what's on fire?" to get active incidents.
  • Investigate a specific incident with a deep timeline and correlated alerts using detail <incident_id>.
  • Check who is currently on call across schedules and escalation policies with oncall.
  • Review recent incident trends for a service or time window with recent and services.
  • Acknowledge, resolve, or add a timeline note when you are authorized and ready to act (requires --confirm).

Best practices

  • Start with pager-triage incidents and prioritize by urgency and duration before acting.
  • Use detail <incident_id> to gather timeline, alerts, and automated analysis before ack/resolve.
  • Keep the API key read-only for routine triage; only use full-access keys when necessary.
  • Write operations always require --confirm and PAGERDUTY_EMAIL; preview the action first.
  • Correlate with metrics and logs (prom-query, kube-medic, log-dive) for root cause before resolving.

Example use cases

  • Triage overnight alerts: list active incidents, deep-dive the highest-urgency incident, and gather evidence.
  • On-call handoff: export current on-call assignments and summarize active incidents for the next shift.
  • Fast mitigation: acknowledge an incident to stop escalation while engineers investigate (with --confirm).
  • Postmortem prep: pull recent incidents for a service and export stats like mean time to resolve.
  • Document findings inline: add a timeline note with root cause or mitigation steps for auditability.

FAQ

Are read operations safe to run?

Yes. All listing and detail commands are read-only and do not modify PagerDuty state.

What do I need to perform write actions like ack or resolve?

You must set PAGERDUTY_EMAIL and include the --confirm flag. Without --confirm the skill only shows a preview and will not change anything.