
This skill guides and automates PostHog incident response, triage, and postmortems to reduce downtime and speed recovery.

npx playbooks add skill jeremylongshore/claude-code-plugins-plus-skills --skill posthog-incident-runbook

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
5.2 KB
---
name: posthog-incident-runbook
description: |
  Execute PostHog incident response procedures with triage, mitigation, and postmortem.
  Use when responding to PostHog-related outages, investigating errors,
  or running post-incident reviews for PostHog integration failures.
  Trigger with phrases like "posthog incident", "posthog outage",
  "posthog down", "posthog on-call", "posthog emergency", "posthog broken".
allowed-tools: Read, Grep, Bash(kubectl:*), Bash(curl:*)
version: 1.0.0
license: MIT
author: Jeremy Longshore <[email protected]>
---

# PostHog Incident Runbook

## Overview
Rapid incident response procedures for PostHog-related outages.

## Prerequisites
- Access to PostHog dashboard and status page
- kubectl access to production cluster
- Prometheus/Grafana access
- Communication channels (Slack, PagerDuty)

## Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Complete outage | < 15 min | PostHog API unreachable |
| P2 | Degraded service | < 1 hour | High latency, partial failures |
| P3 | Minor impact | < 4 hours | Webhook delays, non-critical errors |
| P4 | No user impact | Next business day | Monitoring gaps |

## Quick Triage

```bash
# 1. Check PostHog status (assumes a Statuspage-style JSON API; adjust the path if your status page differs)
curl -s https://status.posthog.com/api/v2/status.json | jq '.status'

# 2. Check our integration health
curl -s https://api.yourapp.com/health | jq '.services.posthog'

# 3. Check error rate (last 5 min) -- quote the URL so the shell doesn't glob ? and []
curl -s 'localhost:9090/api/v1/query?query=rate(posthog_errors_total[5m])' | jq '.data.result'

# 4. Recent error logs
kubectl logs -l app=posthog-integration --since=5m | grep -i error | tail -20
```

## Decision Tree

```
PostHog API returning errors?
├─ YES: Is status.posthog.com showing incident?
│   ├─ YES → Wait for PostHog to resolve. Enable fallback.
│   └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
    ├─ YES → Likely resolved or intermittent. Monitor.
    └─ NO → Our infrastructure issue. Check pods, memory, network.
```
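
The same branching can be scripted for a first pass. Below is a minimal sketch, not part of the skill itself: the Statuspage-style JSON endpoint, the `api.yourapp.com/health` response shape, and the `app=posthog-integration` label are the placeholder assumptions used throughout this runbook.

```bash
#!/usr/bin/env bash
# triage.sh - minimal sketch of the decision tree above.
# Endpoints, label selector, and JSON paths are this runbook's placeholders (assumptions).

STATUS_JSON="https://status.posthog.com/api/v2/status.json"  # Statuspage-style JSON endpoint (assumption)
APP_HEALTH="https://api.yourapp.com/health"                  # your integration's health endpoint

# Upstream: Statuspage "indicator" is "none" when all systems are operational
posthog_indicator=$(curl -sf "$STATUS_JSON" | jq -r '.status.indicator // empty' 2>/dev/null)
posthog_indicator=${posthog_indicator:-unknown}

# Our side: health endpoint reports per-service status
our_status=$(curl -sf "$APP_HEALTH" | jq -r '.services.posthog.status // empty' 2>/dev/null)
our_status=${our_status:-unknown}

echo "PostHog status indicator: $posthog_indicator"
echo "Integration health:       $our_status"

if [[ "$posthog_indicator" != "none" && "$posthog_indicator" != "unknown" ]]; then
  echo "=> Upstream incident likely. Wait for PostHog; consider enabling fallback."
elif [[ "$our_status" != "healthy" ]]; then
  echo "=> Likely our side. Check credentials, config, pods, memory, network:"
  kubectl get pods -l app=posthog-integration
else
  echo "=> Likely resolved or intermittent. Keep monitoring."
fi
```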

## Immediate Actions by Error Type

### 401/403 - Authentication
```bash
# Verify API key is set
kubectl get secret posthog-secrets -o jsonpath='{.data.api-key}' | base64 -d

# Check if key was rotated
# → Verify in PostHog dashboard

# Remediation: Update secret and restart pods
kubectl create secret generic posthog-secrets --from-literal=api-key=NEW_KEY --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/posthog-integration
```
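
After rotating the key, confirm the rollout completed and the integration reports healthy before declaring the incident mitigated. A quick check, reusing the placeholder health endpoint from Quick Triage:

```bash
# Wait for the restarted pods to become ready, then re-check integration health
kubectl rollout status deployment/posthog-integration --timeout=120s
curl -sf https://api.yourapp.com/health | jq '.services.posthog'
```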

### 429 - Rate Limited
```bash
# Check rate limit headers on a recent response (e.g. RateLimit-Limit, Retry-After)
curl -sv -o /dev/null https://api.posthog.com 2>&1 | grep -iE 'rate-?limit|retry-after'

# Enable request queuing
kubectl set env deployment/posthog-integration RATE_LIMIT_MODE=queue

# Long-term: Contact PostHog for limit increase
```
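
If the integration has no built-in queue mode (the `RATE_LIMIT_MODE` flag above is this runbook's convention, not a PostHog setting), a retry with exponential backoff on 429 is the usual stopgap. A minimal shell sketch; the wrapper name and limits are hypothetical:

```bash
# Hypothetical helper: retry a request with exponential backoff while PostHog returns 429.
# Usage: send_with_backoff <curl arguments for the PostHog request>
send_with_backoff() {
  local attempt=1 max_attempts=5 delay=1 http_code
  while (( attempt <= max_attempts )); do
    http_code=$(curl -s -o /dev/null -w '%{http_code}' "$@")
    if [[ "$http_code" != "429" ]]; then
      echo "attempt $attempt: HTTP $http_code"  # non-429; real code should also handle other errors
      return 0
    fi
    echo "attempt $attempt: rate limited, retrying in ${delay}s"
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
  echo "still rate limited after $max_attempts attempts" >&2
  return 1
}
```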

### 500/503 - PostHog Errors
```bash
# Enable graceful degradation
kubectl set env deployment/posthog-integration POSTHOG_FALLBACK=true

# Notify users of degraded service
# Update status page

# Monitor PostHog status for resolution
```
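
What `POSTHOG_FALLBACK=true` actually does is up to your application; the flag is this runbook's convention, not a PostHog setting. A useful operational follow-up is a small watch loop that clears the flag once the upstream status recovers, so degraded mode is not left on by accident. A hedged sketch, reusing the Statuspage-style endpoint assumed in Quick Triage:

```bash
# Poll PostHog status every 5 minutes; disable the fallback flag once the indicator reports "none".
# The JSON endpoint and POSTHOG_FALLBACK flag are assumptions carried over from earlier sections.
while true; do
  indicator=$(curl -sf https://status.posthog.com/api/v2/status.json | jq -r '.status.indicator // empty')
  if [[ "$indicator" == "none" ]]; then
    echo "$(date -u +%FT%TZ) PostHog reports operational; disabling fallback"
    kubectl set env deployment/posthog-integration POSTHOG_FALLBACK=false
    break
  fi
  echo "$(date -u +%FT%TZ) indicator: ${indicator:-unreachable}; keeping fallback enabled"
  sleep 300
done
```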

## Communication Templates

### Internal (Slack)
```
🔴 P1 INCIDENT: PostHog Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]
```

### External (Status Page)
```
PostHog Integration Issue

We're experiencing issues with our PostHog integration.
Some users may experience [specific impact].

We're actively investigating and will provide updates.

Last updated: [timestamp]
```

## Post-Incident

### Evidence Collection
```bash
# Generate debug bundle
./scripts/posthog-debug-bundle.sh

# Export relevant logs
kubectl logs -l app=posthog-integration --since=1h > incident-logs.txt

# Capture metrics (last 2h of the error counter; GNU date shown, use `date -v-2H +%s` on macOS)
curl -sG 'localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=posthog_errors_total' \
  --data-urlencode "start=$(date -d '2 hours ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=60' > metrics.json
```
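
The debug bundle script referenced above is not included with this skill. A minimal sketch of what such a script might collect, using the same labels and placeholder endpoints as the rest of this runbook:

```bash
#!/usr/bin/env bash
# Hypothetical posthog-debug-bundle.sh: gather the artifacts referenced in this runbook.
set -u
BUNDLE="posthog-debug-$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p "$BUNDLE"

kubectl get pods -l app=posthog-integration -o wide  > "$BUNDLE/pods.txt"
kubectl describe deployment posthog-integration      > "$BUNDLE/deployment.txt"
kubectl logs -l app=posthog-integration --since=1h   > "$BUNDLE/logs.txt"
curl -s 'localhost:9090/api/v1/query?query=rate(posthog_errors_total[5m])' > "$BUNDLE/error-rate.json"
curl -sf https://api.yourapp.com/health              > "$BUNDLE/app-health.json"

tar czf "$BUNDLE.tar.gz" "$BUNDLE"
echo "wrote $BUNDLE.tar.gz"
```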

### Postmortem Template
```markdown
## Incident: PostHog [Error Type]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]

### Summary
[1-2 sentence description]

### Timeline
- HH:MM - [Event]
- HH:MM - [Event]

### Root Cause
[Technical explanation]

### Impact
- Users affected: N
- Revenue impact: $X

### Action Items
- [ ] [Preventive measure] - Owner - Due date
```

## Instructions

### Step 1: Quick Triage
Run the triage commands to identify the issue source.

### Step 2: Follow Decision Tree
Determine if the issue is PostHog-side or internal.

### Step 3: Execute Immediate Actions
Apply the appropriate remediation for the error type.

### Step 4: Communicate Status
Update internal and external stakeholders.

## Output
- Issue identified and categorized
- Remediation applied
- Stakeholders notified
- Evidence collected for postmortem

## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| Can't reach status page | Network issue | Use mobile or VPN |
| kubectl fails | Auth expired | Re-authenticate |
| Metrics unavailable | Prometheus down | Check backup metrics |
| Secret rotation fails | Permission denied | Escalate to admin |

## Examples

### One-Line Health Check
```bash
# jq -e returns a failing exit status when the field is missing or null, so the fallback message fires
curl -sf https://api.yourapp.com/health | jq -e '.services.posthog.status' || echo "UNHEALTHY"
```
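
During recovery it can be useful to repeat the same check on an interval, for example (same placeholder endpoint as above):

```bash
# Re-run the health check every 30 seconds while monitoring recovery
watch -n 30 "curl -sf https://api.yourapp.com/health | jq -e '.services.posthog.status' || echo UNHEALTHY"
```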

## Resources
- [PostHog Status Page](https://status.posthog.com)
- [PostHog Support](https://support.posthog.com)

## Next Steps
For data handling, see `posthog-data-handling`.

Overview

This skill executes a compact, actionable incident runbook for PostHog integrations to triage outages, mitigate impact, and produce a post-incident review. It guides responders through severity classification, quick checks, targeted remediations, and communication templates. The goal is to resolve PostHog-related failures fast and collect evidence for a clear postmortem.

How this skill works

The skill runs a series of diagnostic steps: check PostHog status, verify integration health, inspect recent error rates and logs, and determine whether the fault is upstream (PostHog) or internal. It maps common error codes (401/403, 429, 500/503) to concrete remediation commands and environment changes, and provides communication snippets for internal and external updates. After resolution it collects logs, metrics, and a postmortem template to capture root cause and action items.
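
The code-to-action mapping described above can be summarized as a small dispatch. A sketch, using the placeholder deployment name and environment flags from the runbook; the HTTP code is whatever your health check or logs surface:

```bash
# Sketch: map an observed HTTP status code to the runbook's remediation (placeholders assumed).
case "$HTTP_CODE" in
  401|403) echo "Auth failure: verify or rotate the API key secret, then restart the deployment" ;;
  429)     kubectl set env deployment/posthog-integration RATE_LIMIT_MODE=queue ;;
  500|503) kubectl set env deployment/posthog-integration POSTHOG_FALLBACK=true ;;
  *)       echo "Unmapped code $HTTP_CODE: follow the decision tree manually" ;;
esac
```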

When to use it

  • Responding to a suspected PostHog outage or integration failure
  • Investigating spikes in PostHog-related errors or latency
  • When PostHog API calls return 4xx/5xx responses
  • During on-call rotation for monitoring PostHog integrations
  • Preparing a post-incident review after a degraded or failed integration

Best practices

  • Start with severity classification (P1–P4) to drive response cadence
  • Run the quick triage commands immediately to gather status, health, metrics, and recent logs
  • Follow the decision tree to determine whether to wait for PostHog or remediate locally
  • Use the provided remediation snippets (rotate API key, enable fallback/queue modes) before scaling changes
  • Collect debug bundles, logs, and metrics for the postmortem and assign clear owners for action items

Example use cases

  • PostHog API unreachable for all users (P1): run status check, enable fallback, notify stakeholders
  • High error rate from PostHog (P2): inspect rate metrics, switch to queued requests, request PostHog limit increase
  • Authentication failures (401/403): verify secret, check for key rotation, update secret and restart deployment
  • Intermittent 500/503s: enable graceful degradation, update status page, monitor until PostHog resolves
  • Post-incident review: gather debug bundle, export logs and metrics, complete postmortem template

FAQ

What if status.posthog.com is down or unreachable?

Use a mobile network or VPN to confirm PostHog status. If you cannot reach the status page, proceed with local triage steps and apply safe mitigations (fallback/queue) while investigating network connectivity.

When should I contact PostHog support?

Contact PostHog if errors persist after local remediation, if you are consistently rate limited (429), or if PostHog status indicates a service incident that affects your integration.

How do I handle secret rotation failures?

If secret updates fail due to permissions, re-authenticate or escalate to the platform admin. Validate the new key in the PostHog dashboard before rolling it into production.