
This skill guides and automates PostHog incident response, triage, and postmortems to reduce downtime and speed recovery.

npx playbooks add skill jeremylongshore/claude-code-plugins-plus-skills --skill posthog-incident-runbook

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
5.2 KB
---
name: posthog-incident-runbook
description: |
  Execute PostHog incident response procedures with triage, mitigation, and postmortem.
  Use when responding to PostHog-related outages, investigating errors,
  or running post-incident reviews for PostHog integration failures.
  Trigger with phrases like "posthog incident", "posthog outage",
  "posthog down", "posthog on-call", "posthog emergency", "posthog broken".
allowed-tools: Read, Grep, Bash(kubectl:*), Bash(curl:*)
version: 1.0.0
license: MIT
author: Jeremy Longshore <[email protected]>
---

# PostHog Incident Runbook

## Overview
Rapid incident response procedures for PostHog-related outages.

## Prerequisites
- Access to PostHog dashboard and status page
- kubectl access to production cluster
- Prometheus/Grafana access
- Communication channels (Slack, PagerDuty)

## Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Complete outage | < 15 min | PostHog API unreachable |
| P2 | Degraded service | < 1 hour | High latency, partial failures |
| P3 | Minor impact | < 4 hours | Webhook delays, non-critical errors |
| P4 | No user impact | Next business day | Monitoring gaps |

## Quick Triage

```bash
# 1. Check PostHog status (assumes a Statuspage-style JSON API; adjust the path if your status page differs)
curl -s https://status.posthog.com/api/v2/status.json | jq '.status'

# 2. Check our integration health
curl -s https://api.yourapp.com/health | jq '.services.posthog'

# 3. Check error rate (last 5 min) -- quote the URL so the shell doesn't glob ? and []
curl -s 'localhost:9090/api/v1/query?query=rate(posthog_errors_total[5m])' | jq '.data.result'

# 4. Recent error logs
kubectl logs -l app=posthog-integration --since=5m | grep -i error | tail -20
```

## Decision Tree

```
PostHog API returning errors?
├─ YES: Is status.posthog.com showing incident?
│   ├─ YES → Wait for PostHog to resolve. Enable fallback.
│   └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
    ├─ YES → Likely resolved or intermittent. Monitor.
    └─ NO → Our infrastructure issue. Check pods, memory, network.
```
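
The same branching can be scripted for a first pass. Below is a minimal sketch, not part of the skill itself: the Statuspage-style JSON endpoint, the `api.yourapp.com/health` response shape, and the `app=posthog-integration` label are the placeholder assumptions used throughout this runbook.

```bash
#!/usr/bin/env bash
# triage.sh - minimal sketch of the decision tree above.
# Endpoints, label selector, and JSON paths are this runbook's placeholders (assumptions).

STATUS_JSON="https://status.posthog.com/api/v2/status.json"  # Statuspage-style JSON endpoint (assumption)
APP_HEALTH="https://api.yourapp.com/health"                  # your integration's health endpoint

# Upstream: Statuspage "indicator" is "none" when all systems are operational
posthog_indicator=$(curl -sf "$STATUS_JSON" | jq -r '.status.indicator // empty' 2>/dev/null)
posthog_indicator=${posthog_indicator:-unknown}

# Our side: health endpoint reports per-service status
our_status=$(curl -sf "$APP_HEALTH" | jq -r '.services.posthog.status // empty' 2>/dev/null)
our_status=${our_status:-unknown}

echo "PostHog status indicator: $posthog_indicator"
echo "Integration health:       $our_status"

if [[ "$posthog_indicator" != "none" && "$posthog_indicator" != "unknown" ]]; then
  echo "=> Upstream incident likely. Wait for PostHog; consider enabling fallback."
elif [[ "$our_status" != "healthy" ]]; then
  echo "=> Likely our side. Check credentials, config, pods, memory, network:"
  kubectl get pods -l app=posthog-integration
else
  echo "=> Likely resolved or intermittent. Keep monitoring."
fi
```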

## Immediate Actions by Error Type

### 401/403 - Authentication
```bash
# Verify API key is set
kubectl get secret posthog-secrets -o jsonpath='{.data.api-key}' | base64 -d

# Check if key was rotated
# → Verify in PostHog dashboard

# Remediation: Update secret and restart pods
kubectl create secret generic posthog-secrets --from-literal=api-key=NEW_KEY --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/posthog-integration
```
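
After rotating the key, confirm the rollout completed and the integration reports healthy before declaring the incident mitigated. A quick check, reusing the placeholder health endpoint from Quick Triage:

```bash
# Wait for the restarted pods to become ready, then re-check integration health
kubectl rollout status deployment/posthog-integration --timeout=120s
curl -sf https://api.yourapp.com/health | jq '.services.posthog'
```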

### 429 - Rate Limited
```bash
# Check rate limit headers on a recent response (e.g. RateLimit-Limit, Retry-After)
curl -sv -o /dev/null https://api.posthog.com 2>&1 | grep -iE 'rate-?limit|retry-after'

# Enable request queuing
kubectl set env deployment/posthog-integration RATE_LIMIT_MODE=queue

# Long-term: Contact PostHog for limit increase
```
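
If the integration has no built-in queue mode (the `RATE_LIMIT_MODE` flag above is this runbook's convention, not a PostHog setting), a retry with exponential backoff on 429 is the usual stopgap. A minimal shell sketch; the wrapper name and limits are hypothetical:

```bash
# Hypothetical helper: retry a request with exponential backoff while PostHog returns 429.
# Usage: send_with_backoff <curl arguments for the PostHog request>
send_with_backoff() {
  local attempt=1 max_attempts=5 delay=1 http_code
  while (( attempt <= max_attempts )); do
    http_code=$(curl -s -o /dev/null -w '%{http_code}' "$@")
    if [[ "$http_code" != "429" ]]; then
      echo "attempt $attempt: HTTP $http_code"  # non-429; real code should also handle other errors
      return 0
    fi
    echo "attempt $attempt: rate limited, retrying in ${delay}s"
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
  echo "still rate limited after $max_attempts attempts" >&2
  return 1
}
```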

### 500/503 - PostHog Errors
```bash
# Enable graceful degradation
kubectl set env deployment/posthog-integration POSTHOG_FALLBACK=true

# Notify users of degraded service
# Update status page

# Monitor PostHog status for resolution
```
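
What `POSTHOG_FALLBACK=true` actually does is up to your application; the flag is this runbook's convention, not a PostHog setting. A useful operational follow-up is a small watch loop that clears the flag once the upstream status recovers, so degraded mode is not left on by accident. A hedged sketch, reusing the Statuspage-style endpoint assumed in Quick Triage:

```bash
# Poll PostHog status every 5 minutes; disable the fallback flag once the indicator reports "none".
# The JSON endpoint and POSTHOG_FALLBACK flag are assumptions carried over from earlier sections.
while true; do
  indicator=$(curl -sf https://status.posthog.com/api/v2/status.json | jq -r '.status.indicator // empty')
  if [[ "$indicator" == "none" ]]; then
    echo "$(date -u +%FT%TZ) PostHog reports operational; disabling fallback"
    kubectl set env deployment/posthog-integration POSTHOG_FALLBACK=false
    break
  fi
  echo "$(date -u +%FT%TZ) indicator: ${indicator:-unreachable}; keeping fallback enabled"
  sleep 300
done
```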

## Communication Templates

### Internal (Slack)
```
🔴 P1 INCIDENT: PostHog Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]
```

### External (Status Page)
```
PostHog Integration Issue

We're experiencing issues with our PostHog integration.
Some users may experience [specific impact].

We're actively investigating and will provide updates.

Last updated: [timestamp]
```

## Post-Incident

### Evidence Collection
```bash
# Generate debug bundle
./scripts/posthog-debug-bundle.sh

# Export relevant logs
kubectl logs -l app=posthog-integration --since=1h > incident-logs.txt

# Capture metrics (last 2h of the error counter; GNU date shown, use `date -v-2H +%s` on macOS)
curl -sG 'localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=posthog_errors_total' \
  --data-urlencode "start=$(date -d '2 hours ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=60' > metrics.json
```
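
The debug bundle script referenced above is not included with this skill. A minimal sketch of what such a script might collect, using the same labels and placeholder endpoints as the rest of this runbook:

```bash
#!/usr/bin/env bash
# Hypothetical posthog-debug-bundle.sh: gather the artifacts referenced in this runbook.
set -u
BUNDLE="posthog-debug-$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p "$BUNDLE"

kubectl get pods -l app=posthog-integration -o wide  > "$BUNDLE/pods.txt"
kubectl describe deployment posthog-integration      > "$BUNDLE/deployment.txt"
kubectl logs -l app=posthog-integration --since=1h   > "$BUNDLE/logs.txt"
curl -s 'localhost:9090/api/v1/query?query=rate(posthog_errors_total[5m])' > "$BUNDLE/error-rate.json"
curl -sf https://api.yourapp.com/health              > "$BUNDLE/app-health.json"

tar czf "$BUNDLE.tar.gz" "$BUNDLE"
echo "wrote $BUNDLE.tar.gz"
```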

### Postmortem Template
```markdown
## Incident: PostHog [Error Type]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]

### Summary
[1-2 sentence description]

### Timeline
- HH:MM - [Event]
- HH:MM - [Event]

### Root Cause
[Technical explanation]

### Impact
- Users affected: N
- Revenue impact: $X

### Action Items
- [ ] [Preventive measure] - Owner - Due date
```

## Instructions

### Step 1: Quick Triage
Run the triage commands to identify the issue source.

### Step 2: Follow Decision Tree
Determine if the issue is PostHog-side or internal.

### Step 3: Execute Immediate Actions
Apply the appropriate remediation for the error type.

### Step 4: Communicate Status
Update internal and external stakeholders.

## Output
- Issue identified and categorized
- Remediation applied
- Stakeholders notified
- Evidence collected for postmortem

## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| Can't reach status page | Network issue | Use mobile or VPN |
| kubectl fails | Auth expired | Re-authenticate |
| Metrics unavailable | Prometheus down | Check backup metrics |
| Secret rotation fails | Permission denied | Escalate to admin |

## Examples

### One-Line Health Check
```bash
# jq -e returns a failing exit status when the field is missing or null, so the fallback message fires
curl -sf https://api.yourapp.com/health | jq -e '.services.posthog.status' || echo "UNHEALTHY"
```
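
During recovery it can be useful to repeat the same check on an interval, for example (same placeholder endpoint as above):

```bash
# Re-run the health check every 30 seconds while monitoring recovery
watch -n 30 "curl -sf https://api.yourapp.com/health | jq -e '.services.posthog.status' || echo UNHEALTHY"
```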

## Resources
- [PostHog Status Page](https://status.posthog.com)
- [PostHog Support](https://support.posthog.com)

## Next Steps
For data handling, see `posthog-data-handling`.

Overview

This skill executes a compact, actionable incident runbook for PostHog integrations to triage outages, mitigate impact, and produce a post-incident review. It guides responders through severity classification, quick checks, targeted remediations, and communication templates. The goal is to resolve PostHog-related failures fast and collect evidence for a clear postmortem.

How this skill works

The skill runs a series of diagnostic steps: check PostHog status, verify integration health, inspect recent error rates and logs, and determine whether the fault is upstream (PostHog) or internal. It maps common error codes (401/403, 429, 500/503) to concrete remediation commands and environment changes, and provides communication snippets for internal and external updates. After resolution it collects logs, metrics, and a postmortem template to capture root cause and action items.
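
The code-to-action mapping described above can be summarized as a small dispatch. A sketch, using the placeholder deployment name and environment flags from the runbook; the HTTP code is whatever your health check or logs surface:

```bash
# Sketch: map an observed HTTP status code to the runbook's remediation (placeholders assumed).
case "$HTTP_CODE" in
  401|403) echo "Auth failure: verify or rotate the API key secret, then restart the deployment" ;;
  429)     kubectl set env deployment/posthog-integration RATE_LIMIT_MODE=queue ;;
  500|503) kubectl set env deployment/posthog-integration POSTHOG_FALLBACK=true ;;
  *)       echo "Unmapped code $HTTP_CODE: follow the decision tree manually" ;;
esac
```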

When to use it

  • Responding to a suspected PostHog outage or integration failure
  • Investigating spikes in PostHog-related errors or latency
  • When PostHog API calls return 4xx/5xx responses
  • During on-call rotation for monitoring PostHog integrations
  • Preparing a post-incident review after a degraded or failed integration

Best practices

  • Start with severity classification (P1–P4) to drive response cadence
  • Run the quick triage commands immediately to gather status, health, metrics, and recent logs
  • Follow the decision tree to determine whether to wait for PostHog or remediate locally
  • Use the provided remediation snippets (rotate API key, enable fallback/queue modes) before scaling changes
  • Collect debug bundles, logs, and metrics for the postmortem and assign clear owners for action items

Example use cases

  • PostHog API unreachable for all users (P1): run status check, enable fallback, notify stakeholders
  • High error rate from PostHog (P2): inspect rate metrics, switch to queued requests, request PostHog limit increase
  • Authentication failures (401/403): verify secret, check for key rotation, update secret and restart deployment
  • Intermittent 500/503s: enable graceful degradation, update status page, monitor until PostHog resolves
  • Post-incident review: gather debug bundle, export logs and metrics, complete postmortem template

FAQ

What if status.posthog.com is down or unreachable?

Use a mobile network or VPN to confirm PostHog status. If you cannot reach the status page, proceed with local triage steps and apply safe mitigations (fallback/queue) while investigating network connectivity.

When should I contact PostHog support?

Contact PostHog if errors persist after local remediation, if you are consistently rate limited (429), or if PostHog status indicates a service incident that affects your integration.

How do I handle secret rotation failures?

If secret updates fail due to permissions, re-authenticate or escalate to the platform admin. Validate the new key in the PostHog dashboard before rolling it into production.