home / skills / jeremylongshore / claude-code-plugins-plus-skills / instantly-incident-runbook
/plugins/saas-packs/instantly-pack/skills/instantly-incident-runbook
This skill guides rapid Instantly incident response with triage, remediation, and postmortem steps to minimize downtime and accelerate resolution.
npx playbooks add skill jeremylongshore/claude-code-plugins-plus-skills --skill instantly-incident-runbookReview the files below or copy the command above to add this skill to your agents.
---
name: instantly-incident-runbook
description: |
Execute Instantly incident response procedures with triage, mitigation, and postmortem.
Use when responding to Instantly-related outages, investigating errors,
or running post-incident reviews for Instantly integration failures.
Trigger with phrases like "instantly incident", "instantly outage",
"instantly down", "instantly on-call", "instantly emergency", "instantly broken".
allowed-tools: Read, Grep, Bash(kubectl:*), Bash(curl:*)
version: 1.0.0
license: MIT
author: Jeremy Longshore <[email protected]>
---
# Instantly Incident Runbook
## Overview
Rapid incident response procedures for Instantly-related outages.
## Prerequisites
- Access to Instantly dashboard and status page
- kubectl access to production cluster
- Prometheus/Grafana access
- Communication channels (Slack, PagerDuty)
## Severity Levels
| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Complete outage | < 15 min | Instantly API unreachable |
| P2 | Degraded service | < 1 hour | High latency, partial failures |
| P3 | Minor impact | < 4 hours | Webhook delays, non-critical errors |
| P4 | No user impact | Next business day | Monitoring gaps |
## Quick Triage
```bash
# 1. Check Instantly status
curl -s https://status.instantly.com | jq
# 2. Check our integration health
curl -s https://api.yourapp.com/health | jq '.services.instantly'
# 3. Check error rate (last 5 min)
curl -s localhost:9090/api/v1/query?query=rate(instantly_errors_total[5m])
# 4. Recent error logs
kubectl logs -l app=instantly-integration --since=5m | grep -i error | tail -20
```
## Decision Tree
```
Instantly API returning errors?
├─ YES: Is status.instantly.com showing incident?
│ ├─ YES → Wait for Instantly to resolve. Enable fallback.
│ └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
├─ YES → Likely resolved or intermittent. Monitor.
└─ NO → Our infrastructure issue. Check pods, memory, network.
```
## Immediate Actions by Error Type
### 401/403 - Authentication
```bash
# Verify API key is set
kubectl get secret instantly-secrets -o jsonpath='{.data.api-key}' | base64 -d
# Check if key was rotated
# → Verify in Instantly dashboard
# Remediation: Update secret and restart pods
kubectl create secret generic instantly-secrets --from-literal=api-key=NEW_KEY --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/instantly-integration
```
### 429 - Rate Limited
```bash
# Check rate limit headers
curl -v https://api.instantly.com 2>&1 | grep -i rate
# Enable request queuing
kubectl set env deployment/instantly-integration RATE_LIMIT_MODE=queue
# Long-term: Contact Instantly for limit increase
```
### 500/503 - Instantly Errors
```bash
# Enable graceful degradation
kubectl set env deployment/instantly-integration INSTANTLY_FALLBACK=true
# Notify users of degraded service
# Update status page
# Monitor Instantly status for resolution
```
## Communication Templates
### Internal (Slack)
```
🔴 P1 INCIDENT: Instantly Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]
```
### External (Status Page)
```
Instantly Integration Issue
We're experiencing issues with our Instantly integration.
Some users may experience [specific impact].
We're actively investigating and will provide updates.
Last updated: [timestamp]
```
## Post-Incident
### Evidence Collection
```bash
# Generate debug bundle
./scripts/instantly-debug-bundle.sh
# Export relevant logs
kubectl logs -l app=instantly-integration --since=1h > incident-logs.txt
# Capture metrics
curl "localhost:9090/api/v1/query_range?query=instantly_errors_total&start=2h" > metrics.json
```
### Postmortem Template
```markdown
## Incident: Instantly [Error Type]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]
### Summary
[1-2 sentence description]
### Timeline
- HH:MM - [Event]
- HH:MM - [Event]
### Root Cause
[Technical explanation]
### Impact
- Users affected: N
- Revenue impact: $X
### Action Items
- [ ] [Preventive measure] - Owner - Due date
```
## Instructions
### Step 1: Quick Triage
Run the triage commands to identify the issue source.
### Step 2: Follow Decision Tree
Determine if the issue is Instantly-side or internal.
### Step 3: Execute Immediate Actions
Apply the appropriate remediation for the error type.
### Step 4: Communicate Status
Update internal and external stakeholders.
## Output
- Issue identified and categorized
- Remediation applied
- Stakeholders notified
- Evidence collected for postmortem
## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| Can't reach status page | Network issue | Use mobile or VPN |
| kubectl fails | Auth expired | Re-authenticate |
| Metrics unavailable | Prometheus down | Check backup metrics |
| Secret rotation fails | Permission denied | Escalate to admin |
## Examples
### One-Line Health Check
```bash
curl -sf https://api.yourapp.com/health | jq '.services.instantly.status' || echo "UNHEALTHY"
```
## Resources
- [Instantly Status Page](https://status.instantly.com)
- [Instantly Support](https://support.instantly.com)
## Next Steps
For data handling, see `instantly-data-handling`.This skill executes a focused incident response runbook for Instantly-related outages, guiding triage, mitigation, communication, and post-incident review. It’s designed for on-call engineers who need a fast, repeatable checklist to restore service and collect evidence for postmortem analysis.
The skill walks you through quick checks (status page, integration health, error rates, recent logs), a decision tree to determine whether the issue is with Instantly or your integration, and targeted remediation steps for common error types (authentication, rate limiting, upstream errors). It also provides ready-to-send communication templates and post-incident evidence collection and postmortem structure.
What if the Instantly status page is down or unreachable?
Use a mobile network or VPN to rule out local network restrictions, and rely on your monitoring and error rates to assess impact while contacting Instantly support.
How do I decide between waiting for Instantly and applying a local fix?
If status.instantly.com shows an incident, enable graceful degradation or fallback and monitor. If their status is clear, prioritize checking keys, configs, and local infra before escalating to Instantly.