instantly-incident-runbook skill

/plugins/saas-packs/instantly-pack/skills/instantly-incident-runbook

This skill guides rapid Instantly incident response with triage, remediation, and postmortem steps to minimize downtime and accelerate resolution.

npx playbooks add skill jeremylongshore/claude-code-plugins-plus-skills --skill instantly-incident-runbook

---
name: instantly-incident-runbook
description: |
  Execute Instantly incident response procedures with triage, mitigation, and postmortem.
  Use when responding to Instantly-related outages, investigating errors,
  or running post-incident reviews for Instantly integration failures.
  Trigger with phrases like "instantly incident", "instantly outage",
  "instantly down", "instantly on-call", "instantly emergency", "instantly broken".
allowed-tools: Read, Grep, Bash(kubectl:*), Bash(curl:*)
version: 1.0.0
license: MIT
author: Jeremy Longshore
---

# Instantly Incident Runbook

## Overview
Rapid incident response procedures for Instantly-related outages.

## Prerequisites
- Access to Instantly dashboard and status page
- kubectl access to production cluster
- Prometheus/Grafana access
- Communication channels (Slack, PagerDuty)

## Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Complete outage | < 15 min | Instantly API unreachable |
| P2 | Degraded service | < 1 hour | High latency, partial failures |
| P3 | Minor impact | < 4 hours | Webhook delays, non-critical errors |
| P4 | No user impact | Next business day | Monitoring gaps |

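The table above can be encoded as a small lookup so that alerting scripts and chat messages quote the same targets. This is a minimal sketch: the function name `response_target` is hypothetical, and the values simply mirror the table.

```bash
# Hypothetical helper: map a severity level to its response-time target.
# Values mirror the severity table; adjust them if your on-call policy differs.
response_target() {
  case "$1" in
    P1) echo "< 15 min" ;;
    P2) echo "< 1 hour" ;;
    P3) echo "< 4 hours" ;;
    P4) echo "Next business day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_target P2` prints `< 1 hour`.
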
## Quick Triage

```bash
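# 1. Check Instantly status (the status page is HTML, not JSON, so confirm it
#    is reachable here and read the page itself for incident details)
curl -s -o /dev/null -w '%{http_code}\n' https://status.instantly.com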

# 2. Check our integration health
curl -s https://api.yourapp.com/health | jq '.services.instantly'

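# 3. Check error rate (last 5 min); -G with --data-urlencode keeps the shell
#    and curl from misreading the parentheses and brackets in the query
curl -sG localhost:9090/api/v1/query --data-urlencode 'query=rate(instantly_errors_total[5m])'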

# 4. Recent error logs
kubectl logs -l app=instantly-integration --since=5m | grep -i error | tail -20
```

## Decision Tree

```
Instantly API returning errors?
├─ YES: Is status.instantly.com showing incident?
│   ├─ YES → Enable fallback, then wait for Instantly to resolve.
│   └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
    ├─ YES → Likely resolved or intermittent. Monitor.
    └─ NO → Our infrastructure issue. Check pods, memory, network.
```
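
The decision tree above can be sketched as a shell function, which is handy when wiring the triage results into a script. The function name `triage_decision` and the output labels are hypothetical; the branches follow the tree one for one.

```bash
# Hypothetical helper: encode the decision tree as a function.
# Arguments are "yes" or "no":
#   $1  Is the Instantly API returning errors?
#   $2  Is status.instantly.com showing an incident?
#   $3  Is our own service healthy?
triage_decision() {
  local api_errors="$1" vendor_incident="$2" service_healthy="$3"
  if [ "$api_errors" = "yes" ]; then
    if [ "$vendor_incident" = "yes" ]; then
      echo "vendor-incident: enable fallback and wait for Instantly"
    else
      echo "integration-issue: check credentials and config"
    fi
  elif [ "$service_healthy" = "yes" ]; then
    echo "monitor: likely resolved or intermittent"
  else
    echo "infrastructure-issue: check pods, memory, network"
  fi
}
```

The answers still come from the triage commands; the function only makes the branching explicit and repeatable.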

## Immediate Actions by Error Type

### 401/403 - Authentication
```bash
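# Verify API key is set (prints only the last 4 characters so the full key
# does not end up in terminal scrollback or a screen share)
kubectl get secret instantly-secrets -o jsonpath='{.data.api-key}' | base64 -d | tail -c 4; echo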

# Check if key was rotated
# → Verify in Instantly dashboard

# Remediation: Update secret and restart pods
kubectl create secret generic instantly-secrets --from-literal=api-key=NEW_KEY --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/instantly-integration
```

### 429 - Rate Limited
```bash
# Check rate limit headers
curl -v https://api.instantly.com 2>&1 | grep -i rate

# Enable request queuing
kubectl set env deployment/instantly-integration RATE_LIMIT_MODE=queue

# Long-term: Contact Instantly for limit increase
```
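
When a 429 arrives, the wait time is sometimes stated in a standard `Retry-After` response header. Whether the Instantly API sends that header is an assumption, so the sketch below falls back to a default. The function name `retry_after_seconds` is hypothetical.

```bash
# Hypothetical helper: read HTTP response headers on stdin and print how many
# seconds to wait before retrying. Assumes the API may send a standard
# Retry-After header with a value in seconds; otherwise prints $1 (default 60).
retry_after_seconds() {
  local fallback="${1:-60}" value
  value=$(tr -d '\r' | awk -F': *' 'tolower($1) == "retry-after" { print $2; exit }')
  case "$value" in
    ''|*[!0-9]*) echo "$fallback" ;;   # header missing, or a date rather than seconds
    *)           echo "$value" ;;
  esac
}
```

A possible use is `curl -sI https://api.instantly.com | retry_after_seconds 30`, which prints the header value when present and `30` otherwise.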

### 500/503 - Instantly Errors
```bash
# Enable graceful degradation
kubectl set env deployment/instantly-integration INSTANTLY_FALLBACK=true

# Notify users of degraded service
# Update status page

# Monitor Instantly status for resolution
```
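
To route quickly from an observed status code to the matching subsection above, a small classifier can help. This is a sketch with a hypothetical function name and labels; it covers only the codes this runbook names and reports everything else as unclassified.

```bash
# Hypothetical helper: map an HTTP status code to the runbook subsection above.
runbook_section() {
  case "$1" in
    401|403) echo "authentication" ;;
    429)     echo "rate-limited" ;;
    500|503) echo "instantly-errors" ;;
    *)       echo "unclassified" ;;
  esac
}
```

One way to feed it is `runbook_section "$(curl -s -o /dev/null -w '%{http_code}' "$URL")"`, where `$URL` is whichever Instantly endpoint your integration calls.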

## Communication Templates

### Internal (Slack)
```
🔴 P1 INCIDENT: Instantly Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]
```
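
Filling the template by hand under pressure invites omissions, so it can be rendered from arguments instead. The sketch below is illustrative: `render_incident_message` is a hypothetical name, and the commented webhook call assumes a Slack incoming webhook URL stored in `SLACK_WEBHOOK_URL`.

```bash
# Hypothetical helper: fill in the internal Slack template above.
# Arguments: severity, status, impact, current action, next update, commander.
render_incident_message() {
  printf '🔴 %s INCIDENT: Instantly Integration\n' "$1"
  printf 'Status: %s\n' "$2"
  printf 'Impact: %s\n' "$3"
  printf 'Current action: %s\n' "$4"
  printf 'Next update: %s\n' "$5"
  printf 'Incident commander: @%s\n' "$6"
}

# Example (not run here): post the message through a Slack incoming webhook.
# SLACK_WEBHOOK_URL is an assumption; jq builds the JSON so quotes are escaped.
#   msg=$(render_incident_message P1 INVESTIGATING "Sends failing" "Checking API key" "14:30 UTC" alice)
#   jq -cn --arg text "$msg" '{text: $text}' |
#     curl -s -X POST -H 'Content-Type: application/json' --data @- "$SLACK_WEBHOOK_URL"
```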

### External (Status Page)
```
Instantly Integration Issue

We're experiencing issues with our Instantly integration.
Some users may experience [specific impact].

We're actively investigating and will provide updates.

Last updated: [timestamp]
```

## Post-Incident

### Evidence Collection
```bash
# Generate debug bundle
./scripts/instantly-debug-bundle.sh

# Export relevant logs
kubectl logs -l app=instantly-integration --since=1h > incident-logs.txt

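# Capture metrics for the last 2 hours (query_range needs Unix start/end
# timestamps and a step; a relative value such as "2h" is rejected)
now=$(date +%s)
curl -sG localhost:9090/api/v1/query_range \
  --data-urlencode 'query=instantly_errors_total' \
  --data "start=$((now - 7200))" --data "end=$now" --data 'step=60' > metrics.json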
```
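
Once the files above exist, it can help to package them into one timestamped archive so nothing is lost before the postmortem. This is a minimal sketch; `bundle_evidence` and the `incident-<id>-<timestamp>` naming are hypothetical conventions, not part of the original skill.

```bash
# Hypothetical helper: copy evidence files into a timestamped directory and
# compress it. $1 is an incident ID of your choosing; the rest are file paths.
# Prints the archive name on success.
bundle_evidence() {
  local id="$1"; shift
  local dir
  dir="incident-${id}-$(date -u +%Y%m%dT%H%M%SZ)"
  mkdir -p "$dir" || return 1
  cp -- "$@" "$dir"/ || return 1
  tar -czf "$dir.tar.gz" "$dir" && echo "$dir.tar.gz"
}
```

For example, `bundle_evidence INC-42 incident-logs.txt metrics.json` creates the archive in the current directory.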

### Postmortem Template
```markdown
## Incident: Instantly [Error Type]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]

### Summary
[1-2 sentence description]

### Timeline
- HH:MM - [Event]
- HH:MM - [Event]

### Root Cause
[Technical explanation]

### Impact
- Users affected: N
- Revenue impact: $X

### Action Items
- [ ] [Preventive measure] - Owner - Due date
```
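
The Duration field is easy to get wrong when computed by hand across time zones. A small sketch that derives it from two Unix timestamps (for example, from `date +%s` recorded at detection and at resolution); the function name `incident_duration` is hypothetical.

```bash
# Hypothetical helper: format the gap between two Unix timestamps the way the
# template's Duration field expects ("X hours Y minutes"). Seconds are dropped.
incident_duration() {
  local start="$1" end="$2"
  local total_min=$(( (end - start) / 60 ))
  echo "$(( total_min / 60 )) hours $(( total_min % 60 )) minutes"
}
```

For instance, a start of `1700000000` and an end of `1700005400` (5400 seconds apart) yields `1 hours 30 minutes`.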

## Instructions

### Step 1: Quick Triage
Run the triage commands to identify the issue source.

### Step 2: Follow Decision Tree
Determine if the issue is Instantly-side or internal.

### Step 3: Execute Immediate Actions
Apply the appropriate remediation for the error type.

### Step 4: Communicate Status
Update internal and external stakeholders.

### Step 5: Collect Evidence
Generate the debug bundle and export logs and metrics for the postmortem.

## Output
- Issue identified and categorized
- Remediation applied
- Stakeholders notified
- Evidence collected for postmortem

## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| Can't reach status page | Local network issue | Retry over a mobile network or VPN |
| kubectl fails | Auth expired | Re-authenticate |
| Metrics unavailable | Prometheus down | Check backup metrics |
| Secret rotation fails | Permission denied | Escalate to admin |

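Several of the failures in the table only surface once a command has already failed mid-incident. A preflight check at the start of a shift can catch missing tooling earlier. This is a sketch with a hypothetical function name; it verifies only that the commands exist on `PATH`, not that credentials are valid.

```bash
# Hypothetical helper: print any of the given commands that are not on PATH.
# Exit status is 0 only when nothing is missing.
missing_tools() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      missing=1
    fi
  done
  return "$missing"
}
```

A typical call for this runbook would be `missing_tools kubectl curl jq base64`.
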
## Examples

### One-Line Health Check
```bash
# jq -e exits non-zero on null, false, or empty input, so a failed curl also prints UNHEALTHY
curl -sf https://api.yourapp.com/health | jq -e '.services.instantly.status' || echo "UNHEALTHY"
```

## Resources
- [Instantly Status Page](https://status.instantly.com)
- [Instantly Support](https://support.instantly.com)

## Next Steps
For data handling, see `instantly-data-handling`.

Overview

This skill executes a focused incident response runbook for Instantly-related outages, guiding triage, mitigation, communication, and post-incident review. It’s designed for on-call engineers who need a fast, repeatable checklist to restore service and collect evidence for postmortem analysis.

How this skill works

The skill walks you through quick checks (status page, integration health, error rates, recent logs), a decision tree to determine whether the issue is with Instantly or your integration, and targeted remediation steps for common error types (authentication, rate limiting, upstream errors). It also provides ready-to-send communication templates and post-incident evidence collection and postmortem structure.

When to use it

- When Instantly API calls are failing or returning errors from production.
- When integration latency or partial failures degrade user experience.
- During on-call shifts triggered by alerts related to Instantly metrics.
- When investigating unknown errors tied to the Instantly integration.
- When preparing a post-incident review after an Instantly-related outage.

Best practices

- Run the quick triage commands immediately: status page, integration health, recent logs, and error rate queries.
- Classify severity (P1–P4) to set response time and communication cadence.
- Follow the decision tree: confirm whether the issue is Instantly-side or internal before applying fixes.
- Collect evidence (logs, metrics, debug bundle) early for accurate postmortem analysis.
- Use the provided communication templates to keep stakeholders informed and avoid conflicting messages.

Example use cases

- A P1 alert (Instantly API unreachable): run triage, verify status.instantly.com, enable fallback, and notify users.
- High 5-minute error spike: query Prometheus for instantly_errors_total and inspect recent pod logs to identify code or config regressions.
- Authentication failures (401/403): verify and rotate the API key in Kubernetes secrets and restart the deployment.
- Rate-limited responses (429): enable request queuing mode and contact Instantly for quota increases.
- Post-incident: generate the debug bundle, export logs and metrics, then fill in the postmortem template for follow-up.

FAQ

What if the Instantly status page is down or unreachable?

Use a mobile network or VPN to rule out local network restrictions, and rely on your monitoring and error rates to assess impact while contacting Instantly support.

How do I decide between waiting for Instantly and applying a local fix?

If status.instantly.com shows an incident, enable graceful degradation or fallback and monitor. If their status is clear, prioritize checking keys, configs, and local infra before escalating to Instantly.