home / skills / jeremylongshore / claude-code-plugins-plus-skills / vastai-incident-runbook

This skill guides rapid Vast.ai incident response with triage, remediation, and postmortem steps to minimize downtime.

npx playbooks add skill jeremylongshore/claude-code-plugins-plus-skills --skill vastai-incident-runbook

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
5.2 KB
---
name: vastai-incident-runbook
description: |
  Execute Vast.ai incident response procedures with triage, mitigation, and postmortem.
  Use when responding to Vast.ai-related outages, investigating errors,
  or running post-incident reviews for Vast.ai integration failures.
  Trigger with phrases like "vastai incident", "vastai outage",
  "vastai down", "vastai on-call", "vastai emergency", "vastai broken".
allowed-tools: Read, Grep, Bash(kubectl:*), Bash(curl:*)
version: 1.0.0
license: MIT
author: Jeremy Longshore <[email protected]>
---

# Vast.ai Incident Runbook

## Overview
Rapid incident response procedures for Vast.ai-related outages.

## Prerequisites
- Access to Vast.ai dashboard and status page
- kubectl access to production cluster
- Prometheus/Grafana access
- Communication channels (Slack, PagerDuty)

## Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Complete outage | < 15 min | Vast.ai API unreachable |
| P2 | Degraded service | < 1 hour | High latency, partial failures |
| P3 | Minor impact | < 4 hours | Webhook delays, non-critical errors |
| P4 | No user impact | Next business day | Monitoring gaps |

## Quick Triage

```bash
# 1. Check Vast.ai status
curl -s https://status.vastai.com | jq

# 2. Check our integration health
curl -s https://api.yourapp.com/health | jq '.services.vastai'

# 3. Check error rate (last 5 min)
curl -s localhost:9090/api/v1/query?query=rate(vastai_errors_total[5m])

# 4. Recent error logs
kubectl logs -l app=vastai-integration --since=5m | grep -i error | tail -20
```

## Decision Tree

```
Vast.ai API returning errors?
├─ YES: Is status.vastai.com showing incident?
│   ├─ YES → Wait for Vast.ai to resolve. Enable fallback.
│   └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
    ├─ YES → Likely resolved or intermittent. Monitor.
    └─ NO → Our infrastructure issue. Check pods, memory, network.
```

## Immediate Actions by Error Type

### 401/403 - Authentication
```bash
# Verify API key is set
kubectl get secret vastai-secrets -o jsonpath='{.data.api-key}' | base64 -d

# Check if key was rotated
# → Verify in Vast.ai dashboard

# Remediation: Update secret and restart pods
kubectl create secret generic vastai-secrets --from-literal=api-key=NEW_KEY --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/vastai-integration
```

### 429 - Rate Limited
```bash
# Check rate limit headers
curl -v https://api.vastai.com 2>&1 | grep -i rate

# Enable request queuing
kubectl set env deployment/vastai-integration RATE_LIMIT_MODE=queue

# Long-term: Contact Vast.ai for limit increase
```

### 500/503 - Vast.ai Errors
```bash
# Enable graceful degradation
kubectl set env deployment/vastai-integration VASTAI_FALLBACK=true

# Notify users of degraded service
# Update status page

# Monitor Vast.ai status for resolution
```

## Communication Templates

### Internal (Slack)
```
🔴 P1 INCIDENT: Vast.ai Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]
```

### External (Status Page)
```
Vast.ai Integration Issue

We're experiencing issues with our Vast.ai integration.
Some users may experience [specific impact].

We're actively investigating and will provide updates.

Last updated: [timestamp]
```

## Post-Incident

### Evidence Collection
```bash
# Generate debug bundle
./scripts/vastai-debug-bundle.sh

# Export relevant logs
kubectl logs -l app=vastai-integration --since=1h > incident-logs.txt

# Capture metrics
curl "localhost:9090/api/v1/query_range?query=vastai_errors_total&start=2h" > metrics.json
```

### Postmortem Template
```markdown
## Incident: Vast.ai [Error Type]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]

### Summary
[1-2 sentence description]

### Timeline
- HH:MM - [Event]
- HH:MM - [Event]

### Root Cause
[Technical explanation]

### Impact
- Users affected: N
- Revenue impact: $X

### Action Items
- [ ] [Preventive measure] - Owner - Due date
```

## Instructions

### Step 1: Quick Triage
Run the triage commands to identify the issue source.

### Step 2: Follow Decision Tree
Determine if the issue is Vast.ai-side or internal.

### Step 3: Execute Immediate Actions
Apply the appropriate remediation for the error type.

### Step 4: Communicate Status
Update internal and external stakeholders.

## Output
- Issue identified and categorized
- Remediation applied
- Stakeholders notified
- Evidence collected for postmortem

## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| Can't reach status page | Network issue | Use mobile or VPN |
| kubectl fails | Auth expired | Re-authenticate |
| Metrics unavailable | Prometheus down | Check backup metrics |
| Secret rotation fails | Permission denied | Escalate to admin |

## Examples

### One-Line Health Check
```bash
curl -sf https://api.yourapp.com/health | jq '.services.vastai.status' || echo "UNHEALTHY"
```

## Resources
- [Vast.ai Status Page](https://status.vastai.com)
- [Vast.ai Support](https://support.vastai.com)

## Next Steps
For data handling, see `vastai-data-handling`.

Overview

This skill executes a focused incident runbook for Vast.ai integrations, guiding triage, mitigation, communication, and postmortem tasks. It standardizes response for outages, authentication failures, rate limits, and service errors to reduce Mean Time to Resolution. Use the runbook to ensure consistent actions, evidence collection, and stakeholder updates.

How this skill works

The skill walks responders through quick triage commands to check Vast.ai status, integration health, error rates, and recent logs. It applies a decision tree to determine whether the problem is on Vast.ai's side or internal, then suggests immediate remediations per error type (401/403, 429, 500/503). It also provides communication templates, evidence collection commands, and a postmortem template to capture impact and action items.

When to use it

  • Vast.ai API is unreachable or returning errors
  • High latency, elevated error rates, or partial failures involving Vast.ai
  • Authentication failures or suspected secret rotation for Vast.ai keys
  • Rate limiting impacting throughput to Vast.ai
  • After resolving an incident to collect evidence and run a postmortem

Best practices

  • Start with the one-line health check and status.vastai.com before changing infrastructure
  • Classify severity (P1–P4) quickly and assign an incident commander for P1/P2 events
  • Follow decision tree: confirm Vast.ai-side issues before rolling back or changing local configs
  • Use environment flags (fallback, rate limit queue) to enable graceful degradation instead of immediate rollback
  • Collect logs, metrics, and a debug bundle immediately for the postmortem

Example use cases

  • Vast.ai API returns 500 errors across many requests — enable VASTAI_FALLBACK and monitor status.vastai.com
  • Unexpected 401 failures after a deploy — verify and rotate the API key secret, then restart the integration deployment
  • Spike in vastai_errors_total metric — set RATE_LIMIT_MODE=queue and contact Vast.ai to request limit increases
  • Intermittent webhook delays — run quick triage, capture recent pod logs, update status page and schedule a postmortem

FAQ

What if status.vastai.com shows no incident but API calls fail?

Treat it as an internal integration issue: check credentials, secrets, pod health, and network. Collect logs and restart the integration after updating misconfigured secrets.

When should I enable graceful degradation?

Enable graceful degradation (VASTAI_FALLBACK=true) for widespread 500/503 errors or when Vast.ai is confirmed degraded to preserve partial functionality and reduce user impact.