home / skills / jeremylongshore / claude-code-plugins-plus-skills / vastai-incident-runbook
/plugins/saas-packs/vastai-pack/skills/vastai-incident-runbook
This skill guides rapid Vast.ai incident response with triage, remediation, and postmortem steps to minimize downtime.
npx playbooks add skill jeremylongshore/claude-code-plugins-plus-skills --skill vastai-incident-runbookReview the files below or copy the command above to add this skill to your agents.
---
name: vastai-incident-runbook
description: |
Execute Vast.ai incident response procedures with triage, mitigation, and postmortem.
Use when responding to Vast.ai-related outages, investigating errors,
or running post-incident reviews for Vast.ai integration failures.
Trigger with phrases like "vastai incident", "vastai outage",
"vastai down", "vastai on-call", "vastai emergency", "vastai broken".
allowed-tools: Read, Grep, Bash(kubectl:*), Bash(curl:*)
version: 1.0.0
license: MIT
author: Jeremy Longshore <[email protected]>
---
# Vast.ai Incident Runbook
## Overview
Rapid incident response procedures for Vast.ai-related outages.
## Prerequisites
- Access to Vast.ai dashboard and status page
- kubectl access to production cluster
- Prometheus/Grafana access
- Communication channels (Slack, PagerDuty)
## Severity Levels
| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Complete outage | < 15 min | Vast.ai API unreachable |
| P2 | Degraded service | < 1 hour | High latency, partial failures |
| P3 | Minor impact | < 4 hours | Webhook delays, non-critical errors |
| P4 | No user impact | Next business day | Monitoring gaps |
## Quick Triage
```bash
# 1. Check Vast.ai status
curl -s https://status.vastai.com | jq
# 2. Check our integration health
curl -s https://api.yourapp.com/health | jq '.services.vastai'
# 3. Check error rate (last 5 min)
curl -s localhost:9090/api/v1/query?query=rate(vastai_errors_total[5m])
# 4. Recent error logs
kubectl logs -l app=vastai-integration --since=5m | grep -i error | tail -20
```
## Decision Tree
```
Vast.ai API returning errors?
├─ YES: Is status.vastai.com showing incident?
│ ├─ YES → Wait for Vast.ai to resolve. Enable fallback.
│ └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
├─ YES → Likely resolved or intermittent. Monitor.
└─ NO → Our infrastructure issue. Check pods, memory, network.
```
## Immediate Actions by Error Type
### 401/403 - Authentication
```bash
# Verify API key is set
kubectl get secret vastai-secrets -o jsonpath='{.data.api-key}' | base64 -d
# Check if key was rotated
# → Verify in Vast.ai dashboard
# Remediation: Update secret and restart pods
kubectl create secret generic vastai-secrets --from-literal=api-key=NEW_KEY --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/vastai-integration
```
### 429 - Rate Limited
```bash
# Check rate limit headers
curl -v https://api.vastai.com 2>&1 | grep -i rate
# Enable request queuing
kubectl set env deployment/vastai-integration RATE_LIMIT_MODE=queue
# Long-term: Contact Vast.ai for limit increase
```
### 500/503 - Vast.ai Errors
```bash
# Enable graceful degradation
kubectl set env deployment/vastai-integration VASTAI_FALLBACK=true
# Notify users of degraded service
# Update status page
# Monitor Vast.ai status for resolution
```
## Communication Templates
### Internal (Slack)
```
🔴 P1 INCIDENT: Vast.ai Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]
```
### External (Status Page)
```
Vast.ai Integration Issue
We're experiencing issues with our Vast.ai integration.
Some users may experience [specific impact].
We're actively investigating and will provide updates.
Last updated: [timestamp]
```
## Post-Incident
### Evidence Collection
```bash
# Generate debug bundle
./scripts/vastai-debug-bundle.sh
# Export relevant logs
kubectl logs -l app=vastai-integration --since=1h > incident-logs.txt
# Capture metrics
curl "localhost:9090/api/v1/query_range?query=vastai_errors_total&start=2h" > metrics.json
```
### Postmortem Template
```markdown
## Incident: Vast.ai [Error Type]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]
### Summary
[1-2 sentence description]
### Timeline
- HH:MM - [Event]
- HH:MM - [Event]
### Root Cause
[Technical explanation]
### Impact
- Users affected: N
- Revenue impact: $X
### Action Items
- [ ] [Preventive measure] - Owner - Due date
```
## Instructions
### Step 1: Quick Triage
Run the triage commands to identify the issue source.
### Step 2: Follow Decision Tree
Determine if the issue is Vast.ai-side or internal.
### Step 3: Execute Immediate Actions
Apply the appropriate remediation for the error type.
### Step 4: Communicate Status
Update internal and external stakeholders.
## Output
- Issue identified and categorized
- Remediation applied
- Stakeholders notified
- Evidence collected for postmortem
## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| Can't reach status page | Network issue | Use mobile or VPN |
| kubectl fails | Auth expired | Re-authenticate |
| Metrics unavailable | Prometheus down | Check backup metrics |
| Secret rotation fails | Permission denied | Escalate to admin |
## Examples
### One-Line Health Check
```bash
curl -sf https://api.yourapp.com/health | jq '.services.vastai.status' || echo "UNHEALTHY"
```
## Resources
- [Vast.ai Status Page](https://status.vastai.com)
- [Vast.ai Support](https://support.vastai.com)
## Next Steps
For data handling, see `vastai-data-handling`.This skill executes a focused incident runbook for Vast.ai integrations, guiding triage, mitigation, communication, and postmortem tasks. It standardizes response for outages, authentication failures, rate limits, and service errors to reduce Mean Time to Resolution. Use the runbook to ensure consistent actions, evidence collection, and stakeholder updates.
The skill walks responders through quick triage commands to check Vast.ai status, integration health, error rates, and recent logs. It applies a decision tree to determine whether the problem is on Vast.ai's side or internal, then suggests immediate remediations per error type (401/403, 429, 500/503). It also provides communication templates, evidence collection commands, and a postmortem template to capture impact and action items.
What if status.vastai.com shows no incident but API calls fail?
Treat it as an internal integration issue: check credentials, secrets, pod health, and network. Collect logs and restart the integration after updating misconfigured secrets.
When should I enable graceful degradation?
Enable graceful degradation (VASTAI_FALLBACK=true) for widespread 500/503 errors or when Vast.ai is confirmed degraded to preserve partial functionality and reduce user impact.