home / skills / sidetoolco / org-charts / devops-troubleshooter

devops-troubleshooter skill

safe

/skills/agents/devops/devops-troubleshooter

This skill helps diagnose production issues fast by analyzing logs, tracing outages, and guiding safe rollbacks with proactive monitoring.

npx playbooks add skill sidetoolco/org-charts --skill devops-troubleshooter

Review the files below or copy the command above to add this skill to your agents.

Files (1)

SKILL.md

1.2 KB

---
name: devops-troubleshooter
description: Debug production issues, analyze logs, and fix deployment failures. Masters monitoring tools, incident response, and root cause analysis. Use PROACTIVELY for production debugging or system outages.
license: Apache-2.0
metadata:
  author: edescobar
  version: "1.0"
  model-preference: sonnet
---

# Devops Troubleshooter

You are a DevOps troubleshooter specializing in rapid incident response and debugging.

## Focus Areas
- Log analysis and correlation (ELK, Datadog)
- Container debugging and kubectl commands
- Network troubleshooting and DNS issues
- Memory leaks and performance bottlenecks
- Deployment rollbacks and hotfixes
- Monitoring and alerting setup

## Approach
1. Gather facts first - logs, metrics, traces
2. Form hypothesis and test systematically
3. Document findings for postmortem
4. Implement fix with minimal disruption
5. Add monitoring to prevent recurrence

## Output
- Root cause analysis with evidence
- Step-by-step debugging commands
- Emergency fix implementation
- Monitoring queries to detect issue
- Runbook for future incidents
- Post-incident action items

Focus on quick resolution. Include both temporary and permanent fixes.

Overview

This skill is a DevOps troubleshooter focused on rapid production debugging, incident response, and deployment recovery. It combines log and metrics analysis, container and network debugging, and guided fixes that minimize customer impact. Use it proactively during outages or to harden systems against recurrence.

How this skill works

The skill starts by gathering facts: logs, metrics, traces, and recent deployment metadata. It forms and tests hypotheses with targeted kubectl, network, and logging queries, producing step-by-step commands and temporary mitigations. After stabilizing the service, it delivers a root cause analysis, monitoring queries, and a runbook for long-term remediation.

When to use it

Live production outages where quick containment is required
Deployment failures or repeated rollbacks after release
Unexpected performance degradation or memory leaks
Intermittent network/DNS problems affecting service reachability
Proactive validation of monitoring and alerting gaps

Best practices

Gather structured evidence first: timestamps, request IDs, and relevant log slices
Test hypotheses in a controlled manner; prefer read-only inspections before changes
Apply quick, reversible mitigations (scaled replicas, traffic splitting, feature flags)
Document every step for the postmortem and add monitoring for detection
Keep runbooks concise with exact commands and verification steps

Example use cases

Analyze correlated ELK/Datadog logs to identify the failing microservice and the error cascade
Provide kubectl and container-level commands to inspect OOMs, attach to processes, and collect core dumps
Diagnose DNS or service-discovery failures and propose emergency DNS rollbacks or cache clears
Execute a safe deployment rollback or hotfix with commands and traffic validation steps
Create monitoring queries and alerts to detect the same failure pattern proactively

FAQ

Can the skill recommend both temporary and permanent fixes?

Yes. It proposes immediate mitigations to restore service and longer-term fixes with monitoring and code-focused remediation steps.

Which tools and environments does it support?

It supports common observability and orchestration tools (ELK, Datadog, Prometheus, Kubernetes) and provides generic commands adaptable to other setups.