
devops-troubleshooter skill

/skills/agents/devops/devops-troubleshooter

This skill helps diagnose production issues quickly by analyzing logs, tracing outages, guiding safe rollbacks, and setting up proactive monitoring.

npx playbooks add skill sidetoolco/org-charts --skill devops-troubleshooter

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: devops-troubleshooter
description: Debug production issues, analyze logs, and fix deployment failures. Masters monitoring tools, incident response, and root cause analysis. Use PROACTIVELY for production debugging or system outages.
license: Apache-2.0
metadata:
  author: edescobar
  version: "1.0"
  model-preference: sonnet
---

# DevOps Troubleshooter

You are a DevOps troubleshooter specializing in rapid incident response and debugging.

## Focus Areas
- Log analysis and correlation (ELK, Datadog)
- Container debugging and kubectl commands
- Network troubleshooting and DNS issues
- Memory leaks and performance bottlenecks
- Deployment rollbacks and hotfixes
- Monitoring and alerting setup

## Approach
1. Gather facts first: logs, metrics, traces
2. Form a hypothesis and test it systematically
3. Document findings for postmortem
4. Implement fix with minimal disruption
5. Add monitoring to prevent recurrence

## Output
- Root cause analysis with evidence
- Step-by-step debugging commands
- Emergency fix implementation
- Monitoring queries to detect the issue
- Runbook for future incidents
- Post-incident action items

Focus on quick resolution. Include both temporary and permanent fixes.

Overview

This skill is a DevOps troubleshooter focused on rapid production debugging, incident response, and deployment recovery. It combines log and metrics analysis, container and network debugging, and guided fixes that minimize customer impact. Use it during outages, or proactively to harden systems against recurrence.

How this skill works

The skill starts by gathering facts: logs, metrics, traces, and recent deployment metadata. It forms and tests hypotheses with targeted kubectl, network, and logging queries, producing step-by-step commands and temporary mitigations. After stabilizing the service, it delivers a root cause analysis, monitoring queries, and a runbook for long-term remediation.
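The fact-gathering step described above can be sketched as a small helper that emits only read-only kubectl commands. This is a minimal sketch, not the skill's actual implementation: the function name `triage_commands` and the sample namespace, deployment, and pod names are hypothetical, and the command set is just one reasonable starting point.

```python
import shlex


def triage_commands(namespace: str, deployment: str, pod: str) -> list[str]:
    """Build read-only kubectl commands for first-pass fact gathering.

    All three arguments are placeholders supplied by the caller
    (hypothetical names, not taken from the skill itself).
    """
    ns = shlex.quote(namespace)
    dep = shlex.quote(deployment)
    p = shlex.quote(pod)
    return [
        # Pod placement and status across nodes
        f"kubectl get pods -n {ns} -o wide",
        # Recent cluster events, oldest first
        f"kubectl get events -n {ns} --sort-by=.lastTimestamp",
        # Container state, restart counts, last termination reason
        f"kubectl describe pod {p} -n {ns}",
        # Logs from the previous (crashed) container instance
        f"kubectl logs {p} -n {ns} --previous --tail=200",
        # Recent deployment revisions, to correlate with the incident start
        f"kubectl rollout history deployment/{dep} -n {ns}",
    ]


# Example with made-up names: print the commands for review before running them.
for command in triage_commands("prod", "checkout", "checkout-7d9f-abcde"):
    print(command)
```

Generating the commands as strings rather than executing them keeps a human in the loop, which fits the preference for inspection before any change.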

When to use it

  • Live production outages where quick containment is required
  • Deployment failures or repeated rollbacks after release
  • Unexpected performance degradation or memory leaks
  • Intermittent network/DNS problems affecting service reachability
  • Proactive validation of monitoring and alerting gaps

Best practices

  • Gather structured evidence first: timestamps, request IDs, and relevant log slices
  • Test hypotheses in a controlled manner; prefer read-only inspections before changes
  • Apply quick, reversible mitigations (scaled replicas, traffic splitting, feature flags)
  • Document every step for the postmortem and add monitoring for detection
  • Keep runbooks concise with exact commands and verification steps
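As an illustration of gathering structured evidence, the sketch below pulls the log lines for one request ID within a time window around the incident. It assumes a simple, hypothetical log layout of `<ISO timestamp> <request_id> <message>`; real ELK or Datadog exports will differ, and the function names are invented for this example.

```python
from datetime import datetime, timedelta


def parse_line(line: str) -> tuple[datetime, str, str]:
    """Split '<ISO timestamp> <request_id> <message>' into its three parts."""
    stamp, request_id, message = line.split(" ", 2)
    return datetime.fromisoformat(stamp), request_id, message


def evidence_slice(lines, request_id, center, window_seconds=30):
    """Return log lines for one request within +/- window_seconds of center."""
    window = timedelta(seconds=window_seconds)
    selected = []
    for line in lines:
        stamp, rid, _ = parse_line(line)
        if rid == request_id and abs(stamp - center) <= window:
            selected.append(line)
    return selected


# Made-up sample data in the assumed format.
sample_logs = [
    "2024-05-01T12:00:01 req-1 GET /cart 200",
    "2024-05-01T12:00:03 req-2 POST /checkout 500 upstream timeout",
    "2024-05-01T12:00:04 req-2 retry 1 failed",
    "2024-05-01T12:05:00 req-2 retry 9 failed",
]

incident_time = datetime(2024, 5, 1, 12, 0, 3)
for line in evidence_slice(sample_logs, "req-2", incident_time):
    print(line)
```

Keeping the original lines intact, rather than reformatting them, makes the slice easy to paste into a postmortem as evidence.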

Example use cases

  • Analyze correlated ELK/Datadog logs to identify the failing microservice and the error cascade
  • Provide kubectl and container-level commands to inspect OOMs, attach to processes, and collect core dumps
  • Diagnose DNS or service-discovery failures and propose emergency DNS rollbacks or cache clears
  • Execute a safe deployment rollback or hotfix with commands and traffic validation steps
  • Create monitoring queries and alerts to detect the same failure pattern proactively
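The last use case, detecting the same failure pattern again, can be illustrated with a sliding-window error-rate check. This is a sketch under stated assumptions: the class name `ErrorRateDetector`, the window size, and the threshold are all hypothetical, and a production setup would more likely express the same rule as an alert in the monitoring system itself.

```python
from collections import deque


class ErrorRateDetector:
    """Track the last `window` responses and flag a high 5xx error rate."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        # deque with maxlen drops the oldest outcome automatically
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, status_code: int) -> bool:
        """Record one response; return True when the alert condition holds."""
        self.outcomes.append(status_code >= 500)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data to judge yet
        error_rate = sum(self.outcomes) / len(self.outcomes)
        return error_rate > self.threshold


# Example with illustrative values: alert when more than 20% of the
# last 10 responses are server errors.
detector = ErrorRateDetector(window=10, threshold=0.2)
statuses = [200] * 10 + [500, 503, 500]
alerts = [detector.observe(code) for code in statuses]
```

Requiring a full window before alerting avoids firing on the first failed request after a restart, at the cost of a short detection delay.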

FAQ

Can the skill recommend both temporary and permanent fixes?

Yes. It proposes immediate mitigations to restore service and longer-term fixes with monitoring and code-focused remediation steps.
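For a deployment failure, the temporary half of that answer might look like the rollback plan sketched below, with an inspection step before the change and a verification step after it. The function name `rollback_plan` and the sample names are hypothetical; the sketch only builds the commands and does not run them.

```python
def rollback_plan(namespace: str, deployment: str) -> list[tuple[str, str]]:
    """Return (phase, command) pairs for a reversible rollback.

    Both arguments are caller-supplied placeholders.
    """
    target = f"deployment/{deployment} -n {namespace}"
    return [
        # Read-only: confirm which revision is live and what came before it
        ("inspect", f"kubectl rollout history {target}"),
        # Temporary fix: return to the previous revision
        ("mitigate", f"kubectl rollout undo {target}"),
        # Verification: wait until the rollback has finished rolling out
        ("verify", f"kubectl rollout status {target} --timeout=120s"),
    ]


# Example with made-up names.
for phase, command in rollback_plan("prod", "checkout"):
    print(f"[{phase}] {command}")
```

The permanent fix, such as a code change plus a new alert, would follow once the service is stable.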

Which tools and environments does it support?

It supports common observability and orchestration tools (ELK, Datadog, Prometheus, Kubernetes) and provides generic commands adaptable to other setups.