debug-master skill

This skill helps you lower MTTR and build more resilient distributed systems through AI-assisted tracing, autonomous remediation loops, and predictive observability.

npx playbooks add skill yuniorglez/gemini-elite-core --skill debug-master

SKILL.md

---
name: debug-master
id: debug-master
version: 1.1.0
description: "Senior Site Reliability Engineer & Debug Architect. Expert in AI-assisted observability, distributed tracing, and autonomous incident remediation in 2026."
---

# 🕵️‍♂️ Skill: Debug Master (v1.1.0)

## Executive Summary
The `debug-master` is a high-level specialist dedicated to the health, reliability, and observability of complex, distributed systems. In 2026, debugging is no longer a manual scavenger hunt through log files; it is an **Orchestrated Investigation** using AI-assisted tracing, predictive anomaly detection, and automated remediation loops. This skill focuses on minimizing MTTR (Mean Time To Repair) and maximizing system resilience through elite SRE standards.

---

## 📋 Table of Contents
1. [Incident Resolution Protocol](#incident-resolution-protocol)
2. [The "Do Not" List (Anti-Patterns)](#the-do-not-list-anti-patterns)
3. [Distributed Tracing (OpenTelemetry)](#distributed-tracing-opentelemetry)
4. [Autonomous Remediation](#autonomous-remediation)
5. [Predictive Observability](#predictive-observability)
6. [Fullstack Troubleshooting Layers](./references/advanced-troubleshooting-fullstack.md)
7. [Reference Library](#reference-library)

---

## 🛠️ Incident Resolution Protocol

Every incident follows the **Elite SRE Loop**:

1.  **Evidence Collection**: Correlate metrics, logs, and traces. Read the "Observability Graph" to find the service in red.
2.  **Impact Analysis**: Determine the blast radius. Is it a single user, a region, or the entire tenant base?
3.  **Isolation**: Use binary search (`git bisect`) and trace-filtering to isolate the logic or infra failure.
4.  **Surgical Fix / Rollback**: Apply a precise fix or execute a total rollback if the 5-minute MTTR window is exceeded.
5.  **Post-Mortem**: Generate an automated report summarizing the "Why" and store it in long-term vector memory.
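
The isolation step can be illustrated with the same binary search that `git bisect` performs over commits. The following is a minimal sketch, not part of the skill itself: `first_bad_release`, the release names, and the `is_bad` check are hypothetical, and it assumes the history is ordered oldest to newest, starts good, ends bad, and stays bad once the regression lands.

```python
from typing import Callable, Sequence


def first_bad_release(releases: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest release for which is_bad() is true.

    Assumes releases are ordered oldest to newest, the first one is known
    good, the last one is known bad, and the failure persists once introduced.
    """
    lo, hi = 0, len(releases) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(releases[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return releases[hi]


# Hypothetical history: the regression was introduced in "r5".
history = [f"r{i}" for i in range(1, 9)]
culprit = first_bad_release(history, lambda r: int(r[1:]) >= 5)
print(culprit)  # r5
```

With eight releases the search needs three checks instead of up to six linear ones, which matters when each check means deploying and probing a build.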

---

## 🚫 The "Do Not" List (Anti-Patterns)

| Anti-Pattern | Why it fails in 2026 | Modern Alternative |
| :--- | :--- | :--- |
| **"Guess and Check"** | Extremely slow and dangerous. | Use **Distributed Tracing**. |
| **Ignoring Warnings** | Leads to "Alert Fatigue" and outages. | Use **Dynamic SLO Tracking**. |
| **Manual Log Scraping** | Inefficient for large datasets. | Use **AI-Assisted Querying (o3)**. |
| **Hotfixing Production** | Bypasses CI/CD and causes drift. | Fix in **Feature Branch** + Deploy. |
| **Disabling RLS/Security** | Huge security risk for a "quick fix." | Fix the **Capability Scope**. |

---

## 🕸️ Distributed Tracing (OpenTelemetry)

We use **OTel** as our source of truth.
-   **Standard Spans**: Every operation must have a traceable span ID.
-   **Adaptive Sampling**: 100% errors, 1% healthy traffic.
-   **Context Propagation**: Mandatory headers for cross-service calls.

*See [References: Distributed Tracing](./references/distributed-tracing-otel.md) for setup.*
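
The adaptive sampling rule above (keep all errors, about 1% of healthy traffic) can be sketched as a small decision function. This is an illustrative stand-in, not the OpenTelemetry API: `keep_trace` and `healthy_rate` are hypothetical names, and it assumes the decision is made after the trace completes, since the error status is only known then.

```python
import hashlib


def keep_trace(trace_id: str, has_error: bool, healthy_rate: float = 0.01) -> bool:
    """Keep every trace with an error and roughly healthy_rate of the rest."""
    if has_error:
        return True
    # Hash the trace ID so every service reaches the same decision for a trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < healthy_rate


# Hypothetical trace IDs, all healthy: about 1% should survive sampling.
kept = sum(keep_trace(f"trace-{i}", has_error=False) for i in range(100_000))
print(f"healthy traces kept: {kept} of 100000")
```

Deriving the decision from the trace ID rather than from a random draw keeps traces whole: a span is never retained by one service and dropped by the next.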

---

## 🤖 Autonomous Remediation

In 2026, AI agents handle the triage.
-   **Detection**: Automatic anomaly triggers.
-   **Remediation**: Agents execute safe actions (scale up, cache clear).
-   **HITL Gate**: Humans approve destructive actions.

*See [References: Agentic Response](./references/agentic-incident-response.md) for patterns.*
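
A human-in-the-loop gate can be as simple as an allowlist plus an approval callback. The sketch below is one possible shape under stated assumptions: `SAFE_ACTIONS`, `Action`, `remediate`, and the service names are hypothetical, and a real gate would call an approval workflow instead of a lambda.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical policy: actions listed here are safe to run without approval.
SAFE_ACTIONS = {"scale_up", "clear_cache"}


@dataclass
class Action:
    name: str
    target: str


def remediate(action: Action, approve: Callable[[Action], bool]) -> str:
    """Run safe actions directly; anything else needs human approval first."""
    if action.name in SAFE_ACTIONS:
        return f"executed {action.name} on {action.target}"
    if approve(action):
        return f"executed {action.name} on {action.target} (approved)"
    return f"blocked {action.name} on {action.target}"


deny_all = lambda action: False  # stand-in for a real approval prompt
print(remediate(Action("clear_cache", "checkout-api"), deny_all))  # executed clear_cache on checkout-api
print(remediate(Action("drop_table", "orders-db"), deny_all))      # blocked drop_table on orders-db
```

Using an allowlist means an action the policy has never seen is treated as destructive by default.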

---

## 📈 Predictive Observability

Identify failures *before* they occur.
-   **Anomaly Detection**: Spotting memory leaks or CPU creep.
-   **Chaos Engineering**: Running agentic "stress tests" weekly.
-   **Dynamic SLOs**: Thresholds that adjust based on business importance.
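
One simple way to spot memory creep is to fit a line through recent samples and alert when the slope stays positive. The following sketch assumes equally spaced samples in megabytes; `slope`, `looks_like_leak`, the threshold, and the sample data are all hypothetical, and a production detector would also need to account for seasonality and noise.

```python
from typing import Sequence


def slope(samples: Sequence[float]) -> float:
    """Least-squares slope of samples taken at equal intervals."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def looks_like_leak(memory_mb: Sequence[float], threshold_mb_per_sample: float = 1.0) -> bool:
    """Flag a series whose fitted growth rate exceeds the threshold."""
    return slope(memory_mb) > threshold_mb_per_sample


steady = [512, 515, 510, 514, 511, 513]    # fluctuates around a flat baseline
creeping = [512, 518, 523, 531, 536, 544]  # grows about 6 MB per sample
print(looks_like_leak(steady), looks_like_leak(creeping))  # False True
```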

---

## 📖 Reference Library

Detailed deep-dives into SRE excellence:

- [**Distributed Tracing (OTel)**](./references/distributed-tracing-otel.md): Standardizing your observability.
- [**Agentic Incident Response**](./references/agentic-incident-response.md): The autonomous remediation loop.
- [**Predictive Observability**](./references/predictive-observability.md): Hardening systems for the future.
- [**Fullstack Troubleshooting**](./references/advanced-troubleshooting-fullstack.md): Layers of defense.

---

*Updated: January 22, 2026 - 18:30*

Overview

This skill is a senior-level SRE and debug architect focused on observability, distributed tracing, and autonomous incident remediation. It packages a proven incident resolution protocol, AI-assisted observability patterns, and agentic remediation loops designed to minimize MTTR and improve system resilience.

How this skill works

The skill inspects telemetry — metrics, logs, and OpenTelemetry traces — to construct an observability graph and surface the true blast radius of failures. It uses adaptive sampling, context propagation, and AI-assisted querying to correlate evidence, then proposes surgical fixes or safe rollback actions. Autonomous agents can execute non-destructive remediations with human-in-the-loop approval for destructive steps.
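
To make the blast-radius idea concrete, here is a minimal sketch that walks a service dependency graph from the failing service to everything that calls it, directly or transitively. The graph, the service names, and `blast_radius` are hypothetical; in practice the edges would be derived from trace data rather than written by hand.

```python
from collections import deque

# Hypothetical call graph: each service maps to the services it depends on.
DEPENDS_ON = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "search": ["inventory"],
    "payments": [],
    "inventory": [],
}


def blast_radius(failing: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or transitively calls `failing`."""
    # Invert the edges so we can walk from a dependency to its callers.
    callers: dict[str, list[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, []).append(service)

    affected: set[str] = set()
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


print(sorted(blast_radius("payments", DEPENDS_ON)))   # ['checkout', 'web']
print(sorted(blast_radius("inventory", DEPENDS_ON)))  # ['checkout', 'search', 'web']
```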

When to use it

During active incidents to rapidly collect correlated evidence and determine blast radius.
To implement or audit distributed tracing and OpenTelemetry instrumentation across services.
When building agentic remediation workflows that require safe HITL gates for destructive actions.
For running predictive observability and anomaly-detection campaigns to prevent outages.
To standardize post-incident reporting and long-term knowledge storage in vector memory.

Best practices

Always collect unified telemetry (metrics, logs, traces) before proposing fixes.
Instrument standard spans and mandatory context propagation headers for cross-service calls.
Use adaptive sampling: full capture for errors, low-rate sampling for healthy traffic.
Prefer a surgical fix shipped from a feature branch through CI/CD, or a rollback, over hotfixing production.
Gate destructive agent actions behind human approval and audit logging.
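
As an illustration of context propagation, the sketch below forwards the W3C Trace Context `traceparent` header (version, trace ID, parent span ID, flags) on an outgoing call. It is a simplified stand-in for what the OpenTelemetry SDK propagators normally do: `outgoing_headers` and `new_span_id` are hypothetical helpers, header keys are assumed to be lowercase, and all-zero IDs are not rejected.

```python
import re
import secrets

# W3C Trace Context header: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    return secrets.token_hex(8)  # 16 hex characters


def outgoing_headers(incoming: dict[str, str]) -> dict[str, str]:
    """Continue the caller's trace if one is present, otherwise start a new one."""
    match = TRACEPARENT.match(incoming.get("traceparent", ""))
    if match:
        trace_id, flags = match.group(2), match.group(4)
    else:
        trace_id, flags = secrets.token_hex(16), "01"  # new trace, marked sampled
    # The trace ID is preserved; the parent ID becomes this service's span ID.
    return {"traceparent": f"00-{trace_id}-{new_span_id()}-{flags}"}


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
print(outgoing_headers(incoming)["traceparent"])
```

The trace ID stays constant across hops while each service substitutes its own span ID, which is what lets a backend stitch the spans into a single trace.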

Example use cases

Triage a multi-region outage: correlate traces to find the failing service and isolate the faulty deployment.
Automate cache invalidation and safe scaling in response to detected memory pressure via an agentic loop.
Run weekly chaos experiments with agents to validate resilience and adjust dynamic SLOs.
Detect slow memory leaks using predictive observability and create remediation runbooks automatically.
Generate post-mortem summaries from correlated telemetry and store them in long-term vector memory for reuse.

FAQ

Does the skill perform destructive actions autonomously?

No. Destructive or risky actions require a human-in-the-loop approval; non-destructive remediations can be automated with strict safety policies.

What tracing standard does this skill rely on?

It uses OpenTelemetry as the source of truth with standard spans, context propagation, and adaptive sampling configured.