bmad-observability-readiness skill

/.claude/skills/bmad-observability-readiness

This skill delivers a comprehensive observability plan with metrics, logs, traces, dashboards, and runbooks to improve reliability and diagnostics.

npx playbooks add skill bacoco/bmad-skills --skill bmad-observability-readiness

Review the files below or copy the command above to add this skill to your agents.

Files (9)
SKILL.md
3.3 KB
---
name: bmad-observability-readiness
description: Establishes instrumentation, monitoring, and alerting foundations.
allowed-tools: ["Read", "Write", "Grep", "Bash"]
metadata:
  auto-invoke: true
  triggers:
    patterns:
      - "add logging"
      - "monitoring setup"
      - "no telemetry"
      - "instrument this"
      - "observability gaps"
      - "alert fatigue"
      - "SLO dashboard"
    keywords:
      - observability
      - logging
      - monitoring
      - tracing
      - metrics
      - alerting
      - telemetry
  capabilities:
    - instrumentation-design
    - metrics-cataloging
    - logging-standards
    - alert-tuning
    - slo-definition
  prerequisites:
    - bmad-architecture-design
    - bmad-test-strategy
  outputs:
    - observability-plan
    - instrumentation-backlog
    - slo-dashboard-spec
---

# BMAD Observability Readiness Skill

## When to Invoke

Use this skill when the user:
- Mentions missing or low-quality logging, metrics, or tracing.
- Requests monitoring/alerting setup before a launch or major release.
- Needs SLOs, dashboards, or on-call runbooks.
- Reports alert fatigue or noise that needs rationalization.
- Wants to ensure performance and reliability work has data coverage.

If instrumentation already exists and only specific bug fixes are required, hand over to `bmad-development-execution` with the backlog produced here.

## Mission

Deliver a comprehensive observability plan that enables diagnosis, alerting, and measurement across the system. Ensure downstream performance, reliability, and security work has trustworthy telemetry.

## Inputs Required

- Architecture diagrams and component inventory.
- Existing logging/monitoring/tracing configuration (if any).
- Current incidents, outages, or blind spots experienced by the team.
- SLAs/SLOs, business KPIs, or compliance reporting requirements.

## Outputs

- **Observability plan** detailing metrics, logs, traces, dashboards, and retention policies.
- **Instrumentation backlog** with implementation tasks, owners, and acceptance criteria.
- **SLO dashboard specification** covering golden signals, alert thresholds, and runbook links (a skeletal example follows this list).
- Updated runbooks or escalation paths where gaps are discovered.
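
As a rough illustration of the dashboard specification, a single-journey entry might look like the sketch below. The journey name, thresholds, and runbook URL are placeholders, not recommendations.

```python
# Hypothetical skeleton of one SLO dashboard spec entry; every name,
# threshold, and URL here is an illustrative placeholder.
checkout_api_spec = {
    "journey": "checkout",
    "golden_signals": {
        "latency":    {"sli": "p99 request duration", "slo": "< 500 ms"},
        "errors":     {"sli": "5xx ratio",            "slo": "< 0.1% of requests"},
        "traffic":    {"sli": "requests per second",  "slo": "dashboard only"},
        "saturation": {"sli": "worker utilization",   "slo": "< 80%"},
    },
    "alert_thresholds": {
        "page":   "error budget burn rate > 14x over 1 h",
        "ticket": "error budget burn rate > 2x over 24 h",
    },
    "runbook": "https://runbooks.example.com/checkout-api",  # placeholder
}
```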

## Process

1. Audit current telemetry coverage, tooling, and data retention. Document gaps.
2. Define observability objectives aligned with user journeys and business KPIs.
3. Design instrumentation strategy: metrics taxonomy, structured logging, trace spans, event schemas (a minimal sketch follows this list).
4. Establish SLOs, SLIs, and alerting strategy with on-call expectations and noise controls.
5. Produce dashboards/reporting requirements and data governance notes.
6. Create backlog with prioritized instrumentation tasks and verification approach.
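
A minimal sketch of step 3, using Python's stdlib `logging` for structured JSON output and the OpenTelemetry API for a trace span. The field names, the `checkout` scope, and the OpenTelemetry dependency are all assumptions about the chosen stack, not requirements of this skill.

```python
import json
import logging

from opentelemetry import trace  # assumes the opentelemetry-api package

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so log fields stay machine-parseable."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Structured fields attached at the call site via `extra=`.
            "order_id": getattr(record, "order_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

tracer = trace.get_tracer("checkout")  # illustrative instrumentation scope

def place_order(order_id: str) -> None:
    # One span per unit of work; attribute names follow the agreed taxonomy.
    with tracer.start_as_current_span("checkout.place_order") as span:
        span.set_attribute("order.id", order_id)
        logger.info("order placed", extra={"order_id": order_id})
```

Without a configured SDK the OpenTelemetry API is a no-op, so this pattern can land before the exporter decision is final.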

## Quality Gates

- Every critical user journey has metrics and alerts defined (latency, errors, saturation, traffic).
- Logging standards specify structure, PII handling, and retention (a redaction sketch follows this list).
- Alert runbooks documented or flagged for creation.
- Observability plan references integration with performance, security, and incident workflows.
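
For the logging-standards gate, one common pattern is a `logging.Filter` that scrubs designated PII fields before any handler sees them. This is a minimal sketch; the field list is a placeholder for whatever the standard actually designates.

```python
import logging

PII_FIELDS = {"email", "phone", "card_number"}  # illustrative; set by the standard

class RedactPII(logging.Filter):
    """Blank out designated PII attributes before a record reaches handlers."""
    def filter(self, record: logging.LogRecord) -> bool:
        for field in PII_FIELDS:
            if hasattr(record, field):
                setattr(record, field, "[REDACTED]")
        return True  # keep the record, just scrubbed

logger = logging.getLogger("payments")
logger.addHandler(logging.StreamHandler())
logger.addFilter(RedactPII())
logger.warning("charge declined", extra={"email": "user@example.com"})
# Any formatter that reads record.email now sees "[REDACTED]".
```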

## Error Handling

- If telemetry tooling is undecided, present comparative options with trade-offs.
- Highlight dependencies on platform teams or infrastructure before finalizing timeline.
- Escalate when observability requirements conflict with compliance or privacy constraints.

Overview

This skill establishes instrumentation, monitoring, and alerting foundations to make systems observable and measurable. It delivers a concrete observability plan, prioritized instrumentation backlog, and SLO dashboard specifications to enable fast diagnosis and reliable alerting. The focus is on practical outputs that support launches, incident response, and ongoing reliability work.

How this skill works

I audit existing telemetry (logs, metrics, traces), tooling, and retention to document coverage and gaps. I define observability objectives tied to user journeys and business KPIs, then design a concrete instrumentation strategy (metrics taxonomy, structured logs, trace spans). Finally, I produce SLOs, dashboards, runbooks, and a prioritized implementation backlog with acceptance criteria.
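
The audit step can start mechanically before anyone reads a dashboard. The sketch below, assuming a conventional src/ layout and a crude keyword heuristic, flags Python modules with no telemetry markers at all so a human can review the real gaps.

```python
import re
from pathlib import Path

# Crude heuristic: a module matching none of these markers likely emits
# no telemetry. The patterns and src/ layout are assumptions.
TELEMETRY_MARKERS = re.compile(r"logging\.getLogger|get_tracer|metrics\.")

def find_uninstrumented(root: str = "src") -> list[Path]:
    gaps = []
    for module in Path(root).rglob("*.py"):
        if not TELEMETRY_MARKERS.search(module.read_text(errors="ignore")):
            gaps.append(module)
    return gaps

if __name__ == "__main__":
    for path in find_uninstrumented():
        print(f"no telemetry markers: {path}")
```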

When to use it

  • Before a launch or major release to ensure monitoring and alerts are in place
  • When logging, metrics, or tracing are missing, inconsistent, or low quality
  • If you need SLOs, dashboards, or formalized runbooks for on-call teams
  • When alert fatigue or noisy alerts require rationalization
  • When performance or reliability work lacks trustworthy telemetry

Best practices

  • Start with critical user journeys and define SLIs that map to business KPIs
  • Use structured logging and consistent naming for metrics and spans (see the naming helper after this list)
  • Prioritize golden signals (latency, errors, traffic, saturation) per service
  • Define guardrails for PII and retention to satisfy privacy and compliance
  • Create small, verifiable instrumentation tasks with owners and test criteria
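
One lightweight way to keep naming consistent is to centralize it behind a tiny helper. The `<service>.<subsystem>.<measurement>` convention below is an illustrative choice, not a standard this skill mandates.

```python
import re

# Assumed convention: lowercase dot-separated segments, units suffixed
# on the measurement (e.g. request_duration_ms).
_SEGMENT = re.compile(r"^[a-z][a-z0-9_]*$")

def metric_name(service: str, subsystem: str, measurement: str) -> str:
    """Build a metric name, failing fast on anything off-convention."""
    for part in (service, subsystem, measurement):
        if not _SEGMENT.match(part):
            raise ValueError(f"bad metric segment: {part!r}")
    return ".".join((service, subsystem, measurement))

# metric_name("checkout", "http", "request_duration_ms")
# -> "checkout.http.request_duration_ms"
```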

Example use cases

  • Audit a microservices stack to identify telemetry blind spots and propose fixes
  • Create an SLO dashboard and alert thresholds for a customer-facing API (a burn-rate sketch follows this list)
  • Define a metrics taxonomy and logging standard for a new platform
  • Rationalize noisy alerts and produce a runbook and on-call expectations
  • Produce a prioritized backlog to instrument payment or auth flows end-to-end
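
For the SLO use case, the arithmetic behind a burn-rate alert is worth seeing once. The sketch below uses the multiwindow thresholds popularized by the Google SRE workbook; the 99.9% SLO and the observed error ratio are illustrative numbers.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the error budget is burning; 1.0 means exactly on budget."""
    allowed = 1.0 - slo  # a 99.9% SLO leaves a 0.1% error budget
    return error_ratio / allowed

# Suppose 1.5% of requests failed over the last hour against a 99.9% SLO:
rate = burn_rate(error_ratio=0.015, slo=0.999)  # -> 15.0

# At 15x, a 30-day budget is gone in about two days, so this should page.
# Thresholds follow the SRE-workbook multiwindow pattern.
if rate > 14.4:      # evaluated over a ~1 h window
    print("page: fast burn")
elif rate > 6.0:     # evaluated over a ~6 h window in practice
    print("page: slow burn")
```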

FAQ

What inputs do I need to provide?

Provide architecture diagrams, component inventory, current telemetry configs, recent incidents, and any SLAs or KPI targets.

What outputs will I receive?

You get an observability plan, an instrumentation backlog with owners and acceptance criteria, SLO dashboard specs, and updated runbook/escalation notes.