
observability-engineer skill

/plugins/specweave-infrastructure/skills/observability-engineer

This skill helps you design and deploy observability pipelines with OpenTelemetry, Prometheus, and Grafana to improve monitoring, tracing, and alerting.

This is most likely a fork of the sw-observability-engineer skill from openclaw
npx playbooks add skill anton-abyzov/specweave --skill observability-engineer

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: observability-engineer
description: Observability architect - OpenTelemetry-first, Prometheus+Grafana stack, SLIs/SLOs, alert fatigue prevention. Use for metrics, logs, traces setup.
model: opus
context: fork
---

## ⚠️ Chunking Rule

Large monitoring stacks (Prometheus + Grafana + OpenTelemetry + logs) = 1000+ lines. Generate ONE component per response: Metrics → Dashboards → Alerting → Tracing → Logs.

Overview

This skill acts as an observability architect: it designs and implements OpenTelemetry-first observability for TypeScript projects on a Prometheus + Grafana stack. It focuses on SLIs/SLOs, alert fatigue prevention, and pragmatic observability patterns that scale with production systems. Use it to produce concrete artifacts for metrics, dashboards, alerting, tracing, or logs, one component per request so each artifact stays clear and reviewable.

How this skill works

Provide a target component (Metrics, Dashboards, Alerting, Tracing, or Logs) and the service context. The skill generates code, configuration, and recommended SLI/SLO definitions tailored to TypeScript services and common CI/CD workflows. It prioritizes OpenTelemetry instrumentation, Prometheus metrics exposition, and Grafana dashboards while including guidance to reduce alert noise.
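As a rough illustration of what "Prometheus metrics exposition" means for a TypeScript service, the sketch below keeps an HTTP request counter in memory and renders it in the Prometheus text format. It is not the skill's actual output. The names `Counter`, `normalizeRoute`, and `recordRequest` are hypothetical, and a real service would normally rely on an OpenTelemetry or Prometheus client library rather than hand-rolled code.

```typescript
// Dependency-free sketch; all names here are hypothetical, not a library API.
type Labels = Record<string, string>;

/** Collapse numeric path segments so the `route` label stays low-cardinality. */
function normalizeRoute(path: string): string {
  return path
    .split("/")
    .map((segment) => (/^\d+$/.test(segment) ? ":id" : segment))
    .join("/");
}

/** Escape a label value: backslash, double quote, and newline. */
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

class Counter {
  // Maps a rendered label set (e.g. `method="GET",status="200"`) to its count.
  private readonly series: Record<string, number> = {};

  constructor(private readonly name: string, private readonly help: string) {}

  inc(labels: Labels, amount = 1): void {
    // Sort label names so the same label set always maps to the same series.
    const key = Object.keys(labels)
      .sort()
      .map((k) => `${k}="${escapeLabelValue(String(labels[k]))}"`)
      .join(",");
    this.series[key] = (this.series[key] ?? 0) + amount;
  }

  /** Render HELP/TYPE metadata followed by one sample line per series. */
  render(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    for (const key of Object.keys(this.series)) {
      lines.push(`${this.name}{${key}} ${this.series[key]}`);
    }
    return lines.join("\n") + "\n";
  }
}

const httpRequests = new Counter("http_requests_total", "Total HTTP requests handled.");

/** Record one finished request; call this from the HTTP layer of the service. */
function recordRequest(method: string, path: string, status: number): void {
  httpRequests.inc({ method, route: normalizeRoute(path), status: String(status) });
}
```

Serving `httpRequests.render()` from a `/metrics` endpoint would give Prometheus something to scrape. Normalizing the path before it becomes a label is the part worth keeping even when a client library replaces the rest.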

When to use it

  • When starting observability for a new TypeScript service and you need a repeatable plan.
  • When adding metrics or traces to an existing codebase with OpenTelemetry.
  • When defining SLIs/SLOs and converting them into Prometheus recording rules and alerts.
  • When building Grafana dashboards from instrumented metrics.
  • When tuning alerting to reduce noise and prevent fatigue.
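The SLI/SLO item above rests on simple arithmetic, which the following minimal sketch works through for an availability SLI. The type and function names (`SliWindow`, `availabilitySli`, `errorBudgetRemaining`) are assumptions made for illustration. In a Prometheus setup the same ratios would typically be computed by recording rules rather than in application code.

```typescript
// Hypothetical names throughout; this is plain arithmetic, not a library API.
interface SliWindow {
  totalRequests: number; // all requests observed in the SLO window
  goodRequests: number;  // requests that met the success criterion
}

/** Availability SLI: fraction of good requests (1 when there was no traffic). */
function availabilitySli(w: SliWindow): number {
  return w.totalRequests === 0 ? 1 : w.goodRequests / w.totalRequests;
}

/**
 * Fraction of the error budget still unspent for an SLO target such as 0.999.
 * 1 means untouched, 0 means exhausted, negative means the SLO is violated.
 */
function errorBudgetRemaining(w: SliWindow, sloTarget: number): number {
  const allowedBad = (1 - sloTarget) * w.totalRequests;
  const actualBad = w.totalRequests - w.goodRequests;
  if (allowedBad === 0) return actualBad === 0 ? 1 : 0;
  return 1 - actualBad / allowedBad;
}
```

For example, with 1,000,000 requests, 500 failures, and a 99.9% target, the SLO allows 1,000 failures, so about half of the budget remains.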

Best practices

  • Request and build one observability component per response to keep artifacts focused and reviewable.
  • Instrument code with OpenTelemetry semantic conventions before creating dashboards or alerts.
  • Define SLIs first, then derive SLOs and alerting thresholds from service-level behavior.
  • Use high-cardinality labels sparingly and rely on recording rules to precompute aggregates.
  • Favor multi-step alerting: advisory -> critical, with runbook links and actionable context.
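One way to realize the advisory -> critical staging is to classify alerts by error-budget burn rate over two windows. The sketch below assumes that approach. The thresholds 6 and 14.4 are illustrative values borrowed from common multi-window burn-rate practice, not numbers taken from this skill, and the function names are hypothetical.

```typescript
type Severity = "none" | "advisory" | "critical";

// Illustrative thresholds (assumptions, not from the skill); tune per service.
const ADVISORY_BURN_RATE = 6;
const CRITICAL_BURN_RATE = 14.4;

/**
 * Burn rate = observed error ratio divided by the error ratio the SLO allows.
 * A rate of 1 spends the budget exactly over the SLO period.
 */
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget);
}

/**
 * Classify from a long and a short window. Requiring both windows to exceed
 * the threshold keeps a brief spike from alerting and lets the alert clear
 * soon after the service recovers.
 */
function classifyBurn(longWindowRate: number, shortWindowRate: number): Severity {
  const sustained = Math.min(longWindowRate, shortWindowRate);
  if (sustained >= CRITICAL_BURN_RATE) return "critical";
  if (sustained >= ADVISORY_BURN_RATE) return "advisory";
  return "none";
}
```

A 2% error ratio against a 99.9% target burns budget roughly 20 times too fast, which this sketch treats as critical only if the short window agrees. In a Prometheus deployment the two rates would come from recording rules, and each severity would carry its own runbook link.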

Example use cases

  • Generate a Prometheus instrumentation snippet and a TypeScript wrapper for HTTP request metrics.
  • Create a Grafana dashboard spec with panels for latency SLOs, error budget, and throughput.
  • Produce alerting rules that implement a staged escalation and include silence schedules.
  • Provide OpenTelemetry trace sampling guidance and span naming for business transactions.
  • Design a log-correlation plan that links traces, metrics, and logs for incident troubleshooting.
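To make the log-correlation use case concrete, here is a small sketch that parses a W3C `traceparent` header and stamps the trace and span IDs onto a structured log line. It assumes the service receives trace context in that header and handles only the four-field layout. `parseTraceparent` and `logLine` are hypothetical helper names, and an OpenTelemetry SDK would normally do the parsing.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters
  sampled: boolean;
}

// version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

/** Parse a `traceparent` header value; returns null when it is invalid. */
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (match === null) return null;
  const version = String(match[1]);
  const traceId = String(match[2]);
  const spanId = String(match[3]);
  const flags = String(match[4]);
  if (version === "ff") return null; // version ff is forbidden
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null; // all-zero IDs are invalid
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

/** Build a JSON log line that carries the trace context when one is present. */
function logLine(level: string, message: string, ctx: TraceContext | null): string {
  const entry: Record<string, string> = { level, message };
  if (ctx !== null) {
    entry.trace_id = ctx.traceId;
    entry.span_id = ctx.spanId;
  }
  return JSON.stringify(entry);
}
```

With `trace_id` present in every log line, a Grafana panel or log query can pivot from a slow trace to the log entries emitted during it. The field names `trace_id` and `span_id` are a convention chosen here, not a requirement.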

FAQ

Can I ask for multiple components at once?

No. To keep output manageable and reviewable, request one component per response: Metrics, Dashboards, Alerting, Tracing, or Logs.

Does it produce runnable config and code?

Yes. Outputs include TypeScript snippets, Prometheus scrape/recording rules, Grafana JSON models, and OpenTelemetry config designed to be integrated into CI/CD.