
observability skill


This skill helps you implement end-to-end observability across microservices with OpenTelemetry and Jaeger: distributed tracing to map service dependencies, plus log and metric collection for a complete view of system behavior.

npx playbooks add skill pluginagentmarketplace/custom-plugin-devops --skill observability

Review the files below or copy the command above to add this skill to your agents.

Files (8)
SKILL.md
760 B
---
name: observability
description: Distributed tracing with Jaeger, OpenTelemetry, and observability platforms for microservices insights
sasmp_version: "1.3.0"
bonded_agent: 06-monitoring-observability
bond_type: SECONDARY_BOND
---

# Observability Skill

## MANDATORY
- Three pillars: Logs, Metrics, Traces
- OpenTelemetry instrumentation
- Jaeger distributed tracing
- Trace context propagation
- Service dependency mapping

## OPTIONAL
- Zipkin tracing
- New Relic APM
- Datadog APM
- Honeycomb observability
- SLOs and error budgets

## ADVANCED
- Custom instrumentation
- Sampling strategies
- Continuous profiling
- Chaos engineering integration
- Observability-driven development

## Assets
- See `assets/observability-config.yaml` for tracing setup

Overview

This skill provides distributed tracing and observability patterns for microservices using Jaeger and OpenTelemetry. It helps teams capture logs, metrics, and traces to understand system behavior and dependencies. The focus is practical instrumentation, trace context propagation, and service dependency mapping to accelerate debugging and performance tuning.

How this skill works

Instrument services with OpenTelemetry libraries to emit traces, metrics, and logs. Traces are exported to Jaeger (with optional Zipkin or third-party APM integrations) and stitched across services via propagated trace context. Collected data is used to build dependency maps, visualize request flows, and identify latency or error hotspots.
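
As an illustration, the Python sketch below wires the OpenTelemetry SDK to a Jaeger backend over OTLP (recent Jaeger releases accept OTLP directly). The service name, endpoint address, span name, and attributes are placeholders, and the snippet assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed.

```python
# Minimal OpenTelemetry setup exporting spans to Jaeger over OTLP.
# Assumption: opentelemetry-sdk and opentelemetry-exporter-otlp are installed;
# "checkout-service" and "http://jaeger:4317" are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# The service.name resource attribute is what appears in Jaeger's service map.
resource = Resource.create({"service.name": "checkout-service"})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://jaeger:4317", insecure=True)))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Each unit of work becomes a span; attributes make traces searchable.
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("order.id", "A-1042")
```

OpenTelemetry also ships auto-instrumentation packages for common frameworks and clients, which can emit most of these spans without hand-written code.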

When to use it

  • Deploying or debugging microservices with hard-to-reproduce latency or errors
  • Establishing end-to-end observability during CI/CD and progressive rollout
  • Monitoring service-to-service dependencies and request flow
  • Validating SLOs and investigating error budget consumption
  • Integrating observability into developer workflows for faster root-cause analysis

Best practices

  • Instrument critical paths first, then expand to broader coverage
  • Propagate trace context across all RPCs, queues, and async boundaries
  • Apply sensible sampling strategies to control telemetry volume (propagation and sampling are both sketched in code after this list)
  • Correlate logs, metrics, and traces via consistent IDs and tags
  • Use service maps and latency histograms to prioritize optimization work
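
A minimal sketch of the propagation and sampling practices above, assuming the same OpenTelemetry Python SDK setup as earlier; the service URL, span names, and the 10% sampling ratio are illustrative only.

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sampling: keep ~10% of new traces; child spans follow their parent's
# decision, so a trace is either recorded end to end or not at all.
trace.set_tracer_provider(
    TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.1))))
tracer = trace.get_tracer(__name__)

# Caller: inject W3C traceparent/tracestate headers into the outgoing request.
def call_inventory():
    with tracer.start_as_current_span("call-inventory"):
        headers = {}
        inject(headers)
        requests.get("http://inventory:8080/stock", headers=headers)  # placeholder URL

# Callee: extract the propagated context so this span joins the caller's trace.
def handle_stock_request(request_headers):
    ctx = extract(request_headers)
    with tracer.start_as_current_span("check-stock", context=ctx):
        return {"in_stock": True}
```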

Example use cases

  • Trace a user request across frontend, auth, and backend to find the slow service
  • Measure tail latency during canary deployments to decide rollback thresholds (see the histogram sketch after this list)
  • Map hidden service dependencies before refactoring or scaling
  • Integrate traces with alerting to reduce mean time to resolution for incidents
  • Experiment with custom instrumentation to capture business-specific context
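
For the canary tail-latency case, one hedged sketch using the OpenTelemetry metrics API: record request durations into a histogram tagged with the rollout track, then compare p95/p99 between canary and stable in whichever backend receives the metrics. The endpoint, metric name, and attribute keys below are placeholders.

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Export metrics periodically via OTLP; "collector:4317" is a placeholder.
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://collector:4317", insecure=True))
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout-service")
request_latency = meter.create_histogram(
    "http.server.duration", unit="ms",
    description="Server-side request latency")

# Record each request's latency, tagged with the rollout track so canary
# and stable tail latency can be compared on a dashboard or alert.
request_latency.record(182.5, {"http.route": "/checkout", "deployment.track": "canary"})
```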

FAQ

Which telemetry components are mandatory?

Collect logs, metrics, and traces as the three observability pillars; use OpenTelemetry for consistent instrumentation.
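
To tie the logs pillar to traces, one common approach (a sketch, not something this skill prescribes) is to stamp every log record with the active trace and span IDs so a log line can be looked up against its trace in Jaeger; the optional opentelemetry-instrumentation-logging package can do this automatically.

```python
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Attach the active trace and span IDs to every log record."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
handler.addFilter(TraceContextFilter())
logging.getLogger().addHandler(handler)
```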

Can this work with commercial APMs?

Yes. While Jaeger is the primary tracing backend, the instrumentation supports exporting to Zipkin, New Relic, Datadog, or Honeycomb as optional backends.
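
As a sketch of that flexibility: the Zipkin exporter is a drop-in span exporter, and vendors that accept OTLP generally need only an endpoint plus an auth header. The endpoint URLs and header name below are placeholders, not real vendor values; consult the vendor's documentation for the actual ones.

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Zipkin backend (requires the opentelemetry-exporter-zipkin-json package).
from opentelemetry.exporter.zipkin.json import ZipkinExporter
zipkin = ZipkinExporter(endpoint="http://zipkin:9411/api/v2/spans")

# OTLP-compatible vendor backend; endpoint and header name are hypothetical.
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
vendor = OTLPSpanExporter(
    endpoint="https://otlp.example-vendor.com:4317",
    headers=(("x-vendor-api-key", "YOUR_API_KEY"),))

# Multiple span processors can run side by side during a migration.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(zipkin))
provider.add_span_processor(BatchSpanProcessor(vendor))
```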