
observability-monitoring skill

/plugins/ork/skills/observability-monitoring

This skill helps you implement structured logging, metrics, tracing, and alerting to improve observability across services and quickly diagnose issues.

npx playbooks add skill yonatangross/orchestkit --skill observability-monitoring

Review the files below or copy the command above to add this skill to your agents.

---
name: Observability & Monitoring
description: Use when adding logging, metrics, tracing, or alerting to applications. Observability & Monitoring covers structured logging, Prometheus metrics, OpenTelemetry tracing, and alerting strategies.
tags: [observability, monitoring, metrics, logging, tracing]
context: fork
agent: metrics-architect
version: 1.0.0
category: Operations & Reliability
agents: [backend-system-architect, code-quality-reviewer, llm-integrator]
keywords: [observability, monitoring, logging, metrics, tracing, alerts, Prometheus, OpenTelemetry]
author: OrchestKit
user-invocable: false
---

# Observability & Monitoring Skill

Comprehensive frameworks for implementing observability, including structured logging, metrics, distributed tracing, and alerting.

## Overview

Use this skill when:

- Setting up application monitoring
- Implementing structured logging
- Adding metrics and dashboards
- Configuring distributed tracing
- Creating alerting rules
- Debugging production issues

## Three Pillars of Observability

```
+-----------------+-----------------+-----------------+
|     LOGS        |     METRICS     |     TRACES      |
+-----------------+-----------------+-----------------+
| What happened   | How is system   | How do requests |
| at specific     | performing      | flow through    |
| point in time   | over time       | services        |
+-----------------+-----------------+-----------------+
```

## References

### Logging Patterns
**See: `references/logging-patterns.md`**

Key topics covered:
- Correlation IDs for cross-service request tracking (see the middleware sketch after this list)
- Log sampling strategies for high-traffic systems
- LogQL queries for Loki log aggregation
- OrchestKit structlog configuration example
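
The correlation-ID pattern comes down to minting (or reusing) an ID per request and making it available to every log call. A minimal sketch for an Express service using Node's built-in `AsyncLocalStorage`; the `x-correlation-id` header name and the `withCorrelation` helper are illustrative, not part of the reference:

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";
import express from "express";

// Holds the correlation ID for the lifetime of each request.
const requestContext = new AsyncLocalStorage<{ correlationId: string }>();

const app = express();

app.use((req, res, next) => {
  // Reuse an ID from an upstream service, or mint a new one.
  const correlationId = req.header("x-correlation-id") ?? randomUUID();
  res.setHeader("x-correlation-id", correlationId);
  requestContext.run({ correlationId }, next);
});

// Any log call can attach the current request's correlation ID.
export function withCorrelation(fields: Record<string, unknown>) {
  return { ...fields, correlationId: requestContext.getStore()?.correlationId };
}
```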

### Metrics Collection
**See: `references/metrics-collection.md`**

Key topics covered:
- Counter, Gauge, Histogram, Summary metric types
- Cardinality management and limits
- Custom business metrics (LLM tokens, cache hit rates)
- LLM cost tracking with Prometheus (see the counter sketch below)
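
Metric and label names below are illustrative, not OrchestKit's actual metrics; the idea is that token usage and cost can be tracked with two `prom-client` counters:

```typescript
import { Counter } from "prom-client";

// Tokens consumed, broken down by a small, bounded label set.
const llmTokensTotal = new Counter({
  name: "llm_tokens_total",
  help: "LLM tokens consumed",
  labelNames: ["model", "operation", "token_type"],
});

// Estimated spend in USD, same bounded labels.
const llmCostUsdTotal = new Counter({
  name: "llm_cost_usd_total",
  help: "Estimated LLM cost in USD",
  labelNames: ["model", "operation"],
});

export function recordLlmUsage(
  model: string,
  operation: string,
  promptTokens: number,
  completionTokens: number,
  costUsd: number,
): void {
  llmTokensTotal.labels(model, operation, "prompt").inc(promptTokens);
  llmTokensTotal.labels(model, operation, "completion").inc(completionTokens);
  llmCostUsdTotal.labels(model, operation).inc(costUsd);
}
```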

### Distributed Tracing
**See: `references/distributed-tracing.md`**

Key topics covered:
- OpenTelemetry setup and auto-instrumentation (see the setup sketch after this list)
- Span relationships (parent/child, parallel)
- Head-based and tail-based sampling strategies
- Trace context propagation across services
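
A minimal bootstrap along these lines, assuming `@opentelemetry/sdk-node`, `@opentelemetry/auto-instrumentations-node`, and an OTLP collector endpoint; the service name and the 10% head-based sampling ratio are illustrative choices:

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { ParentBasedSampler, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

const sdk = new NodeSDK({
  serviceName: "content-api", // illustrative name
  traceExporter: new OTLPTraceExporter({ url: "http://localhost:4318/v1/traces" }),
  // Head-based sampling: keep ~10% of root traces; child spans follow the parent's decision.
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush remaining spans on shutdown.
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```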

### Alerting and Dashboards
**See: `references/alerting-dashboards.md`**

Key topics covered:
- Alert severity levels and response times
- Alert grouping and inhibition rules
- Escalation policies and runbook links
- Golden Signals dashboard design
- SLO/SLI definitions and error budgets

## Quick Reference

### Log Levels

| Level | Use Case |
|-------|----------|
| **ERROR** | Unhandled exceptions, failed operations |
| **WARN** | Deprecated API, retry attempts |
| **INFO** | Business events, successful operations |
| **DEBUG** | Development troubleshooting |
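
As a rough illustration of these levels with structured JSON output (the templates use Winston; field names below are made up for the example):

```typescript
import { createLogger, format, transports } from "winston";

// JSON output so log aggregators (Loki, ELK) can index every field.
const logger = createLogger({
  level: process.env.LOG_LEVEL ?? "info",
  format: format.combine(format.timestamp(), format.json()),
  transports: [new transports.Console()],
});

logger.error("payment failed", { orderId: "ord_123", reason: "card_declined" });
logger.warn("retrying upstream call", { attempt: 2, service: "billing" });
logger.info("order created", { orderId: "ord_123", amountCents: 4200 });
logger.debug("cache lookup", { key: "user:42", hit: false });
```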

### RED Method (Rate, Errors, Duration)

Essential metrics for any service:
- **Rate** - Requests per second
- **Errors** - Failed requests per second
- **Duration** - Request latency distribution

### Prometheus Buckets

```typescript
// HTTP request latency (seconds)
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]

// Database query latency (seconds)
buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1]
```
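
A sketch of how these buckets plug into RED-style instrumentation with `prom-client` (metric names are illustrative):

```typescript
import { Counter, Histogram } from "prom-client";

// Duration: HTTP latency in seconds, using the buckets above.
const httpRequestDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request latency in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
});

// Rate and Errors: total requests, filterable by status code.
const httpRequestsTotal = new Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "route", "status_code"],
});

// Typical use from middleware: time the handler, then record both metrics.
export function observeRequest(method: string, route: string, status: number, seconds: number) {
  const labels = { method, route, status_code: String(status) };
  httpRequestDuration.observe(labels, seconds);
  httpRequestsTotal.inc(labels);
}
```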

### Key Alerts

| Alert | Condition | Severity |
|-------|-----------|----------|
| ServiceDown | `up == 0` for 1m | Critical |
| HighErrorRate | 5xx > 5% for 5m | Critical |
| HighLatency | p95 > 2s for 5m | High |
| LowCacheHitRate | < 70% for 10m | Medium |

### Health Checks (Kubernetes)

| Probe | Purpose | Endpoint |
|-------|---------|----------|
| **Liveness** | Is app running? | `/health` |
| **Readiness** | Ready for traffic? | `/ready` |
| **Startup** | Finished starting? | `/startup` |
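
A minimal Express sketch of the three probes; the `db.ping()` stub and the port are placeholders for whatever the service actually depends on:

```typescript
import express from "express";

// Placeholder dependency check; swap in a real database or cache ping.
const db = { ping: async (): Promise<void> => undefined };

let started = false; // flipped once initialization has finished
const app = express();

// Liveness: the process is up; keep this check dependency-free.
app.get("/health", (_req, res) => res.status(200).json({ status: "ok" }));

// Readiness: accept traffic only if dependencies respond.
app.get("/ready", async (_req, res) => {
  try {
    await db.ping();
    res.status(200).json({ status: "ready" });
  } catch {
    res.status(503).json({ status: "not_ready" });
  }
});

// Startup: lets Kubernetes wait out slow initialization before liveness applies.
app.get("/startup", (_req, res) =>
  res.status(started ? 200 : 503).json({ status: started ? "started" : "starting" })
);

app.listen(3000, () => {
  started = true; // initialization considered done once the server is listening
});
```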

## Observability Checklist

### Implementation
- [ ] JSON structured logging
- [ ] Request correlation IDs
- [ ] RED metrics (Rate, Errors, Duration)
- [ ] Business metrics
- [ ] Distributed tracing
- [ ] Health check endpoints

### Alerting
- [ ] Service outage alerts
- [ ] Error rate thresholds
- [ ] Latency thresholds
- [ ] Resource utilization alerts

### Dashboards
- [ ] Service overview
- [ ] Error analysis
- [ ] Performance metrics

## Templates Reference

| Template | Purpose |
|----------|---------|
| `structured-logging.ts` | Winston logger with request middleware |
| `prometheus-metrics.ts` | HTTP, DB, cache metrics with middleware |
| `opentelemetry-tracing.ts` | Distributed tracing setup |
| `alerting-rules.yml` | Prometheus alerting rules |
| `health-checks.ts` | Liveness, readiness, startup probes |

## Langfuse Integration

For LLM observability, use Langfuse decorators:

```python
from langfuse.decorators import observe, langfuse_context

@observe(name="analyze_content")
async def analyze_content(url: str) -> AnalysisResult:
    langfuse_context.update_current_trace(
        name="content_analysis",
        user_id="system",
        metadata={"url": url}
    )
    # ... workflow implementation
```

See `examples/orchestkit-monitoring-dashboard.md` for real-world examples.

## Extended Thinking Triggers

Use Opus 4.5 extended thinking for:
- **Incident investigation** - Correlating logs, metrics, traces
- **Alert tuning** - Reducing noise, catching real issues
- **Architecture decisions** - Choosing monitoring solutions
- **Performance debugging** - Cross-service latency analysis

---

## Related Skills

- `defense-in-depth` - Layer 8 observability as part of security architecture
- `devops-deployment` - Observability integration with CI/CD and Kubernetes
- `resilience-patterns` - Monitoring circuit breakers and failure scenarios
- `fastapi-advanced` - FastAPI-specific middleware for logging and metrics

## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Log format | Structured JSON | Machine-parseable, supports log aggregation, enables queries |
| Metric types | RED method (Rate, Errors, Duration) | Industry standard, covers essential service health indicators |
| Tracing | OpenTelemetry | Vendor-neutral, auto-instrumentation, broad ecosystem support |
| Alerting severity | 4 levels (Critical, High, Medium, Low) | Clear escalation paths, appropriate response times |

---

## Capability Details

### structured-logging
**Keywords:** logging, structured log, json log, correlation id, log level, winston, pino, structlog
**Solves:**
- How do I set up structured logging?
- Implement correlation IDs across services
- JSON logging best practices

### correlation-tracking
**Keywords:** correlation id, request tracking, trace context, distributed logs
**Solves:**
- How do I track requests across services?
- Implement correlation IDs in middleware
- Find all logs for a single request

### log-sampling
**Keywords:** log sampling, high traffic logging, sampling rate, log volume
**Solves:**
- How do I reduce log volume in production?
- Sample INFO logs while keeping all errors

### prometheus-metrics
**Keywords:** metrics, prometheus, counter, histogram, gauge, summary, red method
**Solves:**
- How do I collect application metrics?
- Implement RED method (Rate, Errors, Duration)
- Choose between Counter, Gauge, Histogram

### metric-types
**Keywords:** counter, gauge, histogram, summary, bucket, quantile
**Solves:**
- When to use Counter vs Gauge?
- Histogram vs Summary for latency
- Configure histogram buckets

### cardinality-management
**Keywords:** cardinality, label explosion, time series, prometheus performance
**Solves:**
- How do I prevent label cardinality explosions?
- Fix unbounded labels (user IDs, request IDs)

### distributed-tracing
**Keywords:** tracing, distributed tracing, opentelemetry, span, trace id, waterfall
**Solves:**
- How do I implement distributed tracing?
- OpenTelemetry setup with auto-instrumentation
- Create manual spans for custom operations

### trace-sampling
**Keywords:** trace sampling, head-based sampling, tail-based sampling
**Solves:**
- How do I reduce trace volume?
- Sample 10% of traces but keep all errors

### alerting-strategy
**Keywords:** alert, alerting, notification, threshold, pagerduty, slack, severity
**Solves:**
- How do I set up effective alerts?
- Define alert severity levels (P1-P4)

### alert-fatigue-prevention
**Keywords:** alert fatigue, alert grouping, inhibition, escalation
**Solves:**
- How do I reduce alert noise?
- Group related alerts together

### dashboards
**Keywords:** dashboard, visualization, grafana, golden signals, red method
**Solves:**
- How do I create monitoring dashboards?
- Design Golden Signals dashboard layout
- Build SLO/SLI dashboards

### health-checks
**Keywords:** health check, liveness, readiness, startup probe, kubernetes
**Solves:**
- How do I implement health check endpoints?
- Difference between liveness and readiness

### langfuse-observability
**Keywords:** langfuse, llm observability, llm tracing, token usage, llm cost tracking
**Solves:**
- How do I monitor LLM calls with Langfuse?
- Track LLM token usage and cost

### llm-cost-tracking
**Keywords:** llm cost, token tracking, cost optimization, prometheus llm metrics
**Solves:**
- How do I track LLM costs with Prometheus?
- Measure token usage by model and operation

## Overview

This skill provides a practical, production-ready guide for adding observability to applications using structured logging, Prometheus metrics, OpenTelemetry tracing, and alerting strategies. It bundles templates and patterns for JSON logging, RED metrics, distributed traces, health checks, and dashboard and alert design. Use it to instrument services so you can detect, investigate, and resolve production issues faster.

## How this skill works

The skill maps common observability needs to concrete implementations: structured-logging middleware, Prometheus metrics collectors and buckets, and OpenTelemetry tracing setup with sampling strategies. It also includes alerting rule templates, health-check endpoints, and dashboard guidance (Golden Signals, SLOs/SLIs). Templates and checklists make it straightforward to add correlation IDs, reduce cardinality, and tune alerts for low noise.

## When to use it

- Adding request/response logging and correlation IDs across services
- Instrumenting latency, error rate, and throughput using RED metrics
- Setting up distributed tracing for cross-service request flow and root-cause analysis
- Creating Prometheus alerts and Grafana dashboards for on-call teams
- Implementing health checks and readiness probes for Kubernetes deployments

## Best practices

- Use structured JSON logs and include correlation IDs to link logs, traces, and metrics
- Instrument Rate, Errors, and Duration (RED) plus key business metrics such as cache hit rate and LLM token usage
- Manage label cardinality: don't use unbounded values (user IDs, request IDs) as metric labels
- Apply head- and tail-based trace sampling: keep all error traces, sample normal traffic
- Define clear alert severities, group related alerts, and link runbooks to reduce alert fatigue

## Example use cases

- Add a Winston or Pino JSON logger with request middleware and correlation-ID propagation
- Expose Prometheus counters, histograms, and gauges for HTTP, DB, and cache metrics with recommended buckets
- Enable OpenTelemetry auto-instrumentation and create manual spans around critical business operations
- Create Prometheus alerting rules: ServiceDown, HighErrorRate, HighLatency, LowCacheHitRate
- Build Grafana dashboards: service overview, error analysis, performance (p95/p99), and SLO dashboards

## FAQ

**Which metric types should I use for latency?**

Use histograms for latency to get bucketed distributions that can be aggregated across instances; summaries compute quantiles client-side but are less flexible for aggregation.

**How do I avoid metric cardinality explosions?**

Never use high-cardinality identifiers as labels; aggregate where possible, limit label values, and use recording rules to precompute heavy queries.
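
For example, keeping a cache-hit counter's labels bounded (a sketch; names are illustrative):

```typescript
import { Counter } from "prom-client";

// Cardinality is driven by label values, so keep them to small, fixed sets.
const cacheHits = new Counter({
  name: "cache_hits_total",
  help: "Cache hits",
  labelNames: ["cache"], // e.g. "session", "embedding": a handful of values
});

// Bad (left commented out): one time series per user.
// cacheHits.inc({ user_id: userId });

// Good: aggregate by cache name; recover per-user detail from logs or traces.
cacheHits.inc({ cache: "session" });
```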

**How should I tune alert noise?**

Start with conservative thresholds, group related alerts, add inhibition rules, and iterate using on-call feedback and error budgets.