review-observability skill

This skill performs observability- and debugging-focused code review to improve logging, metrics, tracing, and production troubleshooting capabilities.

npx playbooks add skill shotaiuchi/dotclaude --skill review-observability

SKILL.md

---
name: review-observability
description: >-
  Observability and debugging-focused code review. Apply when reviewing
  logging, monitoring, metrics, tracing, alerting, structured logs,
  debug information, and production troubleshooting capability.
user-invocable: false
---

# Observability Review

Review code from an observability and debugging perspective.

## Review Checklist

### Logging
- Verify important operations are logged at appropriate levels
- Check log messages include sufficient context (IDs, parameters)
- Ensure structured logging format is used consistently
- Verify sensitive data is not logged (passwords, tokens, PII)
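
A minimal sketch of what the logging items above might look like in practice, using only Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, and the `context` attribute passed through `extra=` are hypothetical names chosen for illustration, not part of this skill. Only non-sensitive identifiers are passed as context.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with its context fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # `context` is a hypothetical attribute supplied by callers via `extra=`.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that emits structured JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


logger = build_logger("orders")
# IDs travel as separate fields, so they stay queryable in a log backend.
logger.info("order created", extra={"context": {"order_id": "o-123", "customer_id": "c-9"}})
```

Keeping identifiers as separate fields rather than interpolating them into the message is one way to keep the format consistent across call sites.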

### Log Levels
- DEBUG: Detailed flow information for development
- INFO: Key business events and state transitions
- WARN: Recoverable issues that need attention
- ERROR: Failures requiring investigation
- Verify log level usage matches the above semantics
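
The level semantics above can be sketched with a single function that touches all four levels. This is an illustrative example only: `charge`, its parameters, and the `gateway` callable are hypothetical, and Python's `logging` module spells WARN as `warning`.

```python
import logging

logger = logging.getLogger("payments")


def charge(order_id: str, amount_cents: int, attempts_left: int, gateway) -> bool:
    """Charge an order through a hypothetical `gateway` callable."""
    # DEBUG: detailed flow information, useful mainly during development.
    logger.debug("charge requested order_id=%s amount_cents=%d", order_id, amount_cents)
    try:
        gateway(order_id, amount_cents)
    except TimeoutError:
        if attempts_left > 0:
            # WARN: recoverable, because the caller can still retry.
            logger.warning(
                "gateway timeout, retry possible order_id=%s attempts_left=%d",
                order_id,
                attempts_left,
            )
            return False
        # ERROR: retries exhausted, so this failure needs investigation.
        logger.error(
            "gateway timeout, retries exhausted order_id=%s", order_id, exc_info=True
        )
        return False
    # INFO: a key business event (payment captured).
    logger.info("payment captured order_id=%s amount_cents=%d", order_id, amount_cents)
    return True
```

A reviewer could use the same reasoning in reverse: a retryable timeout logged at ERROR, or an exhausted retry logged at WARN, would not match the semantics listed above.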

### Metrics & Monitoring
- Check key service-level metrics are tracked (request count, latency)
- Verify error rates are monitored with thresholds
- Ensure resource usage is tracked (memory, connections, queue depth)
- Check custom metrics for domain-specific health indicators
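
As a rough illustration of the metrics items above, the sketch below records request counts, error counts, and latencies in memory. The `Metrics` class and the threshold value are hypothetical; a real service would more likely export these through whatever metrics library it already uses.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """In-memory stand-in for a metrics client (illustration only)."""

    def __init__(self) -> None:
        self.counters = defaultdict(int)
        self.latencies_ms = defaultdict(list)

    @contextmanager
    def track(self, operation: str):
        """Count the request, count any error, and always record latency."""
        start = time.perf_counter()
        self.counters[f"{operation}.requests"] += 1
        try:
            yield
        except Exception:
            self.counters[f"{operation}.errors"] += 1
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.latencies_ms[operation].append(elapsed_ms)

    def error_rate(self, operation: str) -> float:
        requests = self.counters[f"{operation}.requests"]
        return self.counters[f"{operation}.errors"] / requests if requests else 0.0


metrics = Metrics()
with metrics.track("checkout"):
    pass  # the instrumented operation would run here

ERROR_RATE_THRESHOLD = 0.05  # hypothetical alerting threshold
should_alert = metrics.error_rate("checkout") > ERROR_RATE_THRESHOLD
```

Recording latency in `finally` means failed requests are measured too, which is often where the slow paths are.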

### Distributed Tracing
- Verify trace/correlation IDs propagate across service boundaries
- Check spans are created for significant operations
- Ensure trace context is included in log entries
- Verify async operations maintain trace context
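
One possible way to satisfy the tracing items above without a tracing library is a `contextvars.ContextVar`, sketched below. The `X-Trace-Id` header name and all function names are assumptions made for this example; a system that uses a tracing SDK would normally rely on that SDK's own propagation instead.

```python
import asyncio
import contextvars
import logging
import uuid

# Holds the trace ID for the current request; "-" means no trace is active.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


def outgoing_headers() -> dict:
    """Headers to forward to downstream services (header name is an assumption)."""
    return {"X-Trace-Id": trace_id_var.get()}


async def background_job(seen: list) -> None:
    # Runs in its own task, yet still sees the trace ID set by the request.
    seen.append(trace_id_var.get())


async def handle_request(incoming_headers: dict) -> list:
    # Reuse the caller's trace ID, or start a new trace at the edge.
    trace_id_var.set(incoming_headers.get("X-Trace-Id") or uuid.uuid4().hex)
    seen = []
    # create_task copies the current context, so the ID crosses the async boundary.
    await asyncio.create_task(background_job(seen))
    seen.append(outgoing_headers()["X-Trace-Id"])
    return seen


print(asyncio.run(handle_request({"X-Trace-Id": "abc123"})))
```

Adding `TraceIdFilter` to a handler and including `%(trace_id)s` in its format string would put the ID in every log line. Work handed to thread pools or external queues does not inherit this context automatically and needs the ID passed explicitly.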

### Alerting & Diagnostics
- Check health check endpoints exist and are meaningful
- Verify error conditions trigger appropriate alerts
- Ensure diagnostic information is accessible in production
- Check feature flags and configuration are observable
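
To make "meaningful" concrete for the health check item above, here is a small sketch of a report that probes each dependency rather than returning a constant. The `health_report` function and the dependency names are hypothetical, and wiring the result to an HTTP endpoint is left to whatever framework the service uses.

```python
from typing import Callable, Dict


def health_report(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each dependency probe and summarize the overall status."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failing"
        except Exception as exc:
            # Report the error type so operators can see what broke,
            # without leaking exception messages that may hold secrets.
            results[name] = f"error: {type(exc).__name__}"
    status = "ok" if all(value == "ok" for value in results.values()) else "degraded"
    return {"status": status, "checks": results}


def cache_probe() -> bool:
    raise ConnectionError("cache unreachable")  # simulated outage


report = health_report({"database": lambda: True, "cache": cache_probe})
print(report)
```

A per-dependency breakdown also gives alerts something actionable to include, since the failing component is named in the response.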

### Debugging Support
- Verify error messages help identify root cause
- Check stack traces are preserved through error handling
- Ensure request/response payloads are loggable at debug level, with sensitive fields redacted
- Verify state transitions are traceable in logs
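
The stack trace item above is easy to violate when errors are wrapped. A minimal sketch, with hypothetical names (`ReservationError`, `reserve`, `parse_quantity`), of wrapping an exception while keeping the original cause attached:

```python
import logging

logger = logging.getLogger("inventory")


class ReservationError(Exception):
    """Domain-level error raised when a reservation cannot be made."""


def parse_quantity(raw: str) -> int:
    return int(raw)  # raises ValueError on bad input


def reserve(item_id: str, raw_quantity: str) -> int:
    try:
        return parse_quantity(raw_quantity)
    except ValueError as exc:
        # `from exc` keeps the original exception and its traceback as __cause__,
        # and the message adds the context needed to find the root cause.
        raise ReservationError(
            f"invalid quantity {raw_quantity!r} for item_id={item_id}"
        ) from exc


try:
    reserve("sku-1", "abc")
except ReservationError:
    # logger.exception logs at ERROR and appends the full chained traceback.
    logger.exception("reservation failed item_id=%s", "sku-1")
```

By contrast, raising a new exception with `from None`, or logging only `str(exc)`, would discard the frames that point at `parse_quantity`.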

## Output Format

| Priority | Description |
|----------|-------------|
| Critical | No visibility into failure path, blind spot in production |
| High | Key operation not observable, hard to troubleshoot |
| Medium | Logging exists but lacks context or structure |
| Low | Enhancement for faster debugging |

Overview

This skill performs observability and debugging-focused code reviews. It inspects logging, metrics, tracing, alerting, and runtime diagnostics to surface visibility gaps that impede production troubleshooting. Use it to ensure teams can detect, diagnose, and respond to incidents quickly.

How this skill works

The skill evaluates source changes against an observability checklist: log presence and levels, structured logging, metric collection, trace propagation, alerting triggers, health endpoints, and diagnostic accessibility. It flags missing or risky practices and categorizes findings by priority (Critical, High, Medium, Low) to guide remediation. It provides concrete examples and remediation hints when possible.

When to use it

- When reviewing changes that touch logging, error handling, or instrumentation
- Before merging services with new network, async, or distributed behaviors
- When adding or changing metrics, alerts, or health checks
- When introducing feature flags, config changes, or sensitive data paths
- During incident postmortems to validate observability gaps are addressed

Best practices

- Log business-significant events at INFO, detailed flow at DEBUG, recoverable issues at WARN, and failures that need investigation at ERROR
- Always emit structured logs (JSON or key=value) with correlation IDs and key context fields
- Avoid logging sensitive data; redact or transform PII, tokens, and passwords
- Export key business and system metrics (counts, latencies, error rates, resource usage) and attach thresholds for alerting
- Propagate trace/correlation IDs across service boundaries and include them in logs for combined trace-log analysis
- Provide meaningful health endpoints and attach actionable alerts that include runbook links or next steps

Example use cases

- Review a pull request that replaces sync calls with async background jobs to ensure trace context and error logging survive async boundaries
- Audit a service that recently added third-party auth to confirm tokens are not logged and authentication failures are observable
- Evaluate new metrics and alerts to verify thresholds, cardinality, and tagging are appropriate for dashboards and SLOs
- Assess a refactor of error handling to ensure stack traces and root-cause context are preserved
- Validate a rollout that introduces feature flags to confirm flag state and impact metrics are emitted

FAQ

What constitutes a Critical finding?

A Critical finding means the change removes or omits visibility into a failure path (no logs, no metrics, missing health checks) that would leave production blind during incidents.

How should I handle sensitive data in logs?

Never log raw sensitive fields. Mask, hash, or omit PII and secrets. Use allowlists for safe fields and document redaction rules in code and config.
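
The allowlist approach described above could be sketched as follows. The field names, the `redact` helper, and the key handling are all hypothetical; in particular, the hashing key is assumed to come from a secret store rather than source code.

```python
import hashlib
import hmac

SAFE_FIELDS = {"order_id", "status", "amount_cents"}  # hypothetical allowlist
HASHED_FIELDS = {"email", "user_id"}  # logged only as keyed hashes


def redact(fields: dict, key: bytes) -> dict:
    """Keep allowlisted fields, hash correlatable ones, omit everything else."""
    safe = {}
    for name, value in fields.items():
        if name in SAFE_FIELDS:
            safe[name] = value
        elif name in HASHED_FIELDS:
            # A keyed hash lets logs be correlated without exposing the raw value.
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
            safe[name] = f"hmac:{digest[:16]}"
        # Any field not listed above (passwords, tokens, ...) is dropped.
    return safe


demo_key = b"demo-only-key"  # assumption: real keys are loaded from a secret store
print(redact({"order_id": "o-1", "email": "a@example.com", "password": "hunter2"}, demo_key))
```

Because unknown fields are dropped by default, a newly added sensitive field stays out of the logs until someone deliberately allowlists it.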