
observability-stack-configurator skill

/skills/observability-stack-configurator

This skill helps you configure a complete observability stack with metrics, logs, tracing, and alerting across platforms for reliable visibility.

npx playbooks add skill williamzujkowski/cognitive-toolworks --skill observability-stack-configurator

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: "Observability Stack Configurator"
slug: observability-stack-configurator
description: "Configure comprehensive observability with metrics, logging, tracing, and alerting using Prometheus, OpenTelemetry, CloudWatch, and Grafana."
capabilities:
  - configure_metrics_collection
  - configure_logging_aggregation
  - configure_distributed_tracing
  - configure_alerting_rules
inputs:
  platform:
    type: string
    description: "Platform: kubernetes, aws, azure, gcp, on-premise"
    required: true
  tech_stack:
    type: array
    description: "Application technologies to instrument"
    required: true
  requirements:
    type: object
    description: "SLIs, alerting rules, retention policies, dashboard specifications"
    required: true
outputs:
  metrics_config:
    type: code
    description: "Prometheus, CloudWatch, or Datadog configuration"
  logging_config:
    type: code
    description: "Logging stack configuration (ELK, Loki, CloudWatch Logs)"
  tracing_config:
    type: code
    description: "OpenTelemetry or X-Ray instrumentation"
  dashboards:
    type: array
    description: "Grafana or CloudWatch dashboard definitions"
keywords:
  - observability
  - prometheus
  - opentelemetry
  - grafana
  - cloudwatch
  - logging
  - tracing
  - metrics
  - alerting
  - monitoring
version: 1.0.0
owner: william@cognitive-toolworks
license: MIT
security:
  pii: false
  secrets: false
  sandbox: required
links:
  - https://prometheus.io/docs/
  - https://opentelemetry.io/docs/
  - https://grafana.com/docs/
  - https://aws.amazon.com/cloudwatch/
---

## Purpose & When-To-Use

**Trigger conditions:**

- Production incidents reveal lack of visibility into system behavior
- Application deployment without monitoring or alerting
- Troubleshooting requires distributed tracing across microservices
- SLO/SLA commitments require metrics and alerting
- Compliance or audit requires centralized logging
- Performance optimization needs detailed metrics

**Use this skill when** you need a complete observability stack with metrics collection, log aggregation, distributed tracing, and intelligent alerting.

---

## Pre-Checks

**Before execution, verify:**

1. **Time normalization**: `NOW_ET = 2025-10-26T01:33:56-04:00` (NIST/time.gov semantics, America/New_York)
2. **Input schema validation** (an example payload follows this list):
   - `platform` is one of: `kubernetes`, `aws`, `azure`, `gcp`, `on-premise`
   - `tech_stack` contains instrumentable technologies
   - `requirements.slis` defines key service level indicators
   - `requirements.alerting_rules` specifies conditions and thresholds
   - `requirements.retention_policies` defines data retention periods
3. **Source freshness**: All cited sources accessed on `NOW_ET`; verify documentation links current
4. **Platform compatibility**: Confirm observability tools available on target platform
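
A hypothetical payload illustrating the expected input shape; only `platform`, `tech_stack`, and `requirements` are defined in the frontmatter above, and the nested keys shown here are illustrative assumptions:

```yaml
# Hypothetical input payload; keys under `requirements` are illustrative
platform: kubernetes
tech_stack: [nodejs, postgresql, redis]
requirements:
  slis:
    - name: availability
      target: "99.9%"
    - name: latency_p95
      target: "300ms"
  alerting_rules:
    - name: high-error-rate
      condition: "5xx ratio > 5% for 5m"
      severity: critical
  retention_policies:
    metrics: 15d
    logs: 30d
    traces: 7d
```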

**Abort conditions:**

- Platform doesn't support required observability tools
- Tech stack cannot be instrumented (e.g., closed-source software with no metrics endpoint)
- Conflicting requirements (e.g., "zero cost" with "15-second granularity metrics")
- Retention requirements violate regulatory constraints

---

## Procedure

### Tier 1 (Fast Path, ≤2k tokens)

**Token budget**: ≤2k tokens

**Scope**: Basic observability with essential metrics, logs, and simple alerting.

**Steps:**

1. **Design observability architecture** (500 tokens):
   - Select observability stack based on platform:
     - **Kubernetes**: Prometheus + Grafana + Loki
     - **AWS**: CloudWatch Metrics + Logs + X-Ray
     - **Azure**: Azure Monitor + Application Insights
     - **GCP**: Cloud Monitoring + Cloud Logging + Cloud Trace
   - Identify instrumentation points in application code
   - Define essential metrics (RED: Rate, Errors, Duration; USE: Utilization, Saturation, Errors)

2. **Generate observability configurations** (1500 tokens):
   - **Metrics**:
     - Prometheus scrape configs or CloudWatch metric filters
     - Application instrumentation snippets (client libraries)
     - Essential metrics: request rate, error rate, latency percentiles (p50, p95, p99)
   - **Logging**:
     - Log aggregation configuration (Loki, CloudWatch Logs, ELK)
     - Structured logging format (JSON)
     - Log retention policies (7-30 days for development)
   - **Basic alerting**:
     - Critical alerts: service down, error rate >5%, latency >1s
     - Alert routing configuration (email, Slack, PagerDuty); see the routing sketch after this list
   - **Simple dashboards**:
     - Service health overview (uptime, request rate, error rate, latency)
     - Infrastructure metrics (CPU, memory, disk, network)

**Decision point**: If requirements include distributed tracing, SLO tracking, advanced analytics, or multi-cluster → escalate to T2.

---

### Tier 2 (Extended Analysis, ≤6k tokens)

**Token budget**: ≤6k tokens

**Scope**: Comprehensive observability with distributed tracing, SLO tracking, advanced alerting, and correlation.

**Steps:**

1. **Design comprehensive observability** (2000 tokens):
   - **Distributed tracing** (accessed 2025-10-26T01:33:56-04:00):
     - **OpenTelemetry**: Language-agnostic instrumentation for metrics, logs, traces
     - Trace context propagation across service boundaries (W3C Trace Context)
     - Sampling strategies (head-based, tail-based) for cost optimization
     - Integration with Jaeger, Zipkin, or cloud-native solutions (X-Ray, Cloud Trace)
   - **SLO tracking**:
     - Define SLIs from requirements (availability, latency, error rate)
     - Calculate SLO compliance and error budgets
     - Configure SLO dashboards with burn-rate alerts (see the burn-rate sketch after these steps)
   - **Advanced metrics**:
     - Business metrics (conversion rate, transaction volume)
     - Application performance monitoring (APM) with detailed breakdowns
     - Custom metrics for domain-specific monitoring
   - **Log correlation**:
     - Trace ID injection into logs for correlation
     - Structured logging with consistent fields
     - Log-based metrics for pattern detection

2. **Generate comprehensive configurations** (4000 tokens):
   - **Prometheus/CloudWatch advanced**:
     - Recording rules for precomputed aggregations
     - Federation for multi-cluster metrics
     - Long-term storage (Thanos, Cortex, or cloud-native)
     - Service discovery for dynamic targets
   - **OpenTelemetry instrumentation**:
     - Auto-instrumentation for common frameworks
     - Custom spans for business-critical operations
     - Baggage propagation for cross-service context
     - Collector configuration with processors and exporters (a collector sketch follows these steps)
   - **Advanced alerting**:
     - Multi-condition alerts with logical operators
     - Anomaly detection for dynamic thresholds
     - Alert grouping and deduplication
     - Escalation policies and on-call schedules
     - Runbook links in alert descriptions
   - **Comprehensive dashboards**:
     - Service dependency maps
     - SLO compliance tracking
     - Cost attribution and optimization
     - Capacity planning metrics
   - **Log analytics**:
     - Full-text search and filtering
     - Log-based alerting
     - Anomaly detection in logs
     - Compliance audit trails
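
A burn-rate alerting sketch for the SLO tracking step, assuming a 99.9% availability SLO measured from a hypothetical `http_requests_total` counter; the 14.4x factor and window pairing follow the multiwindow pattern in the Google SRE material cited below:

```yaml
# slo-burn-rate.yml (sketch); assumes a 99.9% availability SLO
groups:
  - name: slo-availability
    rules:
      # Precompute the error ratio over short and long windows
      - record: service:error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
      - record: service:error_ratio:rate1h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
      # Fast burn: both windows exceed 14.4x the 0.1% error budget
      - alert: SLOFastBurn
        expr: |
          service:error_ratio:rate5m > (14.4 * 0.001)
            and service:error_ratio:rate1h > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning too fast for the 99.9% SLO"
```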
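
A minimal OpenTelemetry Collector sketch for the tracing pipeline, assuming a Collector build that includes the `probabilistic_sampler` processor (the contrib distribution does) and a Jaeger backend accepting OTLP at a placeholder address:

```yaml
# otel-collector.yaml (sketch); backend endpoint is a placeholder
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch: {}
  probabilistic_sampler:
    sampling_percentage: 10   # head-based sampling at 10%
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector.observability:4317   # placeholder backend
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, batch]
      exporters: [otlp/jaeger]
```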

**Sources cited** (accessed 2025-10-26T01:33:56-04:00):

- **Prometheus Best Practices**: https://prometheus.io/docs/practices/
- **OpenTelemetry**: https://opentelemetry.io/docs/concepts/
- **Grafana Dashboards**: https://grafana.com/docs/grafana/latest/dashboards/
- **Google SRE Monitoring**: https://sre.google/sre-book/monitoring-distributed-systems/

---

### Tier 3 (Deep Dive, ≤12k tokens)

**Token budget**: ≤12k tokens

**Scope**: Enterprise observability with AI/ML insights, cost optimization, and security monitoring.

**Steps:**

1. **AI/ML-enhanced observability** (4000 tokens):
   - Anomaly detection with machine learning models
   - Predictive alerting based on historical patterns
   - Root cause analysis automation
   - Capacity forecasting with time-series prediction
   - Automated incident triage and correlation

2. **Advanced analytics and optimization** (4000 tokens):
   - Observability data lake for long-term analysis
   - Cost optimization through sampling and aggregation strategies
   - Multi-tenancy with namespace isolation
   - Cardinality management for high-dimensional metrics (see the relabel sketch after these steps)
   - Query optimization and performance tuning
   - Data retention tiering (hot/warm/cold storage)

3. **Security and compliance monitoring** (4000 tokens):
   - Security event logging and SIEM integration
   - Audit trail generation for compliance (SOC2, HIPAA, PCI-DSS)
   - Sensitive data masking in logs
   - Access control and authentication for observability tools
   - Encryption at rest and in transit for telemetry data
   - Compliance reporting and evidence collection
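
A cardinality-control sketch for step 2, using Prometheus `metric_relabel_configs`; the job, metric, and label names are illustrative:

```yaml
# Prometheus scrape job fragment (sketch); names are illustrative
scrape_configs:
  - job_name: 'api-service'
    metric_relabel_configs:
      # Drop an unneeded high-volume metric family entirely
      - source_labels: [__name__]
        regex: 'go_gc_duration_seconds.*'
        action: drop
      # Strip a high-cardinality label (e.g., per-request IDs) from all series
      - regex: 'request_id'
        action: labeldrop
```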

**Additional sources** (accessed 2025-10-26T01:33:56-04:00):

- **OpenTelemetry Collector**: https://opentelemetry.io/docs/collector/
- **Thanos**: https://thanos.io/tip/thanos/quick-tutorial.md
- **AWS Observability Best Practices**: https://aws-observability.github.io/observability-best-practices/

---

## Decision Rules

**Observability stack selection:**

- **Prometheus + Grafana**: Open-source, Kubernetes-native, vendor-neutral
- **CloudWatch**: AWS-native, tight integration, managed service
- **Datadog/New Relic**: Comprehensive SaaS, fast setup, higher cost
- **Elastic Stack (ELK)**: Powerful log analytics, full-text search
- **OpenTelemetry**: Vendor-agnostic instrumentation, future-proof

**Metric collection strategy:**

- **Pull-based (Prometheus)**: Good for dynamic environments, service discovery
- **Push-based (CloudWatch)**: Good for ephemeral workloads (Lambda, batch jobs)
- **Hybrid**: Use both based on workload characteristics

**Retention policies** (a configuration sketch follows this list):

- **Metrics**: 15 days high-resolution, 90 days aggregated, 1 year downsampled
- **Logs**: 7-30 days searchable, longer for compliance (1-7 years)
- **Traces**: 7-14 days with sampling (1-10% of traces)
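
A sketch of how those tiers might map to configuration, assuming Prometheus local storage, a Thanos Compactor for downsampled long-term retention, and Loki for logs; exact values should come from `requirements.retention_policies`:

```yaml
# Retention tiering sketch (values mirror the list above)
# Prometheus (high-resolution, short-term):
#   --storage.tsdb.retention.time=15d
# Thanos Compactor (downsampled long-term retention):
#   --retention.resolution-raw=90d
#   --retention.resolution-5m=180d
#   --retention.resolution-1h=1y
# Loki (searchable log retention):
limits_config:
  retention_period: 720h   # ~30 days searchable
compactor:
  retention_enabled: true
```
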

**Escalation conditions:**

- Novel platform without established observability patterns
- Requirements exceed T3 scope (custom data pipeline, ML model training)
- Compliance requirements need specialized tools (SIEM, DLP)

**Abort conditions:**

- Platform restrictions prevent telemetry export
- Conflicting requirements (e.g., "no network egress" with "SaaS monitoring")
- Cost constraints incompatible with retention/granularity requirements

---

## Output Contract

**Required outputs:**

```json
{
  "metrics_config": {
    "type": "object",
    "properties": {
      "platform": "string (prometheus|cloudwatch|datadog)",
      "scrape_configs": "string (YAML configuration)",
      "recording_rules": "string (optional aggregation rules)",
      "retention": "string (duration)"
    }
  },
  "logging_config": {
    "type": "object",
    "properties": {
      "platform": "string (loki|cloudwatch-logs|elasticsearch)",
      "aggregation_config": "string (configuration)",
      "retention_policy": "string (duration or storage class)",
      "structured_format": "string (JSON schema)"
    }
  },
  "tracing_config": {
    "type": "object",
    "properties": {
      "platform": "string (opentelemetry|jaeger|x-ray)",
      "instrumentation": "string (language-specific code)",
      "sampling_rate": "number (0.0 to 1.0)",
      "exporter_config": "string (backend configuration)"
    }
  },
  "dashboards": {
    "type": "array",
    "items": {
      "name": "string",
      "platform": "string (grafana|cloudwatch)",
      "definition": "string (JSON or YAML)"
    }
  },
  "alerting_rules": {
    "type": "array",
    "items": {
      "name": "string",
      "condition": "string (PromQL or equivalent)",
      "severity": "string (critical|warning|info)",
      "notification_channel": "string"
    }
  }
}
```

**Quality guarantees:**

- Metrics cover RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors) methods
- Logs are structured with consistent fields (timestamp, level, message, trace_id)
- Traces propagate context across service boundaries
- Alerting rules avoid false positives with appropriate thresholds
- Dashboards provide actionable insights (not vanity metrics)

---

## Examples

**Example: Prometheus scrape config with Kubernetes service discovery and alerting rule**

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - alert-rules.yml

scrape_configs:
  - job_name: 'api-service'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['production']
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: api
        action: keep
---
# alert-rules.yml (alerting rules live in a separate rules file loaded via
# rule_files above, not under a top-level key in prometheus.yml)
groups:
  - name: api-service-alerts
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses over 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
```
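
**Example: CloudWatch alarm via CloudFormation (AWS)**

A minimal sketch for the AWS path; the namespace, metric, dimension, and SNS topic shown here are illustrative placeholders rather than outputs of this skill.

```yaml
# CloudFormation sketch; metric, dimension, and topic names are illustrative
Resources:
  ApiHighErrorRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: "5xx responses above threshold for 5 minutes"
      Namespace: AWS/ApplicationELB
      MetricName: HTTPCode_Target_5XX_Count
      Dimensions:
        - Name: LoadBalancer
          Value: app/my-alb/REPLACE_ME   # placeholder load balancer ID
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 50
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions:
        - !Ref AlertsTopic
  AlertsTopic:
    Type: AWS::SNS::Topic
```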

---

## Quality Gates

**Token budgets:**

- **T1**: ≤2k tokens (basic metrics, logs, alerting)
- **T2**: ≤6k tokens (distributed tracing, SLO tracking, advanced alerting)
- **T3**: ≤12k tokens (AI/ML insights, security monitoring, compliance)

**Safety checks:**

- No sensitive data (PII, credentials) in logs or metrics
- Encryption configured for telemetry data in transit and at rest
- Access controls on observability dashboards and data
- Cost controls to prevent runaway metric cardinality

**Auditability:**

- All configuration changes version-controlled
- Alert history retained for incident retrospectives
- Compliance logs immutable and tamper-evident

**Determinism:**

- Same inputs produce identical observability configurations
- Alerting thresholds based on data-driven baselines
- Dashboard definitions reproducible from code

---

## Resources

**Official Documentation** (accessed 2025-10-26T01:33:56-04:00):

- Prometheus: https://prometheus.io/docs/
- OpenTelemetry: https://opentelemetry.io/docs/
- Grafana: https://grafana.com/docs/
- AWS CloudWatch: https://docs.aws.amazon.com/cloudwatch/

**Best Practices** (accessed 2025-10-26T01:33:56-04:00):

- Google SRE Book - Monitoring: https://sre.google/sre-book/monitoring-distributed-systems/
- RED Method: https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/
- USE Method: https://www.brendangregg.com/usemethod.html

**Templates** (in repository `/resources/`):

- Prometheus configurations for common platforms
- OpenTelemetry instrumentation examples
- Grafana dashboard templates
- CloudWatch alarm and dashboard definitions

Overview

This skill configures a complete observability stack across metrics, logs, tracing, dashboards, and alerting using Prometheus, OpenTelemetry, CloudWatch, and Grafana. It generates platform-specific agent and collector configurations plus instrumentation snippets to deploy consistent telemetry across Kubernetes, cloud providers, and on-premise environments. The goal is actionable visibility that supports SLOs, incident response, and long-term analysis.

How this skill works

The skill validates inputs (platform, tech stack, SLIs, alerting rules, retention policies) and selects an appropriate stack: Prometheus/Grafana for Kubernetes, CloudWatch/X-Ray for AWS, or OpenTelemetry-based pipelines for multi-cloud and vendor-neutral setups. It produces metrics scrape configs, logging aggregation settings, tracing instrumentation snippets, dashboard definitions, and alerting rules scoped to the selected tier and token budget (T1/T2/T3). It also runs decision and abort checks for platform compatibility, instrumentation feasibility, and conflicting requirements.

When to use it

  • Deploying an application without monitoring or alerting in production
  • Troubleshooting microservices that require distributed tracing and correlation
  • Meeting SLO/SLA commitments with SLI/SLO tracking and burn-rate alerts
  • Centralizing logs for compliance, audit, or forensic purposes
  • Optimizing performance and capacity planning with long-term metrics

Best practices

  • Instrument RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors) metrics as defaults
  • Use structured JSON logs and inject trace_id for correlation
  • Prefer OpenTelemetry for vendor-agnostic instrumentation and collector pipelines
  • Apply appropriate sampling and retention: high-resolution short-term, downsampled long-term
  • Avoid high-cardinality metric labels; use recording rules and long-term storage solutions

Example use cases

  • Kubernetes cluster: Prometheus scrape config + Grafana dashboards + Loki for logs
  • AWS service: CloudWatch metrics/logs + X-Ray traces + alerting to PagerDuty
  • Multi-service app: OpenTelemetry auto-instrumentation with Jaeger for traces and Thanos for long-term metrics at scale
  • SLO program: derive SLIs from instrumented metrics and configure burn-rate alerts
  • Compliance: centralized log retention (immutable audit trail) and SIEM integration

FAQ

What inputs are required to generate configurations?

Provide platform, tech_stack, requirements.slis, requirements.alerting_rules, and requirements.retention_policies to validate feasibility and produce tailored configs.

How does the skill choose Prometheus vs. CloudWatch?

Selection is based on platform compatibility, operational model, and constraints: Kubernetes favors Prometheus/Grafana; AWS favors CloudWatch; multi-cloud or vendor-neutral use OpenTelemetry.

When should I escalate from basic to extended analysis?

Escalate when you need distributed tracing, SLO tracking, advanced alerting, multi-cluster federation, or long-term aggregated storage.