---
name: observability-stack-setup
description: Automated LGTM + Alloy observability stack deployment using Docker Compose. Use when setting up Claude Code observability infrastructure locally.
---

# Observability Stack Setup

Automated deployment of the complete LGTM (Loki, Grafana, Tempo, Mimir/Prometheus) + Alloy observability stack for Claude Code monitoring.

## When to Use

- Setting up Claude Code observability for the first time
- Deploying local development observability infrastructure
- Monitoring Claude Code operations (tool calls, costs, errors, performance)
- Using pre-configured dashboards for Claude Code analysis

## What This Skill Does

Automatically deploys and configures:
- **Grafana Alloy**: OTEL collector (receives telemetry from Claude Code)
- **Loki**: Log aggregation (stores all Claude Code logs)
- **Tempo**: Distributed tracing (tracks tool calls, API requests)
- **Prometheus**: Metrics storage (token usage, costs, performance)
- **Grafana**: Visualization with pre-built Claude Code dashboards

## Quick Start

### Prerequisites

```bash
# Verify Docker installed
docker --version  # Requires ≥ 20.10

# Verify Docker Compose installed
docker compose version  # Requires ≥ 2.0
```

### Deploy Stack

**Invoke this skill** and it will:
1. Create `.observability/` directory structure
2. Generate all configuration files
3. Start the stack with `docker compose up -d`
4. Import Claude Code dashboards
5. Verify that all services are healthy
6. Output access URLs and next steps

**Estimated time**: 5-10 minutes

## What Gets Deployed

### Services

| Service | Port | Purpose |
|---------|------|---------|
| Grafana | 3000 | Dashboards and visualization |
| Grafana Alloy | 4317 (gRPC), 4318 (HTTP), 12345 (metrics) | OTLP receiver |
| Loki | 3100 | Log storage and querying |
| Tempo | 3200 | Trace storage and querying |
| Prometheus | 9090 | Metrics storage and querying |

### Volumes

All data is persisted in `.observability/volumes/` (directory names as in the layout under Step 2):
- `alloy/` - Alloy configuration and state
- `loki/` - Log storage
- `tempo/` - Trace storage
- `prometheus/` - Metrics storage
- `grafana/` - Dashboards, datasources, settings

### Pre-built Dashboards

1. **Claude Code Overview**
   - Session count, duration, active time
   - Token usage and cost trends
   - Error rates by tool
   - Top operations

2. **Tool Performance Matrix**
   - Call counts per tool
   - Average/P95/P99 latency
   - Success/failure rates
   - Most common errors

3. **Cost Analysis**
   - Daily/weekly/monthly costs
   - Token usage breakdown
   - Budget tracking
   - Cost projections

4. **Error Tracking**
   - Error timeline
   - Error types distribution
   - Affected tools
   - Recent error details

5. **Session Analysis**
   - Session duration distribution
   - Sessions per day/week
   - Conversation depth
   - Active vs idle time

## Workflow

### Step 1: Verify Prerequisites

Checks that Docker and Docker Compose are installed with compatible versions.
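
The version comparison in this step can be sketched roughly as follows. This is an illustration rather than the contents of `scripts/setup-stack.sh`: `version_ge` and `check_min_version` are hypothetical helper names, and the sketch assumes a `sort` that supports `-V` (version sort).

```bash
# Succeeds when version $1 is greater than or equal to version $2.
# Assumes sort supports -V (GNU coreutils and recent BSD sort do).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: report whether an installed version meets a minimum.
check_min_version() {
  local name installed minimum
  name="$1"
  installed="$2"
  minimum="$3"
  if version_ge "$installed" "$minimum"; then
    echo "$name $installed OK (>= $minimum)"
  else
    echo "$name $installed is older than $minimum" >&2
    return 1
  fi
}

# Example usage (requires Docker, so it is left commented out here):
#   check_min_version "Docker" "$(docker version --format '{{.Client.Version}}')" 20.10
#   v="$(docker compose version --short)"
#   check_min_version "Docker Compose" "${v#v}" 2.0   # strip a leading "v" if present
```

The minimums (20.10 and 2.0) are the ones listed under Prerequisites.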

### Step 2: Create Directory Structure

```
.observability/
├── docker-compose.yml          # Main stack definition
├── alloy/
│   └── config.yaml            # OTLP receiver + exporters config
├── grafana/
│   ├── datasources/
│   │   ├── loki.yml           # Loki datasource
│   │   ├── prometheus.yml     # Prometheus datasource
│   │   └── tempo.yml          # Tempo datasource
│   └── dashboards/
│       ├── claude-code-overview.json
│       ├── tool-performance.json
│       ├── cost-analysis.json
│       ├── error-tracking.json
│       └── session-analysis.json
└── volumes/                   # Persistent data
    ├── alloy/
    ├── loki/
    ├── tempo/
    ├── prometheus/
    └── grafana/
```
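
Creating the directories in the layout above could look something like this. `create_layout` is a hypothetical function name; the configuration files themselves are generated in Step 3 and are not written here.

```bash
# Create the directory layout under a base directory
# (defaults to .observability in the current directory).
create_layout() {
  local base svc
  base="${1:-.observability}"
  mkdir -p "$base/alloy" \
           "$base/grafana/datasources" \
           "$base/grafana/dashboards"
  # One persistent-data directory per service.
  for svc in alloy loki tempo prometheus grafana; do
    mkdir -p "$base/volumes/$svc"
  done
}
```

Because `mkdir -p` is idempotent, running the function against an existing `.observability/` directory leaves existing files untouched.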

### Step 3: Generate Configurations

Creates all configuration files from templates (see `references/` for details).

### Step 4: Start Stack

```bash
docker compose -f .observability/docker-compose.yml up -d
```

### Step 5: Health Checks

Verifies each service:
- Alloy: `http://localhost:12345/metrics`
- Loki: `http://localhost:3100/ready`
- Tempo: `http://localhost:3200/ready`
- Prometheus: `http://localhost:9090/-/healthy`
- Grafana: `http://localhost:3000/api/health`
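
A minimal sketch of polling these endpoints with `curl` is shown below. It is not the contents of `scripts/verify-health.sh`; the function names, the 30 attempts, and the 2-second delay are assumptions chosen for illustration.

```bash
# name=url pairs, taken from the list above.
HEALTH_ENDPOINTS="
alloy=http://localhost:12345/metrics
loki=http://localhost:3100/ready
tempo=http://localhost:3200/ready
prometheus=http://localhost:9090/-/healthy
grafana=http://localhost:3000/api/health
"

# Poll one URL until curl gets a successful response or the attempts run out.
# Arguments: url [attempts] [delay-seconds]; the defaults are assumptions.
wait_healthy() {
  local url attempts delay i
  url="$1"
  attempts="${2:-30}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -fs -o /dev/null "$url"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Check every service and return non-zero if any of them never became healthy.
check_all() {
  local entry name url failed
  failed=0
  for entry in $HEALTH_ENDPOINTS; do
    name="${entry%%=*}"
    url="${entry#*=}"
    if wait_healthy "$url"; then
      echo "OK   $name"
    else
      echo "FAIL $name ($url)" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Calling `check_all` after `docker compose up -d` gives the services time to start, since Loki and Tempo in particular may report not-ready for a short while after their containers come up.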

### Step 6: Import Dashboards

Uses the Grafana API to import all pre-built dashboards.
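
One way to sketch this import loop is shown below. It is not the contents of `scripts/import-dashboards.sh`. Grafana's `POST /api/dashboards/db` endpoint expects the dashboard model wrapped in a `{"dashboard": ..., "overwrite": ...}` payload, so the sketch assumes each JSON file holds a bare dashboard model whose `id` is `null` or absent. `wrap_dashboard`, `import_dashboards`, `GRAFANA_URL`, and `GRAFANA_AUTH` are hypothetical names.

```bash
# Assumed defaults; override through the environment if they differ.
GRAFANA_URL="${GRAFANA_URL:-http://localhost:3000}"
GRAFANA_AUTH="${GRAFANA_AUTH:-admin:admin}"

# Wrap a bare dashboard model in the payload the import endpoint expects.
wrap_dashboard() {
  printf '{"dashboard": %s, "overwrite": true}\n' "$(cat "$1")"
}

# Import every *.json file in a directory, stopping at the first failure.
import_dashboards() {
  local dir file
  dir="${1:-.observability/grafana/dashboards}"
  for file in "$dir"/*.json; do
    [ -e "$file" ] || continue   # skip the literal pattern when nothing matches
    echo "Importing $file"
    wrap_dashboard "$file" |
      curl -fsS -o /dev/null -X POST "$GRAFANA_URL/api/dashboards/db" \
        -H "Content-Type: application/json" \
        -u "$GRAFANA_AUTH" \
        --data-binary @- || return 1
  done
}
```

If the dashboard files are already stored in the wrapped form, the wrapping step should be skipped and the file posted as-is.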

### Step 7: Output Success

Displays:
- Access URLs for all services
- Default credentials (admin/admin)
- OTLP endpoint for Claude Code configuration
- Next step: Enable Claude Code telemetry

## Configuration Details

### Grafana Alloy (OTLP Collector)

Receives telemetry from Claude Code via OTLP protocol:
- **gRPC endpoint**: `localhost:4317`
- **HTTP endpoint**: `localhost:4318`

Routes telemetry to backends:
- Logs → Loki
- Traces → Tempo
- Metrics → Prometheus
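
For orientation, pointing Claude Code at the gRPC endpoint typically comes down to a handful of environment variables. The sketch below uses the variable names from Claude Code's monitoring documentation and the standard OpenTelemetry exporter settings; the `claude-code-telemetry-enable` skill may set them differently, so treat the exact values as assumptions.

```bash
# Turn on Claude Code telemetry and export metrics and logs over OTLP.
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_LOGS_EXPORTER=otlp

# Send everything to Alloy's gRPC receiver on the port listed above.
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
```

To use the HTTP receiver instead, the protocol would be `http/protobuf` and the endpoint `http://localhost:4318`.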

### Retention Policies

**Default: 365 days** (configurable in docker-compose.yml)

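- **Loki**: 365 days. Loki retention is governed by `limits_config.retention_period` together with `retention_enabled: true` on the compactor; `-ingester.max-chunk-age` only limits how long a chunk stays in the ingester before it is flushed and does not delete old data.
- **Tempo**: 365 days. Tempo retention is governed by `compactor.compaction.block_retention`, not by the local storage path.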
- **Prometheus**: 365 days (`--storage.tsdb.retention.time=365d`)
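
When changing the retention period, note that not every backend accepts a day unit. Prometheus takes `365d` directly, while settings parsed as Go durations (as Tempo's retention setting is, to the best of our knowledge) have no day unit and need the value in hours. A small sketch of the conversion, with `days_to_hours` as a hypothetical helper:

```bash
# Convert a retention period in days to an hour-based duration string.
days_to_hours() {
  echo "$(( $1 * 24 ))h"
}

RETENTION_DAYS=365
echo "Prometheus flag: --storage.tsdb.retention.time=${RETENTION_DAYS}d"
echo "Hour-based form: $(days_to_hours "$RETENTION_DAYS")"   # 365 * 24 = 8760h
```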

### Privacy Settings

**Full logging enabled** (no redactions):
- User prompts: Full content logged
- File paths: Complete paths visible
- Tool execution: Full command details
- API requests: All parameters visible

This configuration assumes observability for personal use with full data access.

## Troubleshooting

### Port Already in Use

If ports 3000, 3100, 3200, 4317, 4318, 9090, or 12345 are in use:

**Option 1**: Stop conflicting services
```bash
# Find process using port
sudo lsof -i :3000
# Stop the process
sudo kill <PID>
```

**Option 2**: Modify ports in `docker-compose.yml`
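
To see which of the stack's ports are taken before choosing an option, a quick scan along these lines may help. It is a sketch that assumes `lsof` is installed; `port_in_use` and `report_ports` are hypothetical names. Without `sudo`, `lsof` only sees the current user's processes, so a port held by another user can show up as free.

```bash
# Ports used by the stack, as listed above.
STACK_PORTS="3000 3100 3200 4317 4318 9090 12345"

# Succeeds when a process is listening on the given local TCP port.
port_in_use() {
  lsof -nP -iTCP:"$1" -sTCP:LISTEN >/dev/null 2>&1
}

# Print one status line per stack port.
report_ports() {
  local port
  for port in $STACK_PORTS; do
    if port_in_use "$port"; then
      echo "port $port: in use"
    else
      echo "port $port: free"
    fi
  done
}
```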

### Services Not Starting

Check logs:
```bash
docker compose -f .observability/docker-compose.yml logs [service_name]
```

Common issues:
- Insufficient disk space (check with `df -h`)
- Insufficient memory (Alloy needs ~512MB, others ~256MB each)
- Permission issues on volume directories

### Dashboards Not Appearing

Manually import:
```bash
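# Import via the API, posting the file from the host (no copy into the
# container is needed). The endpoint expects the dashboard model wrapped in
# a {"dashboard": ..., "overwrite": ...} payload. This assumes the file holds
# a bare dashboard model; if it is already wrapped, post it with -d @<file>.
printf '{"dashboard": %s, "overwrite": true}' \
  "$(cat .observability/grafana/dashboards/claude-code-overview.json)" |
  curl -X POST http://localhost:3000/api/dashboards/db \
    -H "Content-Type: application/json" \
    -u admin:admin \
    --data-binary @-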
```

## Next Steps

After the stack is running:

1. **Enable Claude Code telemetry**: Use the `claude-code-telemetry-enable` skill
2. **Use Claude Code**: Run tools, read files, execute commands
3. **View dashboards**: Open http://localhost:3000 and explore the pre-built dashboards
4. **Verify data is flowing**: Check Grafana → Explore → Loki/Prometheus/Tempo

## Stopping the Stack

**Graceful shutdown** (preserves data):
```bash
docker compose -f .observability/docker-compose.yml down
```

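**Complete removal** (deletes data held in named volumes; `down -v` does not touch bind-mounted host directories, so if the stack bind-mounts `.observability/volumes/`, delete that directory as well):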
```bash
docker compose -f .observability/docker-compose.yml down -v
```

## References

- `references/docker-compose-full.yml` - Complete Docker Compose configuration
- `references/alloy-config.yaml` - Grafana Alloy OTLP receiver configuration
- `references/grafana-datasources/` - Datasource YAML configurations
- `references/dashboards/` - Pre-built dashboard JSON files
- `references/troubleshooting.md` - Common issues and solutions

## Scripts

- `scripts/setup-stack.sh` - Main setup script (automated deployment)
- `scripts/verify-health.sh` - Health check all services
- `scripts/import-dashboards.sh` - Import Grafana dashboards

## Version Information

**Component Versions** (pinned as of 2025-11-22; newer releases may be available):
- Grafana: 11.5.2
- Grafana Alloy: 1.5.0
- Loki: 3.4.2
- Tempo: 2.7.1
- Prometheus: 2.55.0

All versions pinned in docker-compose.yml for reproducibility.