---
name: golang-observability-opentelemetry
description: "Instrumenting Go applications with OpenTelemetry for distributed tracing, Prometheus for metrics, and structured logging with slog"
version: 1.0.0
category: toolchain
author: Claude MPM Team
license: MIT
progressive_disclosure:
  entry_point:
    summary: "Master Go observability through OpenTelemetry distributed tracing, Prometheus metrics, and slog structured logging for production-ready microservices"
    when_to_use: "Instrumenting microservices for observability, setting up distributed tracing across services, creating operational dashboards, debugging production issues, performance monitoring and optimization"
    quick_start: "1. Initialize OpenTelemetry tracer and exporter 2. Add Prometheus metrics endpoint 3. Configure slog for structured logging 4. Instrument HTTP middleware 5. Implement health checks"
  token_estimate:
    entry: 150
    full: 5000
context_limit: 700
tags:
  - observability
  - golang
  - opentelemetry
  - prometheus
  - tracing
  - metrics
  - logging
  - slog
requires_tools: []
---

# Go Observability with OpenTelemetry

## Overview

Modern Go applications require comprehensive observability through the three pillars: traces, metrics, and logs. OpenTelemetry provides vendor-neutral instrumentation for distributed tracing, Prometheus offers powerful metrics collection, and Go's slog package (1.21+) delivers structured logging with minimal overhead.

**Key Features:**
- 🔍 **OpenTelemetry**: Distributed tracing with context propagation
- 📊 **Prometheus**: Metrics collection with /metrics endpoint
- 📝 **Structured Logging**: slog with JSON formatting and correlation IDs
- 🎯 **Auto-Instrumentation**: HTTP/gRPC middleware patterns
- 💚 **Health Checks**: Kubernetes-ready readiness/liveness probes
- 🔄 **Graceful Shutdown**: Clean exporter shutdown and signal handling

## When to Use This Skill

Activate this skill when:
- Instrumenting microservices for production observability
- Setting up distributed tracing across service boundaries
- Creating operational dashboards with Prometheus/Grafana
- Debugging production performance issues or bottlenecks
- Implementing SLOs and monitoring SLIs
- Adding observability to existing Go applications
- Correlating logs, traces, and metrics for debugging

## Core Observability Principles

### The Three Pillars

1. **Traces**: Understand request flow across distributed systems
2. **Metrics**: Measure system behavior and performance over time
3. **Logs**: Record discrete events for debugging and audit

### Correlation Strategy

All three pillars must share common identifiers:
- **Trace ID**: Links all operations in a request
- **Span ID**: Identifies specific operation within trace
- **Request ID**: Correlates logs with traces and metrics

## OpenTelemetry Integration

### Installation

```bash
go get go.opentelemetry.io/otel
go get go.opentelemetry.io/otel/sdk
go get go.opentelemetry.io/otel/exporters/jaeger
go get go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
```

Note: recent otel-go releases deprecate the dedicated Jaeger exporter (Jaeger ingests OTLP natively). The examples below use the Jaeger exporter API for familiarity; new projects should prefer the OTLP exporters under `go.opentelemetry.io/otel/exporters/otlp/otlptrace`.

### Basic Setup

```go
package main

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func initTracer(serviceName string) (*sdktrace.TracerProvider, error) {
    // Create Jaeger exporter
    exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
        jaeger.WithEndpoint("http://localhost:14268/api/traces"),
    ))
    if err != nil {
        return nil, err
    }

    // Create resource with service name
    res, err := resource.Merge(
        resource.Default(),
        resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceName(serviceName),
            semconv.ServiceVersion("1.0.0"),
        ),
    )
    if err != nil {
        return nil, err
    }

    // Create tracer provider
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
        sdktrace.WithSampler(sdktrace.AlwaysSample()), // Use probability sampler in production
    )

    otel.SetTracerProvider(tp)
    return tp, nil
}

func main() {
    tp, err := initTracer("order-service")
    if err != nil {
        log.Fatal(err)
    }
    defer func() {
        if err := tp.Shutdown(context.Background()); err != nil {
            log.Printf("Error shutting down tracer: %v", err)
        }
    }()

    // Application code...
}
```

### Creating Spans

```go
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/trace"
)

func ProcessOrder(ctx context.Context, order Order) error {
    tracer := otel.Tracer("order-service")
    ctx, span := tracer.Start(ctx, "ProcessOrder")
    defer span.End()

    // Add attributes
    span.SetAttributes(
        attribute.String("order.id", order.ID),
        attribute.Int("order.items", len(order.Items)),
        attribute.Float64("order.total", order.Total),
    )

    // Validate order (creates child span)
    if err := validateOrder(ctx, order); err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, "validation failed")
        return err
    }

    // Fulfill order
    if err := fulfillOrder(ctx, order); err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, "fulfillment failed")
        return err
    }

    span.SetStatus(codes.Ok, "order processed successfully")
    return nil
}

func validateOrder(ctx context.Context, order Order) error {
    _, span := otel.Tracer("order-service").Start(ctx, "validateOrder")
    defer span.End()

    // Validation logic...
    return nil
}
```

### HTTP Middleware Instrumentation

```go
import (
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

func main() {
    // Wrap handler with automatic tracing
    handler := http.HandlerFunc(orderHandler)
    wrappedHandler := otelhttp.NewHandler(handler, "order-handler")

    http.Handle("/orders", wrappedHandler)
    http.ListenAndServe(":8080", nil)
}

// Manual instrumentation for more control
func orderHandler(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()
    tracer := otel.Tracer("order-service")

    ctx, span := tracer.Start(ctx, "orderHandler")
    defer span.End()

    // Extract order ID from request
    orderID := r.URL.Query().Get("id")
    span.SetAttributes(attribute.String("order.id", orderID))

    // Process order with propagated context
    order, err := fetchOrder(ctx, orderID)
    if err != nil {
        span.RecordError(err)
        http.Error(w, "Order not found", http.StatusNotFound)
        return
    }

    // ... handle response
}
```

## Prometheus Metrics

### Installation

```bash
go get github.com/prometheus/client_golang/prometheus
go get github.com/prometheus/client_golang/prometheus/promhttp
```

### Metric Types and Patterns

```go
package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Counter: Monotonically increasing value
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "path", "status"},
    )

    // Gauge: Value that can go up or down
    activeConnections = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "active_connections",
            Help: "Number of active connections",
        },
    )

    // Histogram: Observations bucketed by value
    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: prometheus.DefBuckets, // [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
        },
        []string{"method", "path"},
    )

    // Summary: Similar to histogram but calculates quantiles
    dbQueryDuration = promauto.NewSummaryVec(
        prometheus.SummaryOpts{
            Name:       "db_query_duration_seconds",
            Help:       "Database query duration",
            Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
        },
        []string{"query_type"},
    )
)
```

### Metrics Middleware

```go
import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Metrics middleware that instruments all HTTP handlers
func MetricsMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()

        // Track active connections
        activeConnections.Inc()
        defer activeConnections.Dec()

        // Wrap response writer to capture status code
        rw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}

        // Call next handler
        next.ServeHTTP(rw, r)

        // Record metrics
        duration := time.Since(start).Seconds()
        httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
        httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rw.statusCode)).Inc()
    })
}

type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

// Expose metrics endpoint
func main() {
    http.Handle("/metrics", promhttp.Handler())

    handler := MetricsMiddleware(http.HandlerFunc(orderHandler))
    http.Handle("/orders", handler)

    http.ListenAndServe(":8080", nil)
}
```

### Custom Metrics Example

```go
var paymentTransactionsTotal = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "payment_transactions_total",
        Help: "Total payment transactions by outcome",
    },
    []string{"status"},
)

func ProcessPayment(ctx context.Context, payment Payment) error {
    // Time the insert with the summary defined earlier
    timer := prometheus.NewTimer(dbQueryDuration.WithLabelValues("payment_insert"))
    defer timer.ObserveDuration()

    // Use a dedicated counter for payment outcomes rather than
    // reusing HTTP request metrics inside business logic
    if err := db.InsertPayment(ctx, payment); err != nil {
        paymentTransactionsTotal.WithLabelValues("failure").Inc()
        return err
    }

    paymentTransactionsTotal.WithLabelValues("success").Inc()
    return nil
}
```

## Structured Logging with slog

### Basic Setup (Go 1.21+)

```go
package main

import (
    "context"
    "log/slog"
    "os"
)

func initLogger() *slog.Logger {
    // JSON logger for production
    handler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
        Level: slog.LevelInfo,
        AddSource: true, // Include file:line information
    })

    logger := slog.New(handler)
    slog.SetDefault(logger) // Set as default logger
    return logger
}

func main() {
    logger := initLogger()

    logger.Info("service starting",
        "service", "order-service",
        "version", "1.0.0",
        "port", 8080,
    )
}
```

### Context-Aware Logging

```go
import (
    "context"
    "log/slog"
    "time"

    "go.opentelemetry.io/otel/trace"
)

// Add trace context to logger
func LoggerWithTrace(ctx context.Context) *slog.Logger {
    span := trace.SpanFromContext(ctx)
    spanCtx := span.SpanContext()

    return slog.With(
        "trace_id", spanCtx.TraceID().String(),
        "span_id", spanCtx.SpanID().String(),
    )
}

func HandleRequest(ctx context.Context, req Request) error {
    logger := LoggerWithTrace(ctx)

    logger.Info("processing request",
        "request_id", req.ID,
        "method", req.Method,
        "path", req.Path,
    )

    if err := processRequest(ctx, req); err != nil {
        logger.Error("request failed",
            "error", err,
            "duration_ms", time.Since(req.StartTime).Milliseconds(),
        )
        return err
    }

    logger.Info("request completed successfully",
        "duration_ms", time.Since(req.StartTime).Milliseconds(),
    )
    return nil
}
```

### Log Levels and Structured Fields

```go
func ProcessOrder(ctx context.Context, order Order) error {
    logger := LoggerWithTrace(ctx).With(
        "order_id", order.ID,
        "user_id", order.UserID,
    )

    logger.Debug("validating order", "items", len(order.Items))

    if len(order.Items) == 0 {
        logger.Warn("empty order received")
        return ErrEmptyOrder
    }

    logger.Info("order validation passed")

    if err := fulfillOrder(ctx, order); err != nil {
        logger.Error("fulfillment failed",
            "error", err,
            slog.Group("order_details",
                "total", order.Total,
                "items", len(order.Items),
            ),
        )
        return err
    }

    logger.Info("order processed successfully",
        "total", order.Total,
    )
    return nil
}
```

## Health Checks and Graceful Shutdown

### Health Check Endpoints

```go
import (
    "context"
    "database/sql"
    "encoding/json"
    "net/http"
    "time"
)

type HealthChecker struct {
    db *sql.DB
    // Add other dependencies
}

type HealthStatus struct {
    Status      string            `json:"status"`
    Version     string            `json:"version"`
    Checks      map[string]string `json:"checks"`
    Timestamp   time.Time         `json:"timestamp"`
}

// Liveness probe - is the app running?
func (hc *HealthChecker) LivenessHandler(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(http.StatusOK)
    json.NewEncoder(w).Encode(map[string]string{
        "status": "alive",
    })
}

// Readiness probe - is the app ready to serve traffic?
func (hc *HealthChecker) ReadinessHandler(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
    defer cancel()

    status := HealthStatus{
        Status:    "ready",
        Version:   "1.0.0",
        Checks:    make(map[string]string),
        Timestamp: time.Now(),
    }

    // Check database
    if err := hc.db.PingContext(ctx); err != nil {
        status.Status = "not_ready"
        status.Checks["database"] = "unhealthy: " + err.Error()
        w.WriteHeader(http.StatusServiceUnavailable)
    } else {
        status.Checks["database"] = "healthy"
    }

    // Add more dependency checks (Redis, external APIs, etc.)

    w.Header().Set("Content-Type", "application/json")
    if status.Status == "ready" {
        w.WriteHeader(http.StatusOK)
    }
    json.NewEncoder(w).Encode(status)
}
```

### Graceful Shutdown

```go
import (
    "context"
    "log"
    "log/slog"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    // Initialize tracer
    tp, err := initTracer("order-service")
    if err != nil {
        log.Fatal(err)
    }

    // Setup HTTP server
    server := &http.Server{
        Addr:    ":8080",
        Handler: setupRoutes(),
    }

    // Channel for shutdown signals
    shutdown := make(chan os.Signal, 1)
    signal.Notify(shutdown, os.Interrupt, syscall.SIGTERM)

    // Start server in goroutine
    go func() {
        slog.Info("server starting", "port", 8080)
        if err := server.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    // Wait for shutdown signal
    <-shutdown
    slog.Info("shutdown signal received")

    // Create shutdown context with timeout
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    // Shutdown HTTP server
    slog.Info("shutting down HTTP server")
    if err := server.Shutdown(ctx); err != nil {
        slog.Error("HTTP server shutdown error", "error", err)
    }

    // Shutdown tracer provider (flush spans)
    slog.Info("shutting down tracer")
    if err := tp.Shutdown(ctx); err != nil {
        slog.Error("tracer shutdown error", "error", err)
    }

    slog.Info("shutdown complete")
}
```

## Complete Instrumentation Example

```go
package main

import (
    "context"
    "database/sql"
    "encoding/json"
    "log/slog"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus/promhttp"
    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

type Server struct {
    db     *sql.DB
    logger *slog.Logger
}

func (s *Server) orderHandler(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()

    // Get tracer and create span
    tracer := otel.Tracer("order-service")
    ctx, span := tracer.Start(ctx, "orderHandler")
    defer span.End()

    // Create context-aware logger with trace ID
    logger := s.logger.With(
        "trace_id", span.SpanContext().TraceID().String(),
        "request_id", r.Header.Get("X-Request-ID"),
    )

    orderID := r.URL.Query().Get("id")
    span.SetAttributes(attribute.String("order.id", orderID))

    logger.Info("fetching order", "order_id", orderID)

    // Fetch order from database
    order, err := s.fetchOrder(ctx, orderID)
    if err != nil {
        span.RecordError(err)
        logger.Error("failed to fetch order", "error", err)
        http.Error(w, "Order not found", http.StatusNotFound)
        return
    }

    logger.Info("order fetched successfully",
        "order_id", orderID,
        "items", len(order.Items),
    )

    // Return order as JSON
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(order)
}

func (s *Server) fetchOrder(ctx context.Context, orderID string) (*Order, error) {
    ctx, span := otel.Tracer("order-service").Start(ctx, "fetchOrder")
    defer span.End()

    // Time the database query
    start := time.Now()

    var order Order
    err := s.db.QueryRowContext(ctx,
        "SELECT id, total FROM orders WHERE id = ?", orderID,
    ).Scan(&order.ID, &order.Total) // scan into individual fields, not the struct

    dbQueryDuration.WithLabelValues("select_order").Observe(time.Since(start).Seconds())

    if err != nil {
        span.RecordError(err)
        return nil, err
    }
    return &order, nil
}

func setupRoutes(s *Server, hc *HealthChecker) http.Handler {
    mux := http.NewServeMux()

    // Health endpoints (no tracing needed)
    mux.HandleFunc("/health", hc.LivenessHandler)
    mux.HandleFunc("/ready", hc.ReadinessHandler)
    mux.Handle("/metrics", promhttp.Handler())

    // Business endpoints (with tracing)
    orderHandler := http.HandlerFunc(s.orderHandler)
    mux.Handle("/orders", otelhttp.NewHandler(orderHandler, "orders"))

    // Wrap everything with metrics middleware
    return MetricsMiddleware(mux)
}
```

## Decision Trees

### When to Use OpenTelemetry

**Use OpenTelemetry When:**
- Building distributed systems with multiple services
- Need to trace requests across service boundaries
- Debugging performance issues in microservices
- Want vendor-neutral observability (switch backends easily)
- Require correlation between traces, metrics, and logs

**Don't Use OpenTelemetry When:**
- Building simple monolithic applications
- Performance overhead is critical (consider sampling)
- Team lacks observability infrastructure (Jaeger, Zipkin)

### When to Use Prometheus

**Use Prometheus When:**
- Need time-series metrics for monitoring and alerting
- Building operational dashboards (Grafana)
- Measuring SLIs for SLO compliance
- Tracking business metrics (requests/sec, conversion rates)
- Kubernetes/containerized environments

**Don't Use Prometheus When:**
- Need high-cardinality metrics (Prometheus has limits)
- Require long-term metric storage (use Thanos/Cortex)
- Need push-based metrics (Prometheus is pull-based)

### When to Use slog

**Use slog When:**
- Go 1.21+ projects (standard library, zero dependencies)
- Need structured logging with JSON output
- Want high-performance logging with minimal allocations
- Integrating with log aggregation systems (Loki, ELK)

**Don't Use slog When:**
- Go < 1.21 (use zap or zerolog instead)
- Need complex log routing or filtering (use zap)
- Require very specific features (audit trails, etc.)

### Sampling Strategy Decision

**Always Sample When:**
- Development/staging environments
- Total traffic < 100 requests/sec
- Debugging specific issues

**Probabilistic Sampling When:**
- Production with moderate traffic (100-10K req/sec)
- Sample rate: 1-10% typically

**Tail-Based Sampling When:**
- High traffic production (>10K req/sec)
- Only sample errors and slow requests
- Requires tail-sampling processor (OpenTelemetry Collector)
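For the probabilistic case, the sampler is a one-line change to the tracer provider setup shown earlier. This is a configuration sketch; `0.05` is an illustrative 5% rate, and `newSampledProvider` is a hypothetical helper name:

```go
import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newSampledProvider samples 5% of new root traces while honoring the
// parent span's decision, so a trace is never half-recorded across services.
func newSampledProvider(exporter sdktrace.SpanExporter) *sdktrace.TracerProvider {
	return sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.05))),
	)
}
```

`ParentBased` matters in distributed systems: without it, each service would make an independent sampling decision and traces would arrive with missing spans.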

## Anti-Patterns to Avoid

### ❌ Not Propagating Context

**WRONG: Breaking trace context**
```go
func processOrder(order Order) error {
    // Creates new context, loses trace!
    ctx := context.Background()
    return validateOrder(ctx, order)
}
```

**CORRECT: Propagate context through call chain**
```go
func processOrder(ctx context.Context, order Order) error {
    // Propagates trace context
    return validateOrder(ctx, order)
}
```

### ❌ Cardinality Explosion

**WRONG: Unbounded label values**
```go
// user_id can have millions of values!
httpRequests.WithLabelValues(r.Method, r.URL.Path, userID).Inc()
```

**CORRECT: Use bounded labels**
```go
// Only method and path (bounded values)
httpRequests.WithLabelValues(r.Method, r.URL.Path).Inc()
// Track user-specific metrics separately if needed
```

### ❌ Logging Sensitive Data

**WRONG: Exposing PII and secrets**
```go
logger.Info("user login",
    "email", user.Email,        // PII!
    "password", user.Password,  // CRITICAL!
    "token", authToken,         // SECRET!
)
```

**CORRECT: Redact sensitive information**
```go
logger.Info("user login",
    "user_id", user.ID,  // Safe identifier
    "method", "password",
)
```

### ❌ Not Closing Spans

**WRONG: Span leaks memory**
```go
func processOrder(ctx context.Context) error {
    _, span := tracer.Start(ctx, "processOrder")
    // Missing defer span.End()!

    if err := validate(); err != nil {
        return err  // Span never closed!
    }

    return nil
}
```

**CORRECT: Always defer span.End()**
```go
func processOrder(ctx context.Context) error {
    _, span := tracer.Start(ctx, "processOrder")
    defer span.End()  // Always runs

    if err := validate(); err != nil {
        span.RecordError(err)
        return err
    }

    return nil
}
```

### ❌ Synchronous Span Export

**WRONG: Blocking requests on span export**
```go
// Synchronous export blocks the request path on every span
exporter, _ := jaeger.New(jaeger.WithCollectorEndpoint(...))
tp := sdktrace.NewTracerProvider(
    sdktrace.WithSyncer(exporter),  // BAD: Synchronous!
)
```

**CORRECT: Use batching for async export**
```go
// Batching exports asynchronously
tp := sdktrace.NewTracerProvider(
    sdktrace.WithBatcher(exporter),  // GOOD: Async batching
)
```

### ❌ Missing Graceful Shutdown

**WRONG: Losing traces on shutdown**
```go
func main() {
    // Tracer provider is never shut down - buffered spans are lost!
    initTracer("service")
    http.ListenAndServe(":8080", nil)
}
```

**CORRECT: Shutdown exporters properly**
```go
func main() {
    tp, _ := initTracer("service")
    defer tp.Shutdown(context.Background())

    // Handle signals and graceful shutdown
    server.ListenAndServe()
}
```

## Best Practices

1. **Context Propagation**: Always pass `context.Context` through call chains
2. **Bounded Labels**: Keep metric label cardinality under 1000 combinations
3. **Sampling**: Use probabilistic sampling in high-traffic production
4. **Correlation IDs**: Include trace_id in logs for correlation
5. **Health Checks**: Implement both `/health` (liveness) and `/ready` (readiness)
6. **Graceful Shutdown**: Flush traces and metrics before exit
7. **Error Recording**: Use `span.RecordError()` for automatic error tracking
8. **Metric Naming**: Follow Prometheus naming conventions (`_total`, `_seconds`)
9. **Log Levels**: Use appropriate levels (Debug, Info, Warn, Error)
10. **Auto-Instrumentation**: Use middleware for HTTP/gRPC when possible

## Metric Naming Conventions

Follow Prometheus best practices:

**Counter Metrics** (always increasing):
- `http_requests_total` (not `http_requests`)
- `payment_transactions_total`
- `errors_total`

**Gauge Metrics** (can go up or down):
- `active_connections`
- `queue_size`
- `memory_usage_bytes`

**Histogram/Summary Metrics** (observations):
- `http_request_duration_seconds` (not `_milliseconds`)
- `db_query_duration_seconds`
- `response_size_bytes`

**Label Naming**:
- Use `method`, not `http_method`
- Use `status`, not `status_code` or `http_status`
- Use snake_case, not camelCase

## Resources

**Official Documentation:**
- OpenTelemetry Go: https://opentelemetry.io/docs/instrumentation/go/
- Prometheus Client Library: https://github.com/prometheus/client_golang
- Go slog Package: https://pkg.go.dev/log/slog

**Recent Guides (2025):**
- "Observability in Go: What Real Engineers Are Saying in 2025" (Quesma Blog)
- "Monitoring Go Apps with OpenTelemetry Metrics" (Better Stack, 2025)
- Prometheus Best Practices: https://prometheus.io/docs/practices/naming/

**Related Skills:**
- **golang-web-frameworks**: HTTP server patterns and middleware
- **golang-testing-strategies**: Testing instrumented code
- **verification-before-completion**: Validating observability setup

## Quick Reference

### Initialize OpenTelemetry
```go
tp, _ := initTracer("service-name")
defer tp.Shutdown(context.Background())
```

### Create Spans
```go
ctx, span := otel.Tracer("name").Start(ctx, "operation")
defer span.End()
span.SetAttributes(attribute.String("key", "value"))
```

### Define Metrics
```go
counter := promauto.NewCounterVec(opts, []string{"label"})
histogram := promauto.NewHistogramVec(opts, []string{"label"})
```

### Structured Logging
```go
logger := slog.With("trace_id", traceID)
logger.Info("message", "key", value)
```

### Health Checks
```go
http.HandleFunc("/health", livenessHandler)
http.HandleFunc("/ready", readinessHandler)
```

---

**Token Estimate**: ~5,000 tokens (entry point + full content)
**Version**: 1.0.0
**Last Updated**: 2025-12-03
