
This skill helps implement OpenTelemetry instrumentation for LLM apps to enable distributed tracing, with config, exporters, and semantic conventions.

npx playbooks add skill a5c-ai/babysitter --skill opentelemetry-llm

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
1.3 KB
---
name: opentelemetry-llm
description: OpenTelemetry instrumentation for LLM applications with distributed tracing
allowed-tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
---

# OpenTelemetry LLM Skill

## Capabilities

- Configure OpenTelemetry SDK for LLM apps
- Implement LLM-specific instrumentation
- Set up trace exporters (Jaeger, OTLP)
- Apply LLM semantic conventions (gen_ai.*)
- Configure span attributes for AI workloads
- Implement context propagation

## Target Processes

- llm-observability-monitoring
- agent-deployment-pipeline

## Implementation Details

### Core Components

1. **TracerProvider**: SDK configuration
2. **SpanProcessor**: Batch/simple processors
3. **Exporters**: Jaeger, OTLP, Console
4. **Instrumentation**: Auto and manual

### LLM Semantic Conventions

- gen_ai.system (e.g. `openai`, `anthropic`)
- gen_ai.request.model
- gen_ai.request.max_tokens
- gen_ai.response.finish_reason
- gen_ai.usage.prompt_tokens

### Configuration Options

- Exporter selection
- Sampling strategies
- Resource attributes
- Span limits
- Context propagation

### Best Practices

- Consistent attribute naming
- Appropriate sampling
- Record errors and exceptions on spans
- Propagate context across services

### Dependencies

- opentelemetry-sdk
- opentelemetry-exporter-*
- openinference (optional)

Overview

This skill provides OpenTelemetry instrumentation tailored for large language model (LLM) applications to enable distributed tracing and observability. It configures the SDK, supplies LLM-specific semantic conventions, and wires exporters like Jaeger or OTLP for collecting traces. The goal is clear, consistent traces across agent orchestration and LLM workloads to improve debugging, performance tuning, and reliability.

How this skill works

The skill sets up a TracerProvider, span processors, and selectable exporters, then instruments LLM call paths with both auto and manual spans. It injects LLM-centric attributes (model, tokens, finish reason, system) into spans and supports context propagation across services. Sampling, span limits, and resource attributes are configurable so traces remain high-value and cost-aware.

When to use it

  • You need end-to-end traces for LLM requests across microservices or agent pipelines.
  • When diagnosing latency or token-cost hotspots in production LLM systems.
  • To monitor reliability and error patterns in agent orchestration or resumable workflows.
  • Before deploying new LLM models to compare performance and usage metrics.
  • When you want standardized LLM semantic attributes for downstream observability tools.

Best practices

  • Use consistent attribute naming following the provided gen_ai.* conventions.
  • Choose sampling and span limits that balance observability with export costs.
  • Attach model, prompt/response token counts, and finish reason to relevant spans.
  • Propagate context across async boundaries and external agent calls.
  • Export to a scalable backend (OTLP or Jaeger) and validate traces in staging first.

Example use cases

  • Add tracing to an agent deployment pipeline to surface bottlenecks during orchestration.
  • Instrument Claude or other LLM calls to correlate token usage with latency and errors.
  • Aggregate traces across a babysitter-style orchestrator to debug resumable workflows.
  • Use traces to compare different sampling strategies and their effect on observability cost.
  • Attach LLM-specific spans to alert on abnormal finish_reason values or token spikes.

FAQ

Which exporters are supported?

Jaeger, OTLP, and Console exporters are supported; choose OTLP for production backends (recent Jaeger versions ingest OTLP natively, so OTLP also covers Jaeger deployments).

Do I need manual instrumentation?

Auto-instrumentation helps, but manual spans are recommended for high-level operations like orchestration steps and prompt construction.