
This skill helps you implement comprehensive observability for Speak operations by configuring metrics, traces, alerts, and dashboards.

```bash
npx playbooks add skill jeremylongshore/claude-code-plugins-plus-skills --skill speak-observability
```

---
name: speak-observability
description: |
  Set up comprehensive observability for Speak integrations with metrics, traces, and alerts.
  Use when implementing monitoring for Speak operations, setting up dashboards,
  or configuring alerting for language learning feature health.
  Trigger with phrases like "speak monitoring", "speak metrics",
  "speak observability", "monitor speak", "speak alerts", "speak tracing".
allowed-tools: Read, Write, Edit
version: 1.0.0
license: MIT
author: Jeremy Longshore <[email protected]>
---

# Speak Observability

## Overview
Set up comprehensive observability for Speak language learning integrations.

## Prerequisites
- Prometheus or compatible metrics backend
- OpenTelemetry SDK installed
- Grafana or similar dashboarding tool
- Alertmanager configured for alert routing
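
The tracing and logging examples below assume an initialized OpenTelemetry SDK. A minimal Node.js bootstrap might look like the following sketch, assuming `@opentelemetry/sdk-node` and the OTLP HTTP exporter are installed; the collector URL is a placeholder for your environment:

```typescript
// tracing.ts -- minimal bootstrap sketch; run once at process startup,
// before handling traffic. The exporter URL is a placeholder.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'speak-service', // becomes the service.name resource attribute
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces', // placeholder collector endpoint
  }),
});

sdk.start();
```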

## Key Metrics for Language Learning

### Business Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `speak_lessons_started_total` | Counter | Lessons initiated |
| `speak_lessons_completed_total` | Counter | Lessons completed |
| `speak_lessons_abandoned_total` | Counter | Lessons abandoned mid-session |
| `speak_pronunciation_score` | Histogram | Pronunciation scores distribution |
| `speak_active_sessions` | Gauge | Currently active lesson sessions |
| `speak_daily_active_learners` | Gauge | Unique learners today |

### Technical Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `speak_api_requests_total` | Counter | Total API requests |
| `speak_api_duration_seconds` | Histogram | Request latency |
| `speak_api_errors_total` | Counter | Error count by type |
| `speak_speech_recognition_duration_seconds` | Histogram | Audio processing time |
| `speak_rate_limit_remaining` | Gauge | Rate limit headroom |

## Prometheus Metrics Implementation

```typescript
import { Registry, Counter, Histogram, Gauge } from 'prom-client';

const registry = new Registry();

// Business metrics
const lessonsStarted = new Counter({
  name: 'speak_lessons_started_total',
  help: 'Total lessons started',
  labelNames: ['language', 'topic', 'difficulty'],
  registers: [registry],
});

const lessonsCompleted = new Counter({
  name: 'speak_lessons_completed_total',
  help: 'Total lessons completed',
  labelNames: ['language', 'topic', 'difficulty'],
  registers: [registry],
});

const lessonsAbandoned = new Counter({
  name: 'speak_lessons_abandoned_total',
  help: 'Total lessons abandoned mid-session',
  labelNames: ['language', 'topic', 'difficulty', 'abandonReason'],
  registers: [registry],
});

const pronunciationScore = new Histogram({
  name: 'speak_pronunciation_score',
  help: 'Pronunciation scores distribution',
  labelNames: ['language'],
  buckets: [40, 50, 60, 70, 80, 85, 90, 95, 100],
  registers: [registry],
});

const activeSessions = new Gauge({
  name: 'speak_active_sessions',
  help: 'Currently active lesson sessions',
  labelNames: ['language'],
  registers: [registry],
});

// Technical metrics
const apiRequests = new Counter({
  name: 'speak_api_requests_total',
  help: 'Total Speak API requests',
  labelNames: ['method', 'endpoint', 'status'],
  registers: [registry],
});

const apiErrors = new Counter({
  name: 'speak_api_errors_total',
  help: 'Speak API errors by type',
  labelNames: ['method', 'endpoint', 'error_type'],
  registers: [registry],
});

const apiDuration = new Histogram({
  name: 'speak_api_duration_seconds',
  help: 'Speak API request duration',
  labelNames: ['method', 'endpoint'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [registry],
});

const speechRecognitionDuration = new Histogram({
  name: 'speak_speech_recognition_duration_seconds',
  help: 'Speech recognition processing time',
  labelNames: ['language'],
  buckets: [0.5, 1, 2, 3, 5, 10],
  registers: [registry],
});
```

## Instrumented Speak Client

```typescript
async function instrumentedSpeakRequest<T>(
  method: string,
  endpoint: string,
  operation: () => Promise<T>
): Promise<T> {
  const timer = apiDuration.startTimer({ method, endpoint });

  try {
    const result = await operation();
    apiRequests.inc({ method, endpoint, status: 'success' });
    return result;
  } catch (error: any) {
    apiRequests.inc({ method, endpoint, status: 'error' });
    apiErrors.inc({
      method,
      endpoint,
      error_type: error?.code ?? error?.name ?? 'unknown',
    });
    throw error;
  } finally {
    timer();
  }
}

// Lesson tracking
async function trackLessonStart(session: LessonSession): Promise<void> {
  lessonsStarted.inc({
    language: session.language,
    topic: session.topic,
    difficulty: session.difficulty,
  });
  activeSessions.inc({ language: session.language });
}

async function trackLessonEnd(
  session: LessonSession,
  summary: SessionSummary
): Promise<void> {
  const labels = {
    language: session.language,
    topic: session.topic,
    difficulty: session.difficulty,
  };

  if (summary.completed) {
    lessonsCompleted.inc(labels);
  } else {
    lessonsAbandoned.inc({
      ...labels,
      abandonReason: summary.abandonReason || 'unknown',
    });
  }

  activeSessions.dec({ language: session.language });

  // Record pronunciation score (explicit null check so a legitimate score of 0 is recorded)
  if (summary.averagePronunciationScore != null) {
    pronunciationScore.observe(
      { language: session.language },
      summary.averagePronunciationScore
    );
  }
}
```
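
As a usage sketch, a lesson-creation call could be wrapped like this; `speakClient.createLesson` is a hypothetical client method standing in for whatever SDK call your integration actually makes:

```typescript
// Sketch: wrapping a real call so it is counted and timed automatically.
// `speakClient.createLesson` is hypothetical; substitute your own client method.
const lesson = await instrumentedSpeakRequest(
  'POST',
  '/v1/lessons',
  () => speakClient.createLesson({ language: 'es', topic: 'travel' })
);
```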

## Distributed Tracing

### OpenTelemetry Setup
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('speak-service');

async function tracedSpeakCall<T>(
  operationName: string,
  attributes: Record<string, string>,
  operation: () => Promise<T>
): Promise<T> {
  return tracer.startActiveSpan(`speak.${operationName}`, async (span) => {
    span.setAttributes(attributes);

    try {
      const result = await operation();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error: any) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

// Trace a complete lesson session
async function tracedLessonSession(
  userId: string,
  config: LessonConfig
): Promise<SessionSummary> {
  return tracedSpeakCall(
    'lesson.session',
    {
      'user.id': userId,
      'lesson.language': config.language,
      'lesson.topic': config.topic,
    },
    async () => {
      const session = await speakService.startSession(config);

      // Create child spans for each exchange
      while (!session.isComplete) {
        await tracedSpeakCall(
          'lesson.exchange',
          { 'session.id': session.id },
          async () => {
            const prompt = await session.getPrompt();
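            // getUserResponse() is app-specific, e.g. capturing the learner's audio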
            const response = await getUserResponse();
            return session.submitResponse(response);
          }
        );
      }

      return session.getSummary();
    }
  );
}
```

## Structured Logging

```typescript
import pino from 'pino';

const logger = pino({
  name: 'speak-service',
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
});

interface SpeakLogContext {
  service: 'speak';
  operation: string;
  sessionId?: string;
  userId?: string;
  language?: string;
  duration_ms?: number;
  pronunciationScore?: number;
  error?: any;
}

function logSpeakOperation(
  level: 'info' | 'warn' | 'error',
  operation: string,
  context: Partial<SpeakLogContext>
): void {
  const logContext: SpeakLogContext = {
    service: 'speak',
    operation,
    ...context,
  };

  logger[level](logContext, `Speak ${operation}`);
}

// Usage
logSpeakOperation('info', 'lesson.completed', {
  sessionId: session.id,
  userId: session.userId,
  language: session.language,
  duration_ms: summary.duration,
  pronunciationScore: summary.averagePronunciationScore,
});
```
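
To correlate these logs with the traces above, pino's `mixin` option can stamp the active trace and span IDs onto every line. A minimal sketch, assuming the OpenTelemetry SDK from the prerequisites is initialized:

```typescript
import pino from 'pino';
import { trace } from '@opentelemetry/api';

// Sketch: add trace_id/span_id to every log line for log-to-trace correlation.
const correlatedLogger = pino({
  name: 'speak-service',
  mixin() {
    const ctx = trace.getActiveSpan()?.spanContext();
    return ctx ? { trace_id: ctx.traceId, span_id: ctx.spanId } : {};
  },
});
```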

## Alert Configuration

### Prometheus Alerting Rules
```yaml
# speak_alerts.yaml
groups:
  - name: speak_alerts
    rules:
      # High error rate
      - alert: SpeakHighErrorRate
        expr: |
          sum(rate(speak_api_errors_total[5m])) /
          sum(rate(speak_api_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
          service: speak
        annotations:
          summary: "Speak API error rate > 5%"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # Speech recognition slow
      - alert: SpeakSpeechRecognitionSlow
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(speak_speech_recognition_duration_seconds_bucket[5m]))
          ) > 5
        for: 5m
        labels:
          severity: warning
          service: speak
        annotations:
          summary: "Speech recognition P95 latency > 5s"

      # Low lesson completion rate
      - alert: SpeakLowCompletionRate
        expr: |
          sum(rate(speak_lessons_completed_total[1h])) /
          sum(rate(speak_lessons_started_total[1h])) < 0.5
        for: 30m
        labels:
          severity: warning
          service: speak
        annotations:
          summary: "Lesson completion rate < 50%"

      # Speak API down
      - alert: SpeakAPIDown
        expr: up{job="speak"} == 0
        for: 1m
        labels:
          severity: critical
          service: speak
        annotations:
          summary: "Speak API is unreachable"

      # Pronunciation scores dropping
      - alert: SpeakPronunciationScoresDrop
        expr: |
          sum(rate(speak_pronunciation_score_sum[1h])) /
          sum(rate(speak_pronunciation_score_count[1h])) < 60
        for: 1h
        labels:
          severity: info
          service: speak
        annotations:
          summary: "Average pronunciation scores below 60%"
```
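
These rules fire in Prometheus; routing them to a receiver happens in Alertmanager. A minimal routing sketch, where the receiver names and webhook URL are placeholders:

```yaml
# alertmanager.yaml -- routing sketch; receiver names and URLs are placeholders
route:
  receiver: default
  routes:
    - matchers:
        - service = speak
        - severity = critical
      receiver: speak-oncall

receivers:
  - name: default
  - name: speak-oncall
    webhook_configs:
      - url: https://example.com/oncall-webhook
```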

## Grafana Dashboard

```json
{
  "title": "Speak Language Learning",
  "panels": [
    {
      "title": "Active Lesson Sessions",
      "type": "stat",
      "targets": [{
        "expr": "sum(speak_active_sessions)"
      }]
    },
    {
      "title": "Lessons Started vs Completed",
      "type": "graph",
      "targets": [
        { "expr": "rate(speak_lessons_started_total[5m])", "legendFormat": "Started" },
        { "expr": "rate(speak_lessons_completed_total[5m])", "legendFormat": "Completed" }
      ]
    },
    {
      "title": "Pronunciation Score Distribution",
      "type": "heatmap",
      "targets": [{
        "expr": "sum by (le) (rate(speak_pronunciation_score_bucket[5m]))"
      }]
    },
    {
      "title": "API Latency P50/P95/P99",
      "type": "graph",
      "targets": [
        { "expr": "histogram_quantile(0.5, rate(speak_api_duration_seconds_bucket[5m]))", "legendFormat": "P50" },
        { "expr": "histogram_quantile(0.95, rate(speak_api_duration_seconds_bucket[5m]))", "legendFormat": "P95" },
        { "expr": "histogram_quantile(0.99, rate(speak_api_duration_seconds_bucket[5m]))", "legendFormat": "P99" }
      ]
    },
    {
      "title": "Speech Recognition Duration",
      "type": "graph",
      "targets": [{
        "expr": "histogram_quantile(0.95, rate(speak_speech_recognition_duration_seconds_bucket[5m]))"
      }]
    },
    {
      "title": "Error Rate by Type",
      "type": "graph",
      "targets": [{
        "expr": "sum by (error_type) (rate(speak_api_errors_total[5m]))"
      }]
    }
  ]
}
```

## Output
- Business and technical metrics
- Distributed tracing configured
- Structured logging implemented
- Alert rules deployed
- Grafana dashboard ready

## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| Missing metrics | No instrumentation | Wrap client calls |
| Trace gaps | Missing propagation | Check context headers |
| Alert storms | Wrong thresholds | Tune alert rules |
| High cardinality | Too many labels | Reduce label values |
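
For the trace-gap row above: when Speak calls cross service boundaries, the W3C trace context must be injected into outgoing requests and extracted on the receiving side. A sketch using the OpenTelemetry propagation API; the downstream URL is a placeholder:

```typescript
import { context, propagation } from '@opentelemetry/api';

// Sketch: inject the current trace context into outgoing request headers
// so downstream services can continue the trace. The URL is a placeholder.
async function callDownstream(body: unknown): Promise<Response> {
  const headers: Record<string, string> = { 'content-type': 'application/json' };
  propagation.inject(context.active(), headers);
  return fetch('https://speak-worker.internal/process', {
    method: 'POST',
    headers,
    body: JSON.stringify(body),
  });
}
```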

## Examples

### Quick Metrics Endpoint
```typescript
import express from 'express';

const app = express();

app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', registry.contentType);
  res.send(await registry.metrics());
});
```
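
Prometheus then needs a scrape job pointing at this endpoint. The job name below matches the `up{job="speak"}` alert above; the target host and port are placeholders:

```yaml
# prometheus.yaml (fragment) -- target address is a placeholder
scrape_configs:
  - job_name: speak
    static_configs:
      - targets: ['speak-service:3000']
```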

## Resources
- [Prometheus Best Practices](https://prometheus.io/docs/practices/naming/)
- [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
- [Speak Observability Guide](https://developer.speak.com/docs/observability)

## Next Steps
For incident response, see `speak-incident-runbook`.

## Overview

This skill sets up comprehensive observability for Speak language-learning integrations, covering metrics, traces, logs, alerts, and dashboards. It provides ready-to-use Prometheus metrics, OpenTelemetry tracing patterns, structured logging conventions, Alertmanager rules, and Grafana dashboard panels. Use it to monitor feature health, surface regressions, and drive incident response for Speak features.

## How this skill works

The skill defines business and technical Prometheus metrics (counters, gauges, histograms) and exposes them via a metrics endpoint for scraping. It instruments requests and lesson flows with timers and labels, wraps operations in OpenTelemetry spans for distributed tracing, and emits structured JSON logs for correlation. Prebuilt alerting rules detect high error rates, latency spikes, low completion, and score drops, while a Grafana dashboard visualizes key signals.

## When to use it

- When deploying Speak integrations to production and you need observability coverage
- When setting up dashboards for product and engineering to track learning metrics
- When configuring alerting for service reliability or feature regressions
- When instrumenting API clients, speech recognition, or lesson lifecycle code
- When preparing runbooks and incident response for Speak services

## Best practices

- Limit label cardinality: use coarse values for user-facing labels to avoid high-cardinality series in Prometheus
- Record business metrics (lessons started/completed/abandoned) with meaningful labels (language, topic, difficulty)
- Instrument API calls with timers and status labels to calculate error rates and latency percentiles
- Create child spans for lesson exchanges to trace user flows end-to-end and propagate context
- Tune alert thresholds and `for:` durations to avoid alert storms; iterate on severity and runbooks

## Example use cases

- Track the lesson funnel (started → completed → abandoned) and alert if completion falls below 50%
- Surface speech recognition regressions by alerting on P95 recognition latency > 5s
- Correlate a high API error rate with recent deployments using traces and structured logs
- Power dashboards that show active sessions, pronunciation score distributions, and latency P50/P95/P99
- Automate incident notification from Alertmanager to on-call with relevant Grafana links and trace IDs

## FAQ

### What backends are required?

A Prometheus-compatible metrics backend, an OpenTelemetry collector for traces, a dashboarding tool such as Grafana, and Alertmanager for alert routing.

### How do I avoid high cardinality from labels?

Limit labels to stable dimensions (language, topic bucket, difficulty tier), avoid user IDs in metrics, and use logs/traces for per-user details.
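
As a concrete illustration, a hypothetical `topicBucket` helper can coarsen free-form topic strings before they become label values; the bucket names below are assumptions for the sketch:

```typescript
// Sketch: map free-form topic strings onto a fixed label vocabulary so the
// `topic` label stays low-cardinality. Bucket names are illustrative.
const TOPIC_BUCKETS = ['travel', 'business', 'daily-life', 'exam-prep'] as const;

function topicBucket(topic: string): string {
  const normalized = topic.toLowerCase();
  return TOPIC_BUCKETS.find((bucket) => normalized.includes(bucket)) ?? 'other';
}

// Usage: lessonsStarted.inc({ language, topic: topicBucket(rawTopic), difficulty });
```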