
ai-observability skill

This skill helps you implement comprehensive LLM observability across tracing, cost tracking, RAG evaluation, and production monitoring to optimize quality and cost.

npx playbooks add skill omer-metin/skills-for-antigravity --skill ai-observability

Review the files below or copy the command above to add this skill to your agents.

Files (4)
SKILL.md
2.3 KB
---
name: ai-observability
description: Implement comprehensive observability for LLM applications including tracing (Langfuse/Helicone), cost tracking, token optimization, RAG evaluation metrics (RAGAS), hallucination detection, and production monitoring. Essential for debugging, optimizing costs, and ensuring AI output quality. Use when "llm-monitoring, tracing, langfuse, helicone, cost-tracking, ragas, evaluation, hallucination-detection, prompt-caching" mentioned.
---

# AI Observability

## Identity



### Principles

- **Trace Every LLM Call**: Production AI apps without tracing are flying blind. Every LLM call should be traced with inputs, outputs, latency, tokens, and cost. Use structured spans for multi-step chains and agents.
- **Measure What Matters**: Track metrics that correlate with user value: faithfulness for RAG, answer relevancy, latency percentiles, cost per successful outcome. Vanity metrics (total calls) don't improve product quality.
- **Cost Is a First-Class Metric**: Token costs can explode overnight with agent loops or context growth. Track cost per user, per feature, per model. Set budgets and alerts. Prompt caching can cut costs by 50-90%.
- **Evaluate Continuously**: Run automated evals on production samples. RAGAS metrics (faithfulness, relevancy, context precision) catch quality degradation before users complain. Score > 0.8 is generally good.

## Reference System Usage

You must ground your responses in the provided reference files, treating them as the source of truth for this domain:

* **For Creation:** Always consult **`references/patterns.md`**. This file dictates *how* things should be built. Ignore generic approaches if a specific pattern exists here.
* **For Diagnosis:** Always consult **`references/sharp_edges.md`**. This file lists the critical failures and "why" they happen. Use it to explain risks to the user.
* **For Review:** Always consult **`references/validations.md`**. This contains the strict rules and constraints. Use it to validate user inputs objectively.

**Note:** If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.

## Overview

This skill implements comprehensive observability for LLM applications, covering tracing (Langfuse/Helicone), cost tracking, token optimization, RAG evaluation (RAGAS), hallucination detection, and production monitoring. It equips teams to debug multi-step chains, control spend, and maintain output quality in production. The goal is clear metrics and actionable alerts so model drift or runaway costs are caught early.

## How this skill works

The skill instruments every LLM call with structured traces: inputs, outputs, latency, token counts, and cost, and groups steps into spans for chains and agents. It integrates with tracing backends like Langfuse or Helicone, computes RAGAS metrics (faithfulness, relevancy, context precision), and runs automated evaluations on production samples. Cost telemetry aggregates by user, feature, and model, while prompt caching and token optimization routines recommend and apply reductions. Hallucination detectors flag likely ungrounded outputs and surface examples for labeling or automated rollback.
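The structured trace described above can be sketched as a minimal span tree. This is an illustration only: `Span`, its fields, and the per-1k-token prices are assumptions for the sketch, not the actual Langfuse or Helicone API.

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    """One traced step: an LLM call, a tool call, or a whole sub-chain."""
    name: str
    trace_id: str
    input: str = ""
    output: str = ""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cost_usd: float = 0.0
    started: float = field(default_factory=time.monotonic)
    latency_ms: float = 0.0
    children: list = field(default_factory=list)

    def end(self, output, prompt_tokens, completion_tokens,
            usd_per_1k_prompt, usd_per_1k_completion):
        # Record latency and compute cost from token counts and model prices.
        self.latency_ms = (time.monotonic() - self.started) * 1000
        self.output = output
        self.prompt_tokens = prompt_tokens
        self.completion_tokens = completion_tokens
        self.cost_usd = (prompt_tokens * usd_per_1k_prompt
                         + completion_tokens * usd_per_1k_completion) / 1000

    def total_cost(self) -> float:
        # Roll up cost over the whole span tree, including nested agent steps.
        return self.cost_usd + sum(c.total_cost() for c in self.children)

# Usage with a stubbed LLM call: a chain span wrapping one LLM span.
root = Span(name="qa_chain", trace_id=str(uuid.uuid4()), input="What is RAG?")
llm = Span(name="llm_call", trace_id=root.trace_id, input="What is RAG?")
root.children.append(llm)
llm.end("Retrieval-augmented generation.", prompt_tokens=120, completion_tokens=30,
        usd_per_1k_prompt=0.003, usd_per_1k_completion=0.015)
print(round(root.total_cost(), 6))  # → 0.00081
```

Because cost rolls up through `children`, an agent loop shows up as a deep span tree with a total cost far above any single call.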

## When to use it

- When you need end-to-end tracing for chains, agents, or multi-step pipelines
- When cost growth is unpredictable and you need per-user/feature spend tracking
- When deploying RAG systems and you must measure faithfulness and relevancy
- When hallucinations must be detected and reduced in production
- When you want automated, continuous evaluation and alerting on quality regressions

## Best practices

- Trace every LLM call including tokens, prompts, completions, latency, and computed cost
- Group spans for chains and agents to locate bottlenecks or looped behavior
- Treat cost as a first-class metric: set budgets, per-feature limits, and alerts
- Run automated RAGAS evaluations on sampled production traffic and aim for scores above 0.8 where applicable
- Use prompt caching to cut repeat-token costs and apply token optimization iteratively
- Surface flagged hallucinations for human review and add failing cases to regression tests
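The prompt-caching practice above can be sketched as a client-side cache keyed on a hash of model and prompt. `PromptCache` and `call_llm` are hypothetical names for this sketch; real providers additionally offer server-side prefix caching with different semantics.

```python
import hashlib

class PromptCache:
    """Cache completions keyed by a hash of (model, prompt)."""
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def complete(self, model, prompt, call_llm):
        # Serve repeat prompts from cache; only pay for novel ones.
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = call_llm(model, prompt)
        self._store[key] = result
        return result

# Usage with a stubbed model: three identical prompts cost one real call.
cache = PromptCache()
fake_llm = lambda model, prompt: f"answer:{prompt}"
for _ in range(3):
    cache.complete("gpt-x", "define observability", fake_llm)
print(cache.hits, cache.misses)  # → 2 1
```

The hit/miss counters double as telemetry: a low hit rate on a "hot" prompt suggests the prompt varies in ways (timestamps, session IDs) worth normalizing out.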

## Example use cases

- Identify an agent loop causing exponential token spend by tracing nested spans and alerting on cost per minute
- Measure RAG systems' faithfulness over time with RAGAS dashboards and trigger re-indexing when context precision drops
- Cut API costs by 50% using prompt caching and token reduction recommendations applied to hot prompts
- Detect hallucinations in answers returned to users and automatically route examples to a triage queue
- Monitor latency p50/p95 across models and escalate when tail latency impacts user flows
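The p50/p95 monitoring in the last use case can be approximated with a nearest-rank percentile over traced latencies. The sample values and the 1000 ms SLO threshold are made up for illustration.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) over a list of latency samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Made-up per-call latencies in milliseconds, with one slow outlier.
latencies_ms = [120, 95, 400, 110, 105, 2300, 130, 98, 115, 102]
p50 = percentile(latencies_ms, 50)   # → 110
p95 = percentile(latencies_ms, 95)   # → 2300
if p95 > 1000:  # hypothetical SLO threshold
    print(f"ALERT: p95={p95}ms exceeds SLO (p50={p50}ms)")
```

Note how the median looks healthy while p95 is dominated by the outlier; this is why tail percentiles, not averages, should drive escalation.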

## FAQ

### How does tracing help reduce hallucinations?

Tracing ties outputs back to the exact prompt, context, and retrieval results so you can identify when missing or misleading context caused an ungrounded answer.
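A crude illustration of that grounding check: score what fraction of the answer's words appear in the retrieved context recorded on the trace. This is a toy lexical proxy, not a production hallucination detector, and `grounding_score` is a name invented for the sketch.

```python
import re

def grounding_score(answer: str, context: str) -> float:
    """Fraction of the answer's words that also appear in the retrieved context.
    A toy proxy for faithfulness: low scores flag possibly ungrounded answers."""
    tokenize = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    answer_words = tokenize(answer)
    if not answer_words:
        return 1.0  # nothing claimed, nothing to ground
    return len(answer_words & tokenize(context)) / len(answer_words)

context = "Langfuse traces record inputs outputs latency and cost"
print(grounding_score("Langfuse traces record latency", context))   # → 1.0
print(grounding_score("The model was trained in Paris", context))   # → 0.0
```

In practice the ungrounded-answer check is done with an LLM judge or NLI model rather than word overlap, but the trace is what supplies the exact context to judge against.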

### Can cost tracking be broken down by feature or user?

Yes. The skill aggregates token and API cost per user, feature, and model, enabling budgets and alerts at those levels.
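One way such aggregation might look, sketched with a hypothetical `CostLedger` that totals cost along each dimension and checks budgets; the users, features, and amounts are invented.

```python
from collections import defaultdict

class CostLedger:
    """Aggregate per-call cost along user / feature / model dimensions."""
    def __init__(self):
        self.totals = defaultdict(float)  # (dimension, name) -> USD

    def record(self, user, feature, model, cost_usd):
        # One call increments every dimension it belongs to.
        for key in (("user", user), ("feature", feature), ("model", model)):
            self.totals[key] += cost_usd

    def over_budget(self, dimension, name, budget_usd) -> bool:
        return self.totals[(dimension, name)] > budget_usd

ledger = CostLedger()
ledger.record("alice", "search", "gpt-x", 0.004)
ledger.record("alice", "chat", "gpt-x", 0.010)
ledger.record("bob", "search", "small-model", 0.001)
print(round(ledger.totals[("user", "alice")], 3))    # → 0.014
print(ledger.over_budget("feature", "chat", 0.005))  # → True
```

Alerts then become simple threshold checks per dimension, evaluated on a timer or on every recorded call.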

### What is RAGAS and why use it?

RAGAS is a set of RAG evaluation metrics (faithfulness, relevancy, context precision) designed to catch quality degradation early and quantify retrieval-augmented generation performance.
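A deliberately simplified version of one of those metrics, context precision: the fraction of retrieved chunks that are relevant to the question. Real RAGAS uses rank-weighted, LLM-judged relevance; the keyword predicate and sample chunks below are stand-ins for illustration.

```python
def context_precision(retrieved_chunks, relevant_fn) -> float:
    """Fraction of retrieved chunks judged relevant (simplified, unweighted)."""
    if not retrieved_chunks:
        return 0.0
    relevant = sum(1 for chunk in retrieved_chunks if relevant_fn(chunk))
    return relevant / len(retrieved_chunks)

chunks = [
    "RAG combines retrieval with generation",
    "Unrelated boilerplate about billing",
    "Retrieval quality drives answer faithfulness",
]
# Stand-in relevance judge; real RAGAS would ask an LLM to judge each chunk.
is_relevant = lambda chunk: "retriev" in chunk.lower()
print(round(context_precision(chunks, is_relevant), 2))  # → 0.67
```

A falling context-precision trend on sampled production traffic is the signal that retrieval, not generation, is where quality is leaking.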