---
name: langfuse-observability
description: LLM observability platform for tracing, evaluation, prompt management, and cost tracking. Use when setting up Langfuse, monitoring LLM costs, tracking token usage, or implementing prompt versioning.
context: fork
agent: metrics-architect
version: 1.0.0
author: OrchestKit AI Agent Hub
tags: [langfuse, llm, observability, tracing, evaluation, prompts, 2026]
user-invocable: false
---
# Langfuse Observability
## Overview
**Langfuse** is the open-source LLM observability platform that OrchestKit uses for tracing, monitoring, evaluation, and prompt management. Unlike LangSmith (which OrchestKit has deprecated in favor of Langfuse), it is self-hosted, free, and designed for production LLM applications.
**When to use this skill:**
- Setting up LLM observability from scratch
- Debugging slow or incorrect LLM responses
- Tracking token usage and costs
- Managing prompts in production
- Evaluating LLM output quality
- Migrating from LangSmith to Langfuse
**OrchestKit Integration:**
- **Status**: Migrated from LangSmith (Dec 2025)
- **Location**: `backend/app/shared/services/langfuse/`
- **MCP Server**: `orchestkit-langfuse` (optional)
---
## Quick Start
### Setup
```python
# backend/app/shared/services/langfuse/client.py
from langfuse import Langfuse
from app.core.config import settings

langfuse_client = Langfuse(
    public_key=settings.LANGFUSE_PUBLIC_KEY,
    secret_key=settings.LANGFUSE_SECRET_KEY,
    host=settings.LANGFUSE_HOST,  # self-hosted or cloud URL
)
```
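The SDK sends events to the server from a background queue, so short-lived processes (scripts, serverless handlers) should flush before exiting or trailing events can be lost. A minimal sketch, reusing the `langfuse_client` defined above:
```python
import atexit

# flush() blocks until the background event queue has been delivered,
# so register it to run on interpreter shutdown.
atexit.register(langfuse_client.flush)
```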
### Basic Tracing with @observe
```python
from langfuse.decorators import observe, langfuse_context

@observe()  # automatic tracing
async def analyze_content(content: str) -> str:
    langfuse_context.update_current_observation(
        metadata={"content_length": len(content)}
    )
    return await llm.generate(content)
```
### Session & User Tracking
```python
trace = langfuse_client.trace(
    name="analysis",
    user_id="user_123",
    session_id="session_abc",
    metadata={"content_type": "article", "agent_count": 8},
    tags=["production", "orchestkit"],
)
```
---
## Core Features Summary
| Feature | Description | Reference |
|---------|-------------|-----------|
| Distributed Tracing | Track LLM calls with parent-child spans | `references/tracing-setup.md` |
| Cost Tracking | Automatic token & cost calculation | `references/cost-tracking.md` |
| Prompt Management | Version control for prompts | `references/prompt-management.md` |
| LLM Evaluation | Custom scoring with G-Eval | `references/evaluation-scores.md` |
| Session Tracking | Group related traces | `references/session-tracking.md` |
| Experiments API | A/B testing & benchmarks | `references/experiments-api.md` |
| Multi-Judge Eval | Ensemble LLM evaluation | `references/multi-judge-evaluation.md` |
---
## References
### Tracing Setup
**See: `references/tracing-setup.md`**
Key topics covered:
- Initializing Langfuse client with @observe decorator
- Creating nested traces and spans
- Tracking LLM generations with metadata
- LangChain/LangGraph CallbackHandler integration
- Workflow integration patterns
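For LangChain/LangGraph, the SDK ships a `CallbackHandler` that traces chain runs without manual spans. A minimal sketch of that pattern; the model choice is illustrative:
```python
from langchain_openai import ChatOpenAI
from langfuse.callback import CallbackHandler

# With no arguments, the handler reads LANGFUSE_PUBLIC_KEY,
# LANGFUSE_SECRET_KEY, and LANGFUSE_HOST from the environment.
handler = CallbackHandler()

llm = ChatOpenAI(model="gpt-4o-mini")

# Pass the handler per call; every step inside the runnable becomes
# a nested observation on a single Langfuse trace.
response = llm.invoke(
    "Summarize this article in one sentence.",
    config={"callbacks": [handler]},
)
```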
### Cost Tracking
**See: `references/cost-tracking.md`**
Key topics covered:
- Automatic cost calculation from token usage
- Custom model pricing configuration
- Monitoring dashboard SQL queries
- Cost tracking per analysis/user
- Daily cost trend analysis
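Cost calculation depends on the model name and token usage being attached to each generation; for known models Langfuse prices tokens automatically, and custom models can be given pricing overrides. A hedged sketch using the decorator API (the usage numbers are placeholders, and `llm` is the client from Quick Start):
```python
from langfuse.decorators import observe, langfuse_context

@observe(as_type="generation")  # record this call as an LLM generation
async def summarize(text: str) -> str:
    completion = await llm.generate(text)
    # Report model + token usage so Langfuse can price the call.
    langfuse_context.update_current_observation(
        model="gpt-4o-mini",
        usage={"input": 1200, "output": 340},  # placeholder counts
    )
    return completion
```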
### Prompt Management
**See: `references/prompt-management.md`**
Key topics covered:
- Prompt versioning and labels (production/staging/draft)
- Template variables with Jinja2 syntax
- A/B testing prompt versions
- OrchestKit 4-level caching architecture (L1-L4)
- Linking prompts to generation spans
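Fetching a versioned prompt at runtime and linking it to the generation typically looks like the sketch below; the prompt name and variables are hypothetical:
```python
from langfuse.decorators import observe, langfuse_context

@observe(as_type="generation")
async def analyze(content: str) -> str:
    # Fetch the version labeled "production"; the SDK caches prompts client-side.
    prompt = langfuse_client.get_prompt("content-analysis", label="production")
    compiled = prompt.compile(content=content)  # fill template variables
    # Link the prompt version to this generation so the UI can correlate
    # prompt changes with quality and cost.
    langfuse_context.update_current_observation(prompt=prompt)
    return await llm.generate(compiled)
```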
### LLM Evaluation
**See: `references/evaluation-scores.md`**
Key topics covered:
- Custom scoring with numeric/categorical values
- G-Eval automated quality assessment
- Score trends and comparisons
- Filtering traces by score thresholds
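Scores attach to a trace (or a specific observation) by ID, and both numeric and categorical values are supported. A minimal sketch with hypothetical score names, reusing the `trace` object from the Session & User Tracking example:
```python
# Attach a numeric quality score to a finished trace.
langfuse_client.score(
    trace_id=trace.id,
    name="answer_quality",
    value=0.87,  # numeric score, e.g. produced by G-Eval
    comment="G-Eval: factuality + coherence",
)

# Categorical scores take string values instead.
langfuse_client.score(trace_id=trace.id, name="verdict", value="pass")
```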
### Session Tracking
**See: `references/session-tracking.md`**
Key topics covered:
- Grouping traces by session_id
- Multi-turn conversation tracking
- User and metadata analytics
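Inside an `@observe`-traced function, the session and user can be set on the current trace instead of at trace creation; the IDs below are placeholders:
```python
from langfuse.decorators import observe, langfuse_context

@observe()
async def handle_turn(message: str, user_id: str, session_id: str) -> str:
    # Group every turn of a conversation under one session_id so the
    # Langfuse UI renders them as a single threaded session.
    langfuse_context.update_current_trace(
        user_id=user_id,
        session_id=session_id,
    )
    return await llm.generate(message)
```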
### Experiments API
**See: `references/experiments-api.md`**
Key topics covered:
- Creating test datasets in Langfuse
- Running automated evaluations
- Regression testing for LLMs
- Benchmarking prompt versions
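A dataset run usually iterates the items, executes the app, and links each execution to a named run for side-by-side comparison. A sketch of that loop under the v2 SDK's dataset/decorator integration; dataset and run names are hypothetical, and `analyze_content` is the traced function from Quick Start:
```python
# Seed the dataset once (e.g. in a setup script).
langfuse_client.create_dataset(name="analysis-regression")
langfuse_client.create_dataset_item(
    dataset_name="analysis-regression",
    input={"content": "Example article..."},
    expected_output={"category": "news"},
)

async def run_benchmark() -> None:
    dataset = langfuse_client.get_dataset("analysis-regression")
    for item in dataset.items:
        # observe() opens a trace and links it to the named run.
        with item.observe(run_name="prompt-v2-benchmark"):
            await analyze_content(item.input["content"])
```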
### Multi-Judge Evaluation
**See: `references/multi-judge-evaluation.md`**
Key topics covered:
- Multiple LLM judges for quality assessment
- Weighted scoring across judges
- OrchestKit langfuse_evaluators.py integration
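The ensemble pattern is plain Python on top of the scoring API: each judge returns a score, the scores are combined by weight, and both the per-judge and combined results are written back to the trace. A hedged sketch; the judge callables and weights are illustrative stand-ins, not OrchestKit's actual `langfuse_evaluators.py`:
```python
# Hypothetical judges: each is an async LLM-as-judge returning a score in [0, 1].
judges = {
    "factuality_judge": (evaluate_factuality, 0.5),
    "coherence_judge": (evaluate_coherence, 0.3),
    "style_judge": (evaluate_style, 0.2),
}

async def multi_judge_score(trace_id: str, output: str) -> float:
    weighted = 0.0
    for name, (judge, weight) in judges.items():
        score = await judge(output)  # one LLM-as-judge call
        # Record each judge's verdict separately for per-judge trends.
        langfuse_client.score(trace_id=trace_id, name=name, value=score)
        weighted += weight * score
    langfuse_client.score(trace_id=trace_id, name="ensemble", value=weighted)
    return weighted
```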
---
## Best Practices
1. **Always use @observe decorator** for automatic tracing
2. **Set user_id and session_id** for better analytics
3. **Add meaningful metadata** (content_type, analysis_id, etc.)
4. **Score all production traces** for quality monitoring
5. **Use prompt management** instead of hardcoded prompts
6. **Monitor costs daily** to catch spikes early
7. **Create datasets** for regression testing
8. **Tag production vs staging** traces
---
## LangSmith Migration Notes
**Key Differences:**
| Aspect | Langfuse | LangSmith |
|--------|----------|-----------|
| Hosting | Self-hosted, open-source | Cloud-only, proprietary |
| Cost | Free | Paid |
| Prompts | Built-in management | External storage needed |
| Decorator | `@observe` | `@traceable` |
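In most application code the migration is a decorator swap; a before/after sketch:
```python
# Before (LangSmith):
# from langsmith import traceable
#
# @traceable
# def analyze(content: str): ...

# After (Langfuse):
from langfuse.decorators import observe

@observe()
def analyze(content: str): ...
```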
---
## External References
- [Langfuse Docs](https://langfuse.com/docs)
- [Python SDK](https://langfuse.com/docs/sdk/python)
- [Decorators Guide](https://langfuse.com/docs/sdk/python/decorators)
- [Prompt Management](https://langfuse.com/docs/prompts)
- [Self-Hosting](https://langfuse.com/docs/deployment/self-host)
---
## Related Skills
- `observability-monitoring` - General observability patterns for metrics, logging, and alerting
- `llm-evaluation` - Evaluation patterns that integrate with Langfuse scoring
- `llm-streaming` - Streaming response patterns with trace instrumentation
- `prompt-caching` - Caching strategies that reduce costs tracked by Langfuse
## Key Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Observability platform | Langfuse (not LangSmith) | Open-source, self-hosted, free, built-in prompt management |
| Tracing approach | @observe decorator | Automatic, low-overhead instrumentation |
| Cost tracking | Automatic token counting | Built-in model pricing with custom overrides |
| Prompt management | Langfuse native | Version control, A/B testing, labels in one place |
## Capability Details
### distributed-tracing
**Keywords:** trace, tracing, observability, span, nested, parent-child, observe
**Solves:**
- How do I trace LLM calls across my application?
- How to debug slow LLM responses?
- Track execution flow in multi-agent workflows
- Create nested trace spans
### cost-tracking
**Keywords:** cost, token usage, pricing, budget, spend, expense
**Solves:**
- How do I track LLM costs?
- Calculate token usage and pricing
- Monitor AI budget and spending
- Track cost per user or session
### prompt-management
**Keywords:** prompt version, prompt template, prompt control, prompt registry
**Solves:**
- How do I version control prompts?
- Manage prompts in production
- A/B test different prompt versions
- Link prompts to traces
### llm-evaluation
**Keywords:** score, quality, evaluation, rating, assessment, g-eval
**Solves:**
- How do I evaluate LLM output quality?
- Score responses with custom metrics
- Track quality trends over time
- Compare prompt versions by quality
### session-tracking
**Keywords:** session, user tracking, conversation, group traces
**Solves:**
- How do I group related traces?
- Track multi-turn conversations
- Monitor per-user performance
- Organize traces by session
### langchain-integration
**Keywords:** langchain, callback, handler, langgraph integration
**Solves:**
- How do I integrate Langfuse with LangChain?
- Use CallbackHandler for tracing
- Automatic LangGraph workflow tracing
- LangChain observability setup
### datasets-evaluation
**Keywords:** dataset, test set, evaluation dataset, benchmark
**Solves:**
- How do I create test datasets in Langfuse?
- Run automated evaluations
- Regression testing for LLMs
- Benchmark prompt versions
### ab-testing
**Keywords:** a/b test, experiment, compare prompts, variant testing
**Solves:**
- How do I A/B test prompts?
- Compare two prompt versions
- Experimental prompt evaluation
- Statistical prompt testing
### monitoring-dashboard
**Keywords:** dashboard, analytics, metrics, monitoring, queries
**Solves:**
- What are the most expensive traces?
- Average cost by agent type
- Quality score trends
- Custom monitoring queries
### orchestkit-integration
**Keywords:** orchestkit, migration, setup, workflow integration
**Solves:**
- How does OrchestKit use Langfuse?
- Migrate from LangSmith to Langfuse
- OrchestKit workflow tracing patterns
- Cost tracking per analysis
### multi-judge-evaluation
**Keywords:** multi judge, g-eval, multiple evaluators, ensemble evaluation, weighted scoring
**Solves:**
- How do I use multiple LLM judges to evaluate quality?
- Set up G-Eval criteria evaluation
- Configure weighted scoring across judges
- Wire OrchestKit's existing langfuse_evaluators.py
### experiments-api
**Keywords:** experiment, dataset, benchmark, regression test, prompt testing
**Solves:**
- How do I run experiments across datasets?
- A/B test models and prompts systematically
- Track quality regression over time
- Compare experiment results
---
## FAQ
**Do I need a cloud Langfuse account, or can I self-host?**
Langfuse supports self-hosting; this skill assumes a self-hosted deployment, which is free and production-friendly compared to cloud-only alternatives.
**How do I track costs for custom model pricing?**
Configure model pricing overrides in Langfuse and report token usage on each generation; dashboards and SQL queries can then surface per-user or per-analysis spend.