
langsmith skill

/17-observability/langsmith

This skill helps you debug, evaluate, and monitor LLM applications with LangSmith observability, capturing traces, datasets, and metrics so you can ship reliable AI applications.

npx playbooks add skill orchestra-research/ai-research-skills --skill langsmith

Review the files below or copy the command above to add this skill to your agents.

Files (3)

SKILL.md (9.5 KB)
---
name: langsmith-observability
description: LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, monitoring production systems, or building systematic testing pipelines for AI applications.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Observability, LangSmith, Tracing, Evaluation, Monitoring, Debugging, Testing, LLM Ops, Production]
dependencies: [langsmith>=0.2.0]
---

# LangSmith - LLM Observability Platform

Development platform for debugging, evaluating, and monitoring language models and AI applications.

## When to use LangSmith

**Use LangSmith when:**
- Debugging LLM application issues (prompts, chains, agents)
- Evaluating model outputs systematically against datasets
- Monitoring production LLM systems
- Building regression testing for AI features
- Analyzing latency, token usage, and costs
- Collaborating on prompt engineering

**Key features:**
- **Tracing**: Capture inputs, outputs, and latency for every LLM call
- **Evaluation**: Systematic testing with built-in and custom evaluators
- **Datasets**: Create test sets from production traces or manually
- **Monitoring**: Track metrics, errors, and costs in production
- **Integrations**: Works with OpenAI, Anthropic, LangChain, LlamaIndex

**Use alternatives instead:**
- **Weights & Biases**: Deep learning experiment tracking, model training
- **MLflow**: General ML lifecycle, model registry focus
- **Arize/WhyLabs**: ML monitoring, data drift detection

## Quick start

### Installation

```bash
pip install langsmith

# Set environment variables
export LANGSMITH_API_KEY="your-api-key"
export LANGSMITH_TRACING=true
```

### Basic tracing with @traceable

```python
from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable
def generate_response(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Automatically traced to LangSmith
result = generate_response("What is machine learning?")
```

### OpenAI wrapper (automatic tracing)

```python
from langsmith.wrappers import wrap_openai
from openai import OpenAI

# Wrap client for automatic tracing
client = wrap_openai(OpenAI())

# All calls automatically traced
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

## Core concepts

### Runs and traces

A **run** is a single execution unit (LLM call, chain, tool). Runs form hierarchical **traces** showing the full execution flow.

```python
from langsmith import traceable

@traceable(run_type="chain")
def process_query(query: str) -> str:
    # Parent run
    context = retrieve_context(query)  # Child run
    response = generate_answer(query, context)  # Child run
    return response

@traceable(run_type="retriever")
def retrieve_context(query: str) -> list:
    return vector_store.search(query)

@traceable(run_type="llm")
def generate_answer(query: str, context: list) -> str:
    return llm.invoke(f"Context: {context}\n\nQuestion: {query}")
```

### Projects

Projects organize related runs. Set via environment or code:

```python
import os
os.environ["LANGSMITH_PROJECT"] = "my-project"

# Or per-function
@traceable(project_name="my-project")
def my_function():
    pass
```

## Client API

```python
from langsmith import Client

client = Client()

# List runs
runs = list(client.list_runs(
    project_name="my-project",
    filter='eq(status, "success")',
    limit=100
))

# Get run details
run = client.read_run(run_id="...")

# Create feedback
client.create_feedback(
    run_id="...",
    key="correctness",
    score=0.9,
    comment="Good answer"
)
```
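
The same run listing can double as a lightweight cost and latency check. A minimal sketch, assuming the returned `Run` objects have `total_tokens`, `start_time`, and `end_time` populated (availability can vary by run type and SDK version):

```python
from langsmith import Client

client = Client()

# Rough usage summary built from listed LLM runs in one project
runs = list(client.list_runs(project_name="my-project", run_type="llm", limit=500))

total_tokens = sum(r.total_tokens or 0 for r in runs)
latencies = [
    (r.end_time - r.start_time).total_seconds()
    for r in runs
    if r.start_time and r.end_time
]

print(f"Runs: {len(runs)}")
print(f"Total tokens: {total_tokens}")
if latencies:
    print(f"Avg latency: {sum(latencies) / len(latencies):.2f}s")
```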

## Datasets and evaluation

### Create dataset

```python
from langsmith import Client

client = Client()

# Create dataset
dataset = client.create_dataset("qa-test-set", description="QA evaluation")

# Add examples
client.create_examples(
    inputs=[
        {"question": "What is Python?"},
        {"question": "What is ML?"}
    ],
    outputs=[
        {"answer": "A programming language"},
        {"answer": "Machine learning"}
    ],
    dataset_id=dataset.id
)
```
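
Datasets can also be seeded from production traces instead of hand-written examples. A minimal sketch, assuming the selected runs have dict-shaped `inputs` and `outputs` that match the schema you want in the dataset:

```python
from langsmith import Client

client = Client()

dataset = client.create_dataset(
    "prod-samples", description="Examples captured from production traces"
)

# Pull a handful of successful production runs and copy their I/O as examples
runs = list(client.list_runs(
    project_name="my-project",
    filter='eq(status, "success")',
    limit=20,
))

client.create_examples(
    inputs=[run.inputs for run in runs],
    outputs=[run.outputs for run in runs],
    dataset_id=dataset.id,
)
```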

### Run evaluation

```python
from langsmith import evaluate

def my_model(inputs: dict) -> dict:
    # Your model logic
    return {"answer": generate_answer(inputs["question"])}

def correctness_evaluator(run, example):
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]
    score = 1.0 if reference.lower() in prediction.lower() else 0.0
    return {"key": "correctness", "score": score}

results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[correctness_evaluator],
    experiment_prefix="v1"
)

print(f"Average score: {results.aggregate_metrics['correctness']}")
```

### Built-in evaluators

```python
from langsmith import evaluate
from langsmith.evaluation import LangChainStringEvaluator

# Use LangChain evaluators
results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[
        LangChainStringEvaluator("qa"),
        LangChainStringEvaluator("cot_qa")
    ]
)
```

## Advanced tracing

### Tracing context

```python
from langsmith import tracing_context

with tracing_context(
    project_name="experiment-1",
    tags=["production", "v2"],
    metadata={"version": "2.0"}
):
    # All traceable calls inherit context
    result = my_function()
```

### Manual runs

```python
from langsmith import trace

with trace(
    name="custom_operation",
    run_type="tool",
    inputs={"query": "test"}
) as run:
    result = do_something()
    run.end(outputs={"result": result})
```

### Process inputs/outputs

```python
def sanitize_inputs(inputs: dict) -> dict:
    # Return a copy so the original call arguments are not mutated
    if "password" in inputs:
        return {**inputs, "password": "***"}
    return inputs

@traceable(process_inputs=sanitize_inputs)
def login(username: str, password: str):
    return authenticate(username, password)
```

### Sampling

```python
import os
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"  # 10% sampling
```

## LangChain integration

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Tracing enabled automatically with LANGSMITH_TRACING=true
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])

chain = prompt | llm

# All chain runs traced automatically
response = chain.invoke({"input": "Hello!"})
```

## Production monitoring

### Hub prompts

```python
from langsmith import Client

client = Client()

# Pull prompt from hub
prompt = client.pull_prompt("my-org/qa-prompt")

# Use in application
result = prompt.invoke({"question": "What is AI?"})
```

### Async client

```python
from langsmith import AsyncClient

async def main():
    client = AsyncClient()

    runs = []
    async for run in client.list_runs(project_name="my-project"):
        runs.append(run)

    return runs
```

### Feedback collection

```python
from langsmith import Client

client = Client()

# Collect user feedback
def record_feedback(run_id: str, user_rating: int, comment: str | None = None):
    client.create_feedback(
        run_id=run_id,
        key="user_rating",
        score=user_rating / 5.0,  # Normalize to 0-1
        comment=comment
    )

# In your application
record_feedback(run_id="...", user_rating=4, comment="Helpful response")
```

## Testing integration

### Pytest integration

```python
from langsmith import test

@test
def test_qa_accuracy():
    result = my_qa_function("What is Python?")
    assert "programming" in result.lower()
```

### Evaluation in CI/CD

```python
from langsmith import evaluate

def run_evaluation():
    results = evaluate(
        my_model,
        data="regression-test-set",
        evaluators=[accuracy_evaluator]
    )

    # Fail CI if accuracy drops
    assert results.aggregate_metrics["accuracy"] >= 0.9, \
        f"Accuracy {results.aggregate_metrics['accuracy']} below threshold"
```

## Best practices

1. **Structured naming** - Use consistent project/run naming conventions
2. **Add metadata** - Include version, environment, and user info (see the sketch after this list)
3. **Sample in production** - Use sampling rate to control volume
4. **Create datasets** - Build test sets from interesting production cases
5. **Automate evaluation** - Run evaluations in CI/CD pipelines
6. **Monitor costs** - Track token usage and latency trends
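
As a sketch of practices 1-3: `@traceable` accepts `name`, `tags`, and `metadata`, and the sampling rate can be set alongside the project name. The project name, tag, and metadata values below are illustrative, not a required convention:

```python
import os
from langsmith import traceable

# Illustrative naming convention: <app>-<environment>
os.environ["LANGSMITH_PROJECT"] = "support-bot-prod"
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.25"  # trace 25% of requests

@traceable(
    name="answer_ticket",
    run_type="chain",
    tags=["prod", "v2"],
    metadata={"release": "2.3.1", "region": "eu-west-1"},
)
def answer_ticket(question: str) -> str:
    ...
```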

## Common issues

**Traces not appearing:**
```python
import os
# Ensure tracing is enabled
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "your-key"

# Verify connection and credentials
from langsmith import Client
client = Client()
print(list(client.list_projects()))  # Should list projects without errors
```

**High latency from tracing:**
```python
# Enable background batching (default)
from langsmith import Client
client = Client(auto_batch_tracing=True)

# Or reduce volume with sampling
import os
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"
```

**Large payloads:**
```python
# Hide sensitive/large fields
@traceable(
    process_inputs=lambda x: {k: v for k, v in x.items() if k != "large_field"}
)
def my_function(data):
    pass
```

## References

- **[Advanced Usage](references/advanced-usage.md)** - Custom evaluators, distributed tracing, hub prompts
- **[Troubleshooting](references/troubleshooting.md)** - Common issues, debugging, performance

## Resources

- **Documentation**: https://docs.smith.langchain.com
- **Python SDK**: https://github.com/langchain-ai/langsmith-sdk
- **Web App**: https://smith.langchain.com
- **Version**: 0.2.0+
- **License**: MIT

Overview

This skill provides an LLM observability platform for tracing, evaluation, and monitoring of AI applications. It captures hierarchical runs and traces, organizes them into projects and datasets, and supports automated evaluation and feedback. Use it to debug prompts and chains, monitor production systems, and build regression tests for AI features.

How this skill works

The skill instruments LLM calls, chains, and tools using decorators and client wrappers to automatically record inputs, outputs, latency, and metadata into traceable runs. It exposes Client and AsyncClient APIs to list and inspect runs, create feedback, and manage datasets. Built-in evaluators and an evaluate function run systematic tests against datasets and aggregate metrics for CI/CD integration.

When to use it

  • Debugging prompt chains, agents, or unexpected model behavior
  • Evaluating model outputs systematically against labeled datasets
  • Monitoring production LLM systems for errors, latency, and cost
  • Creating regression tests and automated evaluations in CI/CD
  • Sampling production traffic to build realistic test datasets

Best practices

  • Use structured project and run naming to make traces searchable
  • Attach metadata (version, environment, user info) to runs for context
  • Sample production traces to limit volume and control costs
  • Sanitize or redact sensitive/large fields before tracing
  • Automate evaluation runs in CI and fail builds on metric regression

Example use cases

  • Wrap your OpenAI or Anthropic client to auto-trace all chat calls and inspect latency spikes
  • Create a QA dataset from production traces and run nightly accuracy evaluations
  • Instrument a retrieval-augmented chain with traceable decorators to debug context retrieval
  • Collect user feedback per run and convert ratings into normalized feedback metrics
  • Set up sampling and monitoring dashboards to track token usage and cost trends

FAQ

How do I enable automatic tracing?

Set the LANGSMITH_TRACING environment variable to true, and optionally wrap your client with the provided wrappers (such as wrap_openai) for automatic instrumentation.

Can I evaluate models in CI/CD?

Yes. Use the evaluate function with datasets and evaluators, then assert on the aggregate metrics to gate CI jobs.