
This skill helps you monitor production ML models with drift detection, performance analysis, and actionable insights across cohorts.

npx playbooks add skill a5c-ai/babysitter --skill arize-observability

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: arize-observability
description: Arize AI skill for production ML monitoring, embedding drift, and performance analysis.
allowed-tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
---

# arize-observability

## Overview

Arize AI skill for production ML monitoring, embedding drift detection, and comprehensive performance analysis.

## Capabilities

- Production data logging
- Embedding drift detection for NLP/CV models
- Performance monitoring dashboards
- Root cause analysis
- Slice-and-dice analysis across segments
- Bias monitoring
- A/B test monitoring
- Custom metrics and monitors

## Target Processes

- Model Performance Monitoring and Drift Detection
- ML System Observability and Incident Response
- Model Evaluation and Validation Framework

## Tools and Libraries

- Arize AI SDK
- pandas
- numpy

## Input Schema

```json
{
  "type": "object",
  "required": ["action"],
  "properties": {
    "action": {
      "type": "string",
      "enum": ["log", "monitor", "analyze", "alert-config", "compare"],
      "description": "Arize action to perform"
    },
    "logConfig": {
      "type": "object",
      "properties": {
        "modelId": { "type": "string" },
        "modelVersion": { "type": "string" },
        "modelType": { "type": "string", "enum": ["score_categorical", "regression", "ranking"] },
        "environment": { "type": "string", "enum": ["training", "validation", "production"] },
        "dataPath": { "type": "string" },
        "predictionIdColumn": { "type": "string" },
        "timestampColumn": { "type": "string" },
        "featureColumns": { "type": "array", "items": { "type": "string" } },
        "embeddingColumns": { "type": "array", "items": { "type": "string" } },
        "predictionColumn": { "type": "string" },
        "actualColumn": { "type": "string" }
      }
    },
    "monitorConfig": {
      "type": "object",
      "properties": {
        "metrics": { "type": "array", "items": { "type": "string" } },
        "thresholds": { "type": "object" },
        "schedule": { "type": "string" }
      }
    },
    "analysisConfig": {
      "type": "object",
      "properties": {
        "analysisType": { "type": "string", "enum": ["drift", "performance", "fairness", "data_quality"] },
        "timeRange": { "type": "object" },
        "segments": { "type": "array", "items": { "type": "string" } }
      }
    }
  }
}
```

## Output Schema

```json
{
  "type": "object",
  "required": ["status", "action"],
  "properties": {
    "status": {
      "type": "string",
      "enum": ["success", "error"]
    },
    "action": {
      "type": "string"
    },
    "logId": {
      "type": "string"
    },
    "dashboardUrl": {
      "type": "string"
    },
    "analysis": {
      "type": "object",
      "properties": {
        "overallScore": { "type": "number" },
        "driftMetrics": { "type": "object" },
        "performanceMetrics": { "type": "object" },
        "topIssues": { "type": "array" },
        "recommendations": { "type": "array", "items": { "type": "string" } }
      }
    },
    "alerts": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "severity": { "type": "string" },
          "triggered": { "type": "boolean" }
        }
      }
    }
  }
}
```
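A caller consuming this output will typically filter for alerts that actually fired before escalating. A minimal sketch of that triage step, where `result` is a hypothetical example conforming to the schema above (not real Arize output):

```python
# Sketch: triage a result conforming to the output schema above.
# The `result` dict is a hypothetical example, not real Arize output.
result = {
    "status": "success",
    "action": "analyze",
    "analysis": {
        "overallScore": 0.82,
        "recommendations": ["Retrain on recent traffic"],
    },
    "alerts": [
        {"name": "accuracy-drop", "severity": "high", "triggered": True},
        {"name": "psi-amount", "severity": "low", "triggered": False},
    ],
}

def triggered_alerts(result: dict) -> list[dict]:
    """Return only alerts that actually fired, highest severity first."""
    order = {"high": 0, "medium": 1, "low": 2}
    fired = [a for a in result.get("alerts", []) if a.get("triggered")]
    return sorted(fired, key=lambda a: order.get(a.get("severity"), 3))

fired = triggered_alerts(result)
```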

## Usage Example

```javascript
{
  kind: 'skill',
  title: 'Log production predictions to Arize',
  skill: {
    name: 'arize-observability',
    context: {
      action: 'log',
      logConfig: {
        modelId: 'fraud-detector',
        modelVersion: '2.0.0',
        modelType: 'score_categorical',
        environment: 'production',
        dataPath: 'data/production_predictions.parquet',
        predictionIdColumn: 'request_id',
        timestampColumn: 'timestamp',
        featureColumns: ['amount', 'merchant_category', 'hour'],
        predictionColumn: 'fraud_probability',
        actualColumn: 'is_fraud'
      }
    }
  }
}
```

Overview

This skill connects models to Arize AI for production monitoring, embedding drift detection, and performance analysis. It streamlines logging, automated monitors, and slice-level investigations so teams can detect degradation, bias, and data issues quickly. Use it to centralize observability and generate actionable recommendations.

How this skill works

The skill ingests prediction data, features, embeddings, and ground truth into Arize via configurable log jobs. It configures monitors and runs analyses (drift, performance, fairness, data quality) on chosen time ranges and segments. Outputs include log identifiers, dashboard URLs, analysis summaries, and triggered alerts for rapid triage.
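Before a log job runs, the batch must contain every column named in logConfig; a feature column missing from the dataframe, for example, would break monitors downstream. A minimal pre-flight check, sketched with pandas using the column names from the usage example above (`validate_log_batch` is a hypothetical helper, not part of the Arize SDK):

```python
import pandas as pd

def validate_log_batch(df: pd.DataFrame, log_config: dict) -> list[str]:
    """Return names of columns required by log_config but absent from df.

    A hypothetical pre-flight helper; the real skill performs its own checks.
    """
    required = [
        log_config.get("predictionIdColumn"),
        log_config.get("timestampColumn"),
        log_config.get("predictionColumn"),
    ]
    required += log_config.get("featureColumns", [])
    # Actuals are left optional here: ground truth often arrives later
    # as delayed labels and can be logged in a follow-up job.
    return [c for c in required if c and c not in df.columns]

log_config = {
    "predictionIdColumn": "request_id",
    "timestampColumn": "timestamp",
    "predictionColumn": "fraud_probability",
    "featureColumns": ["amount", "merchant_category", "hour"],
}

batch = pd.DataFrame({
    "request_id": ["a1", "a2"],
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "fraud_probability": [0.1, 0.9],
    "amount": [12.5, 250.0],
    "merchant_category": ["grocery", "electronics"],
})

missing = validate_log_batch(batch, log_config)  # -> ["hour"]
```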

When to use it

  • Deploying a model to production and needing continuous performance monitoring
  • Detecting embedding or feature drift for NLP or CV models
  • Investigating model regressions or anomalies after new releases
  • Running slice-level fairness and bias checks on production traffic
  • Automating A/B test monitoring and guardrails for experiments

Best practices

  • Log consistent identifiers, timestamps, and schema across training and production
  • Include embeddings for NLP/CV models to enable drift and similarity analysis
  • Define clear alert thresholds and schedules aligned with business SLAs
  • Segment traffic by meaningful slices (region, cohort, device) before analysis
  • Store raw logs (or pointers) and maintain versioned model metadata for root cause analysis
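Alert thresholds are easiest to reason about against a concrete drift metric. Population Stability Index (PSI) is one common choice, with a conventional rule of thumb treating values above roughly 0.2 as significant drift. A minimal numpy sketch, not the skill's internal implementation:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a current sample.

    Bins come from the baseline's quantiles; production values are clipped
    into the baseline range, and a small epsilon avoids log(0) on empty bins.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    clipped = np.clip(actual, edges[0], edges[-1])
    a_frac = np.histogram(clipped, bins=edges)[0] / len(actual)
    eps = 1e-6
    e_frac, a_frac = e_frac + eps, a_frac + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)
stable = rng.normal(0.0, 1.0, 10_000)   # same distribution: PSI near 0
shifted = rng.normal(0.8, 1.0, 10_000)  # mean shift: PSI well above 0.2
```

The same function works per feature or per embedding dimension, which is how a per-slice threshold schedule would be wired up.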

Example use cases

  • Daily batch job that uploads production predictions, actuals, and embeddings to Arize
  • Scheduled drift analysis comparing the last 7 days to a production baseline
  • Configuring monitors for accuracy drop and triggering Slack alerts for incidents
  • Running fairness scans across demographic slices after a model update
  • Comparing A/B test model versions to surface performance regressions
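The last use case, comparing model versions, reduces to computing the same metric per version and flagging a regression beyond a tolerance. A hedged pandas sketch over made-up traffic (the skill's `compare` action would run this kind of comparison against data already logged in Arize):

```python
import pandas as pd

# Hypothetical A/B traffic: predictions and delayed actuals for two versions.
traffic = pd.DataFrame({
    "model_version": ["2.0.0"] * 4 + ["2.1.0"] * 4,
    "predicted": [1, 0, 1, 0, 1, 1, 0, 0],
    "actual":    [1, 0, 1, 1, 0, 1, 1, 0],
})

def accuracy_by_version(df: pd.DataFrame) -> pd.Series:
    """Per-version accuracy over rows that have ground truth."""
    labeled = df.dropna(subset=["actual"])
    correct = labeled["predicted"] == labeled["actual"]
    return correct.groupby(labeled["model_version"]).mean()

acc = accuracy_by_version(traffic)
# Flag the candidate if it underperforms the baseline by more than 2 points.
regressed = acc["2.1.0"] < acc["2.0.0"] - 0.02
```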

FAQ

What inputs are required to log production data?

Provide modelId, modelVersion, modelType, environment, and dataPath, plus the prediction ID, timestamp, and prediction columns. Include feature, embedding, and actual columns when available.

Which analyses can the skill run?

It supports drift, performance, fairness, and data quality analyses, plus slice-level breakdowns and root cause summaries.

What outputs should I expect after an analysis?

You'll receive a status, action type, optional logId, dashboardUrl, an analysis object with scores and recommendations, and any alerts triggered.