home / skills / jeremylongshore / claude-code-plugins-plus-skills / databricks-prod-checklist

databricks-prod-checklist skill

safe

/plugins/saas-packs/databricks-pack/skills/databricks-prod-checklist

This skill guides Databricks production deployment with a complete checklist, governance, and rollback steps to ensure a safe go-live.

npx playbooks add skill jeremylongshore/claude-code-plugins-plus-skills --skill databricks-prod-checklist

Review the files below or copy the command above to add this skill to your agents.

Files (1)

SKILL.md

7.5 KB

---
name: databricks-prod-checklist
description: |
  Execute Databricks production deployment checklist and rollback procedures.
  Use when deploying Databricks jobs to production, preparing for launch,
  or implementing go-live procedures.
  Trigger with phrases like "databricks production", "deploy databricks",
  "databricks go-live", "databricks launch checklist".
allowed-tools: Read, Bash(databricks:*), Grep
version: 1.0.0
license: MIT
author: Jeremy Longshore <[email protected]>
---

# Databricks Production Checklist

## Overview
Complete checklist for deploying Databricks jobs and pipelines to production.

## Prerequisites
- Staging environment tested and verified
- Production workspace access
- Unity Catalog configured
- Monitoring and alerting ready

## Instructions

### Step 1: Pre-Deployment Configuration

#### Security
- [ ] Service principal configured for automation
- [ ] Secrets in Databricks Secret Scopes (not env vars)
- [ ] Token expiration set (max 90 days)
- [ ] Unity Catalog permissions configured
- [ ] Cluster policies enforced
- [ ] IP access lists configured

#### Infrastructure
- [ ] Production cluster pool configured
- [ ] Instance types validated for workload
- [ ] Autoscaling configured appropriately
- [ ] Spot instance ratio set (cost vs reliability)

### Step 2: Code Quality Verification

#### Testing
- [ ] Unit tests passing
- [ ] Integration tests on staging data
- [ ] Data quality tests defined
- [ ] Performance benchmarks met

```bash
# Run tests via Asset Bundles
databricks bundle validate -t prod
databricks bundle run -t staging test-job

# Verify test results
databricks runs get --run-id $RUN_ID | jq '.state.result_state'
```

#### Code Review
- [ ] No hardcoded credentials
- [ ] Error handling covers all failure modes
- [ ] Logging is production-appropriate
- [ ] Delta Lake best practices followed
- [ ] No `collect()` on large datasets

### Step 3: Job Configuration

```yaml
# resources/prod_job.yml
resources:
  jobs:
    etl_pipeline:
      name: "prod-etl-pipeline"
      tags:
        environment: production
        team: data-engineering
        cost_center: analytics

      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "America/New_York"

      email_notifications:
        on_failure:
          - "[email protected]"
        on_success:
          - "[email protected]"

      webhook_notifications:
        on_failure:
          - id: "slack-webhook-id"

      max_concurrent_runs: 1
      timeout_seconds: 14400  # 4 hours

      tasks:
        - task_key: bronze_ingest
          job_cluster_key: etl_cluster
          notebook_task:
            notebook_path: /Repos/prod/pipelines/bronze
          timeout_seconds: 3600

        - task_key: silver_transform
          depends_on:
            - task_key: bronze_ingest
          job_cluster_key: etl_cluster
          notebook_task:
            notebook_path: /Repos/prod/pipelines/silver

      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 4
            autoscale:
              min_workers: 2
              max_workers: 8
            spark_conf:
              spark.sql.shuffle.partitions: "200"
              spark.databricks.delta.optimizeWrite.enabled: "true"
            instance_pool_id: "prod-pool-id"
```

### Step 4: Deployment Commands

```bash
# Pre-flight checks
echo "=== Pre-flight Checks ==="
databricks workspace list /Repos/prod/  # Verify repo exists
databricks clusters list | grep prod    # Verify pools/clusters
databricks secrets list-scopes          # Verify secrets

# Deploy with Asset Bundles
echo "=== Deploying ==="
databricks bundle deploy -t prod

# Verify deployment
databricks bundle summary -t prod
databricks jobs list | grep prod-etl

# Manual trigger to verify
echo "=== Verification Run ==="
RUN_ID=$(databricks jobs run-now --job-id $JOB_ID | jq -r '.run_id')
echo "Run ID: $RUN_ID"

# Monitor run
databricks runs get --run-id $RUN_ID --wait
```

### Step 5: Monitoring Setup

```python
# monitoring/health_check.py
from databricks.sdk import WorkspaceClient
from datetime import datetime, timedelta

def check_job_health(w: WorkspaceClient, job_id: int) -> dict:
    """Check job health metrics."""
    # Get recent runs
    runs = list(w.jobs.list_runs(
        job_id=job_id,
        completed_only=True,
        limit=10,
    ))

    if not runs:
        return {"status": "NO_RUNS", "healthy": False}

    # Calculate success rate
    successful = sum(1 for r in runs if r.state.result_state == "SUCCESS")
    success_rate = successful / len(runs)

    # Calculate average duration
    durations = [
        (r.end_time - r.start_time) / 1000 / 60  # minutes
        for r in runs if r.end_time
    ]
    avg_duration = sum(durations) / len(durations) if durations else 0

    # Check last run
    last_run = runs[0]
    last_state = last_run.state.result_state

    return {
        "status": "HEALTHY" if success_rate > 0.9 else "DEGRADED",
        "healthy": success_rate > 0.9 and last_state == "SUCCESS",
        "success_rate": success_rate,
        "avg_duration_minutes": avg_duration,
        "last_run_state": last_state,
        "last_run_time": datetime.fromtimestamp(last_run.start_time / 1000),
    }
```

### Step 6: Rollback Procedure

```bash
#!/bin/bash
# rollback.sh - Emergency rollback procedure

JOB_ID=$1
PREVIOUS_VERSION=$2

echo "=== ROLLBACK INITIATED ==="
echo "Job: $JOB_ID"
echo "Target Version: $PREVIOUS_VERSION"

# 1. Pause the job
echo "Pausing job..."
databricks jobs update --job-id $JOB_ID --json '{"settings": {"schedule": null}}'

# 2. Cancel active runs
echo "Cancelling active runs..."
databricks runs list --job-id $JOB_ID --active-only | \
  jq -r '.runs[].run_id' | \
  xargs -I {} databricks runs cancel --run-id {}

# 3. Reset to previous version
echo "Rolling back to version $PREVIOUS_VERSION..."
databricks bundle deploy -t prod --force

# 4. Re-enable schedule
echo "Re-enabling schedule..."
# (restore from backup config)

# 5. Trigger verification run
echo "Triggering verification run..."
databricks jobs run-now --job-id $JOB_ID

echo "=== ROLLBACK COMPLETE ==="
```

## Output
- Deployed production job
- Health checks passing
- Monitoring active
- Rollback procedure documented

## Error Handling
| Alert | Condition | Severity |
|-------|-----------|----------|
| Job Failed | `result_state = FAILED` | P1 |
| Long Running | Duration > 2x average | P2 |
| Consecutive Failures | 3+ failures in a row | P1 |
| Data Quality | Expectations failed | P2 |

## Examples

### Production Health Dashboard Query
```sql
-- Job health metrics (Unity Catalog system tables)
SELECT
  job_id,
  job_name,
  COUNT(*) as total_runs,
  SUM(CASE WHEN result_state = 'SUCCESS' THEN 1 ELSE 0 END) as successes,
  AVG(execution_duration) / 60000 as avg_minutes,
  MAX(start_time) as last_run
FROM system.lakeflow.job_run_timeline
WHERE start_time > current_timestamp() - INTERVAL 7 DAYS
GROUP BY job_id, job_name
ORDER BY total_runs DESC
```

### Pre-Production Verification
```bash
# Comprehensive pre-prod check
databricks bundle validate -t prod && \
databricks bundle deploy -t prod --dry-run && \
echo "Validation passed, ready to deploy"
```

## Resources
- [Databricks Production Best Practices](https://docs.databricks.com/dev-tools/bundles/best-practices.html)
- [Job Configuration](https://docs.databricks.com/workflows/jobs/jobs.html)
- [Monitoring Guide](https://docs.databricks.com/administration-guide/workspace/monitoring.html)

## Next Steps
For version upgrades, see `databricks-upgrade-migration`.

Overview

This skill runs a Databricks production deployment checklist and provides rollback procedures to safely launch jobs and pipelines. It guides pre-deployment security and infrastructure checks, code quality verification, job configuration validation, deployment commands, monitoring setup, and emergency rollback steps. Use it to reduce deployment risk and streamline go-live operations.

How this skill works

The skill inspects staging sign-offs, workspace and secret scope presence, cluster and pool configurations, and job definitions. It validates tests and code review items, runs deployment commands (asset bundles), and sets up health checks that compute success rate and average duration. It also surfaces a scripted rollback flow to pause jobs, cancel active runs, redeploy a prior version, and trigger verification runs.

When to use it

Deploying Databricks jobs or pipelines to production
Preparing for a go-live or launch window
Verifying production cluster, pool, and secret configuration
Running pre-flight tests and manual verification runs
Executing an emergency rollback after a failing release

Best practices

Keep service principals and tokens rotated; avoid long-lived personal tokens
Store secrets in Databricks Secret Scopes; do not hardcode credentials
Validate on staging with integration and performance tests before prod deploy
Enforce cluster policies, pools, and autoscaling suitable for workload
Implement monitoring with success-rate thresholds and alerts for P1 conditions
Document and test rollback steps and maintain backup job configs

Example use cases

Pre-flight checklist before a scheduled ETL release to production
Automated deployment using databricks bundle deploy and post-deploy verification run
Health check script that reports success rate and average run duration to monitoring
Emergency rollback: pause schedules, cancel active runs, redeploy previous bundle, and verify
Audit-ready job YAML with tags, notifications, schedules, and resource configuration

FAQ

What triggers should I use to run this checklist?

Use triggers like "databricks production", "deploy databricks", "databricks go-live", or "databricks launch checklist" before any prod push.

How do I verify a rollback succeeded?

After rollback, run a verification job and confirm runs return SUCCESS. Check monitoring metrics: success rate, recent run state, and average duration to match expected baselines.