home / skills / jeremylongshore / claude-code-plugins-plus-skills / databricks-prod-checklist
/plugins/saas-packs/databricks-pack/skills/databricks-prod-checklist
This skill guides Databricks production deployment with a complete checklist, governance, and rollback steps to ensure a safe go-live.
npx playbooks add skill jeremylongshore/claude-code-plugins-plus-skills --skill databricks-prod-checklistReview the files below or copy the command above to add this skill to your agents.
---
name: databricks-prod-checklist
description: |
Execute Databricks production deployment checklist and rollback procedures.
Use when deploying Databricks jobs to production, preparing for launch,
or implementing go-live procedures.
Trigger with phrases like "databricks production", "deploy databricks",
"databricks go-live", "databricks launch checklist".
allowed-tools: Read, Bash(databricks:*), Grep
version: 1.0.0
license: MIT
author: Jeremy Longshore <[email protected]>
---
# Databricks Production Checklist
## Overview
Complete checklist for deploying Databricks jobs and pipelines to production.
## Prerequisites
- Staging environment tested and verified
- Production workspace access
- Unity Catalog configured
- Monitoring and alerting ready
## Instructions
### Step 1: Pre-Deployment Configuration
#### Security
- [ ] Service principal configured for automation
- [ ] Secrets in Databricks Secret Scopes (not env vars)
- [ ] Token expiration set (max 90 days)
- [ ] Unity Catalog permissions configured
- [ ] Cluster policies enforced
- [ ] IP access lists configured
#### Infrastructure
- [ ] Production cluster pool configured
- [ ] Instance types validated for workload
- [ ] Autoscaling configured appropriately
- [ ] Spot instance ratio set (cost vs reliability)
### Step 2: Code Quality Verification
#### Testing
- [ ] Unit tests passing
- [ ] Integration tests on staging data
- [ ] Data quality tests defined
- [ ] Performance benchmarks met
```bash
# Run tests via Asset Bundles
databricks bundle validate -t prod
databricks bundle run -t staging test-job
# Verify test results
databricks runs get --run-id $RUN_ID | jq '.state.result_state'
```
#### Code Review
- [ ] No hardcoded credentials
- [ ] Error handling covers all failure modes
- [ ] Logging is production-appropriate
- [ ] Delta Lake best practices followed
- [ ] No `collect()` on large datasets
### Step 3: Job Configuration
```yaml
# resources/prod_job.yml
resources:
jobs:
etl_pipeline:
name: "prod-etl-pipeline"
tags:
environment: production
team: data-engineering
cost_center: analytics
schedule:
quartz_cron_expression: "0 0 6 * * ?"
timezone_id: "America/New_York"
email_notifications:
on_failure:
- "[email protected]"
on_success:
- "[email protected]"
webhook_notifications:
on_failure:
- id: "slack-webhook-id"
max_concurrent_runs: 1
timeout_seconds: 14400 # 4 hours
tasks:
- task_key: bronze_ingest
job_cluster_key: etl_cluster
notebook_task:
notebook_path: /Repos/prod/pipelines/bronze
timeout_seconds: 3600
- task_key: silver_transform
depends_on:
- task_key: bronze_ingest
job_cluster_key: etl_cluster
notebook_task:
notebook_path: /Repos/prod/pipelines/silver
job_clusters:
- job_cluster_key: etl_cluster
new_cluster:
spark_version: "14.3.x-scala2.12"
node_type_id: "Standard_DS3_v2"
num_workers: 4
autoscale:
min_workers: 2
max_workers: 8
spark_conf:
spark.sql.shuffle.partitions: "200"
spark.databricks.delta.optimizeWrite.enabled: "true"
instance_pool_id: "prod-pool-id"
```
### Step 4: Deployment Commands
```bash
# Pre-flight checks
echo "=== Pre-flight Checks ==="
databricks workspace list /Repos/prod/ # Verify repo exists
databricks clusters list | grep prod # Verify pools/clusters
databricks secrets list-scopes # Verify secrets
# Deploy with Asset Bundles
echo "=== Deploying ==="
databricks bundle deploy -t prod
# Verify deployment
databricks bundle summary -t prod
databricks jobs list | grep prod-etl
# Manual trigger to verify
echo "=== Verification Run ==="
RUN_ID=$(databricks jobs run-now --job-id $JOB_ID | jq -r '.run_id')
echo "Run ID: $RUN_ID"
# Monitor run
databricks runs get --run-id $RUN_ID --wait
```
### Step 5: Monitoring Setup
```python
# monitoring/health_check.py
from databricks.sdk import WorkspaceClient
from datetime import datetime, timedelta
def check_job_health(w: WorkspaceClient, job_id: int) -> dict:
"""Check job health metrics."""
# Get recent runs
runs = list(w.jobs.list_runs(
job_id=job_id,
completed_only=True,
limit=10,
))
if not runs:
return {"status": "NO_RUNS", "healthy": False}
# Calculate success rate
successful = sum(1 for r in runs if r.state.result_state == "SUCCESS")
success_rate = successful / len(runs)
# Calculate average duration
durations = [
(r.end_time - r.start_time) / 1000 / 60 # minutes
for r in runs if r.end_time
]
avg_duration = sum(durations) / len(durations) if durations else 0
# Check last run
last_run = runs[0]
last_state = last_run.state.result_state
return {
"status": "HEALTHY" if success_rate > 0.9 else "DEGRADED",
"healthy": success_rate > 0.9 and last_state == "SUCCESS",
"success_rate": success_rate,
"avg_duration_minutes": avg_duration,
"last_run_state": last_state,
"last_run_time": datetime.fromtimestamp(last_run.start_time / 1000),
}
```
### Step 6: Rollback Procedure
```bash
#!/bin/bash
# rollback.sh - Emergency rollback procedure
JOB_ID=$1
PREVIOUS_VERSION=$2
echo "=== ROLLBACK INITIATED ==="
echo "Job: $JOB_ID"
echo "Target Version: $PREVIOUS_VERSION"
# 1. Pause the job
echo "Pausing job..."
databricks jobs update --job-id $JOB_ID --json '{"settings": {"schedule": null}}'
# 2. Cancel active runs
echo "Cancelling active runs..."
databricks runs list --job-id $JOB_ID --active-only | \
jq -r '.runs[].run_id' | \
xargs -I {} databricks runs cancel --run-id {}
# 3. Reset to previous version
echo "Rolling back to version $PREVIOUS_VERSION..."
databricks bundle deploy -t prod --force
# 4. Re-enable schedule
echo "Re-enabling schedule..."
# (restore from backup config)
# 5. Trigger verification run
echo "Triggering verification run..."
databricks jobs run-now --job-id $JOB_ID
echo "=== ROLLBACK COMPLETE ==="
```
## Output
- Deployed production job
- Health checks passing
- Monitoring active
- Rollback procedure documented
## Error Handling
| Alert | Condition | Severity |
|-------|-----------|----------|
| Job Failed | `result_state = FAILED` | P1 |
| Long Running | Duration > 2x average | P2 |
| Consecutive Failures | 3+ failures in a row | P1 |
| Data Quality | Expectations failed | P2 |
## Examples
### Production Health Dashboard Query
```sql
-- Job health metrics (Unity Catalog system tables)
SELECT
job_id,
job_name,
COUNT(*) as total_runs,
SUM(CASE WHEN result_state = 'SUCCESS' THEN 1 ELSE 0 END) as successes,
AVG(execution_duration) / 60000 as avg_minutes,
MAX(start_time) as last_run
FROM system.lakeflow.job_run_timeline
WHERE start_time > current_timestamp() - INTERVAL 7 DAYS
GROUP BY job_id, job_name
ORDER BY total_runs DESC
```
### Pre-Production Verification
```bash
# Comprehensive pre-prod check
databricks bundle validate -t prod && \
databricks bundle deploy -t prod --dry-run && \
echo "Validation passed, ready to deploy"
```
## Resources
- [Databricks Production Best Practices](https://docs.databricks.com/dev-tools/bundles/best-practices.html)
- [Job Configuration](https://docs.databricks.com/workflows/jobs/jobs.html)
- [Monitoring Guide](https://docs.databricks.com/administration-guide/workspace/monitoring.html)
## Next Steps
For version upgrades, see `databricks-upgrade-migration`.
This skill runs a Databricks production deployment checklist and provides rollback procedures to safely launch jobs and pipelines. It guides pre-deployment security and infrastructure checks, code quality verification, job configuration validation, deployment commands, monitoring setup, and emergency rollback steps. Use it to reduce deployment risk and streamline go-live operations.
The skill inspects staging sign-offs, workspace and secret scope presence, cluster and pool configurations, and job definitions. It validates tests and code review items, runs deployment commands (asset bundles), and sets up health checks that compute success rate and average duration. It also surfaces a scripted rollback flow to pause jobs, cancel active runs, redeploy a prior version, and trigger verification runs.
What triggers should I use to run this checklist?
Use triggers like "databricks production", "deploy databricks", "databricks go-live", or "databricks launch checklist" before any prod push.
How do I verify a rollback succeeded?
After rollback, run a verification job and confirm runs return SUCCESS. Check monitoring metrics: success rate, recent run state, and average duration to match expected baselines.