home / skills / benchflow-ai / skillsbench / senior-data-engineer

senior-data-engineer skill

/tasks/flink-query/environment/skills/senior-data-engineer

This skill helps you design and optimize scalable data pipelines and governance using batch and streaming patterns for reliable, cost-efficient data platforms.

npx playbooks add skill benchflow-ai/skillsbench --skill senior-data-engineer

Review the files below or copy the command above to add this skill to your agents.

Files (14)
SKILL.md
23.1 KB
---

# === CORE IDENTITY ===
name: senior-data-engineer
title: Senior Data Engineer Skill Package
description: World-class data engineering skill for building scalable data pipelines, ETL/ELT systems, real-time streaming, and data infrastructure. Expertise in Python, SQL, Spark, Airflow, dbt, Kafka, Flink, Kinesis, and modern data stack. Includes data modeling, pipeline orchestration, data quality, streaming quality monitoring, and DataOps. Use when designing data architectures, building batch or streaming data pipelines, optimizing data workflows, or implementing data governance.
domain: engineering
subdomain: data-engineering

# === WEBSITE DISPLAY ===
difficulty: advanced
time-saved: "TODO: Quantify time savings"
frequency: "TODO: Estimate usage frequency"
use-cases:
  - Designing data pipelines for ETL/ELT processes
  - Building data warehouses and data lakes
  - Implementing data quality and governance frameworks
  - Creating analytics dashboards and reporting
  - Building real-time streaming pipelines with Kafka and Flink
  - Implementing exactly-once streaming semantics
  - Monitoring streaming quality (consumer lag, data freshness, schema drift)

# === RELATIONSHIPS ===
related-agents: []
related-skills: []
related-commands: []
orchestrated-by: []

# === TECHNICAL ===
dependencies:
  scripts: []
  references: []
  assets: []
compatibility:
  python-version: 3.8+
  platforms: [macos, linux, windows]
tech-stack:
  - Python
  - SQL
  - Apache Spark
  - Airflow
  - dbt
  - Apache Kafka
  - Apache Flink
  - AWS Kinesis
  - Spark Structured Streaming
  - Kafka Streams
  - PostgreSQL
  - BigQuery
  - Snowflake
  - Docker
  - Schema Registry

# === EXAMPLES ===
examples:
  -
    title: Example Usage
    input: "TODO: Add example input for senior-data-engineer"
    output: "TODO: Add expected output"

# === ANALYTICS ===
stats:
  downloads: 0
  stars: 0
  rating: 0.0
  reviews: 0

# === VERSIONING ===
version: v2.0.0
author: Claude Skills Team
contributors: []
created: 2025-10-20
updated: 2025-12-16
license: MIT

# === DISCOVERABILITY ===
tags: [architecture, data, design, engineer, engineering, senior, streaming, kafka, flink, real-time]
featured: false
verified: true
---

# Senior Data Engineer

## Core Capabilities

- **Batch Pipeline Orchestration** - Design and implement production-ready ETL/ELT pipelines with Airflow, intelligent dependency resolution, retry logic, and comprehensive monitoring
- **Real-Time Streaming** - Build event-driven streaming pipelines with Kafka, Flink, Kinesis, and Spark Streaming with exactly-once semantics and sub-second latency
- **Data Quality Management** - Comprehensive batch and streaming data quality validation covering completeness, accuracy, consistency, timeliness, and validity
- **Streaming Quality Monitoring** - Track consumer lag, data freshness, schema drift, throughput, and dead letter queue rates for streaming pipelines
- **Performance Optimization** - Analyze and optimize pipeline performance with query optimization, Spark tuning, and cost analysis recommendations


## Key Workflows

### Workflow 1: Build ETL Pipeline

**Time:** 2-4 hours

**Steps:**
1. Design pipeline architecture using Lambda, Kappa, or Medallion pattern
2. Configure YAML pipeline definition with sources, transformations, targets
3. Generate Airflow DAG with `pipeline_orchestrator.py`
4. Define data quality validation rules
5. Deploy and configure monitoring/alerting

**Expected Output:** Production-ready ETL pipeline with 99%+ success rate, automated quality checks, and comprehensive monitoring

### Workflow 2: Build Real-Time Streaming Pipeline

**Time:** 3-5 days

**Steps:**
1. Select streaming architecture (Kappa vs Lambda) based on requirements
2. Configure streaming pipeline YAML (sources, processing, sinks, quality)
3. Generate Kafka configurations with `kafka_config_generator.py`
4. Generate Flink/Spark job scaffolding with `stream_processor.py`
5. Deploy and monitor with `streaming_quality_validator.py`

**Expected Output:** Streaming pipeline processing 10K+ events/sec with P99 latency < 1s, exactly-once delivery, and real-time quality monitoring


World-class data engineering for production-grade data systems, scalable pipelines, and enterprise data platforms.

## Overview

This skill provides comprehensive expertise in data engineering fundamentals through advanced production patterns. From designing medallion architectures to implementing real-time streaming pipelines, it covers the full spectrum of modern data engineering including ETL/ELT design, data quality frameworks, pipeline orchestration, and DataOps practices.

**What This Skill Provides:**
- Production-ready pipeline templates (Airflow, Spark, dbt)
- Comprehensive data quality validation framework
- Performance optimization and cost analysis tools
- Data architecture patterns (Lambda, Kappa, Medallion)
- Complete DataOps CI/CD workflows

**Best For:**
- Building scalable data pipelines for enterprise systems
- Implementing data quality and governance frameworks
- Optimizing ETL performance and cloud costs
- Designing modern data architectures (lake, warehouse, lakehouse)
- Production ML/AI data infrastructure

## Quick Start

### Pipeline Orchestration

```bash
# Generate Airflow DAG from configuration
python scripts/pipeline_orchestrator.py --config pipeline_config.yaml --output dags/

# Validate pipeline configuration
python scripts/pipeline_orchestrator.py --config pipeline_config.yaml --validate

# Use incremental load template
python scripts/pipeline_orchestrator.py --template incremental --output dags/
```

### Data Quality Validation

```bash
# Validate CSV file with quality checks
python scripts/data_quality_validator.py --input data/sales.csv --output report.html

# Validate database table with custom rules
python scripts/data_quality_validator.py \
    --connection postgresql://user:pass@host/db \
    --table sales_transactions \
    --rules rules/sales_validation.yaml \
    --threshold 0.95
```

### Performance Optimization

```bash
# Analyze pipeline performance and get recommendations
python scripts/etl_performance_optimizer.py \
    --airflow-db postgresql://host/airflow \
    --dag-id sales_etl_pipeline \
    --days 30 \
    --optimize

# Analyze Spark job performance
python scripts/etl_performance_optimizer.py \
    --spark-history-server http://spark-history:18080 \
    --app-id app-20250115-001
```

### Real-Time Streaming

```bash
# Validate streaming pipeline configuration
python scripts/stream_processor.py --config streaming_config.yaml --validate

# Generate Kafka topic and client configurations
python scripts/kafka_config_generator.py \
    --topic user-events \
    --partitions 12 \
    --replication 3 \
    --output kafka/topics/

# Generate exactly-once producer configuration
python scripts/kafka_config_generator.py \
    --producer \
    --profile exactly-once \
    --output kafka/producer.properties

# Generate Flink job scaffolding
python scripts/stream_processor.py \
    --config streaming_config.yaml \
    --mode flink \
    --generate \
    --output flink-jobs/

# Monitor streaming quality
python scripts/streaming_quality_validator.py \
    --lag --consumer-group events-processor --threshold 10000 \
    --freshness --topic processed-events --max-latency-ms 5000 \
    --output streaming-health-report.html
```

## Core Workflows

### 1. Building Production Data Pipelines

**Steps:**
1. **Design Architecture:** Choose pattern (Lambda, Kappa, Medallion) based on requirements
2. **Configure Pipeline:** Create YAML configuration with sources, transformations, targets
3. **Generate DAG:** `python scripts/pipeline_orchestrator.py --config config.yaml`
4. **Add Quality Checks:** Define validation rules for data quality
5. **Deploy & Monitor:** Deploy to Airflow, configure alerts, track metrics

**Pipeline Patterns:** See [frameworks.md](references/frameworks.md) for Lambda Architecture, Kappa Architecture, Medallion Architecture (Bronze/Silver/Gold), and Microservices Data patterns.

**Templates:** See [templates.md](references/templates.md) for complete Airflow DAG templates, Spark job templates, dbt models, and Docker configurations.

### 2. Data Quality Management

**Steps:**
1. **Define Rules:** Create validation rules covering completeness, accuracy, consistency
2. **Run Validation:** `python scripts/data_quality_validator.py --rules rules.yaml`
3. **Review Results:** Analyze quality scores and failed checks
4. **Integrate CI/CD:** Add validation to pipeline deployment process
5. **Monitor Trends:** Track quality scores over time

**Quality Framework:** See [frameworks.md](references/frameworks.md) for complete Data Quality Framework covering all dimensions (completeness, accuracy, consistency, timeliness, validity).

**Validation Templates:** See [templates.md](references/templates.md) for validation configuration examples and Python API usage.

### 3. Data Modeling & Transformation

**Steps:**
1. **Choose Modeling Approach:** Dimensional (Kimball), Data Vault 2.0, or One Big Table
2. **Design Schema:** Define fact tables, dimensions, and relationships
3. **Implement with dbt:** Create staging, intermediate, and mart models
4. **Handle SCD:** Implement slowly changing dimension logic (Type 1/2/3)
5. **Test & Deploy:** Run dbt tests, generate documentation, deploy

**Modeling Patterns:** See [frameworks.md](references/frameworks.md) for Dimensional Modeling (Kimball), Data Vault 2.0, One Big Table (OBT), and SCD implementations.

**dbt Templates:** See [templates.md](references/templates.md) for complete dbt model templates including staging, intermediate, fact tables, and SCD Type 2 logic.

### 4. Performance Optimization

**Steps:**
1. **Profile Pipeline:** Run performance analyzer on recent pipeline executions
2. **Identify Bottlenecks:** Review execution time breakdown and slow tasks
3. **Apply Optimizations:** Implement recommendations (partitioning, indexing, batching)
4. **Tune Spark Jobs:** Optimize memory, parallelism, and shuffle settings
5. **Measure Impact:** Compare before/after metrics, track cost savings

**Optimization Strategies:** See [frameworks.md](references/frameworks.md) for performance best practices including partitioning strategies, query optimization, and Spark tuning.

**Analysis Tools:** See [tools.md](references/tools.md) for complete documentation on etl_performance_optimizer.py with query analysis and Spark tuning.

### 5. Building Real-Time Streaming Pipelines

**Steps:**
1. **Architecture Selection:** Choose Kappa (streaming-only) or Lambda (batch + streaming) architecture
2. **Configure Pipeline:** Create YAML config with sources, processing engine, sinks, quality thresholds
3. **Generate Kafka Configs:** `python scripts/kafka_config_generator.py --topic events --partitions 12`
4. **Generate Job Scaffolding:** `python scripts/stream_processor.py --mode flink --generate`
5. **Deploy Infrastructure:** Use Docker Compose for local dev, Kubernetes for production
6. **Monitor Quality:** `python scripts/streaming_quality_validator.py --lag --freshness --throughput`

**Streaming Patterns:** See [frameworks.md](references/frameworks.md) for stateful processing, stream joins, windowing, exactly-once semantics, and CDC patterns.

**Templates:** See [templates.md](references/templates.md) for Flink DataStream jobs, Kafka Streams applications, PyFlink templates, and Docker Compose configurations.

## Python Tools

### pipeline_orchestrator.py

Automated Airflow DAG generation with intelligent dependency resolution and monitoring.

**Key Features:**
- Generate production-ready DAGs from YAML configuration
- Automatic task dependency resolution
- Built-in retry logic and error handling
- Multi-source support (PostgreSQL, S3, BigQuery, Snowflake)
- Integrated quality checks and alerting

**Usage:**
```bash
# Basic DAG generation
python scripts/pipeline_orchestrator.py --config pipeline_config.yaml --output dags/

# With validation
python scripts/pipeline_orchestrator.py --config config.yaml --validate

# From template
python scripts/pipeline_orchestrator.py --template incremental --output dags/
```

**Complete Documentation:** See [tools.md](references/tools.md) for full configuration options, templates, and integration examples.

### data_quality_validator.py

Comprehensive data quality validation framework with automated checks and reporting.

**Capabilities:**
- Multi-dimensional validation (completeness, accuracy, consistency, timeliness, validity)
- Great Expectations integration
- Custom business rule validation
- HTML/PDF report generation
- Anomaly detection
- Historical trend tracking

**Usage:**
```bash
# Validate with custom rules
python scripts/data_quality_validator.py \
    --input data/sales.csv \
    --rules rules/sales_validation.yaml \
    --output report.html

# Database table validation
python scripts/data_quality_validator.py \
    --connection postgresql://host/db \
    --table sales_transactions \
    --threshold 0.95
```

**Complete Documentation:** See [tools.md](references/tools.md) for rule configuration, API usage, and integration patterns.

### etl_performance_optimizer.py

Pipeline performance analysis with actionable optimization recommendations.

**Capabilities:**
- Airflow DAG execution profiling
- Bottleneck detection and analysis
- SQL query optimization suggestions
- Spark job tuning recommendations
- Cost analysis and optimization
- Historical performance trending

**Usage:**
```bash
# Analyze Airflow DAG
python scripts/etl_performance_optimizer.py \
    --airflow-db postgresql://host/airflow \
    --dag-id sales_etl_pipeline \
    --days 30 \
    --optimize

# Spark job analysis
python scripts/etl_performance_optimizer.py \
    --spark-history-server http://spark-history:18080 \
    --app-id app-20250115-001
```

**Complete Documentation:** See [tools.md](references/tools.md) for profiling options, optimization strategies, and cost analysis.

### stream_processor.py

Streaming pipeline configuration generator and validator for Kafka, Flink, and Kinesis.

**Capabilities:**
- Multi-platform support (Kafka, Flink, Kinesis, Spark Streaming)
- Configuration validation with best practice checks
- Flink/Spark job scaffolding generation
- Kafka topic configuration generation
- Docker Compose for local streaming stacks
- Exactly-once semantics configuration

**Usage:**
```bash
# Validate configuration
python scripts/stream_processor.py --config streaming_config.yaml --validate

# Generate Kafka configurations
python scripts/stream_processor.py --config streaming_config.yaml --mode kafka --generate

# Generate Flink job scaffolding
python scripts/stream_processor.py --config streaming_config.yaml --mode flink --generate --output flink-jobs/

# Generate Docker Compose for local development
python scripts/stream_processor.py --config streaming_config.yaml --mode docker --generate
```

**Complete Documentation:** See [tools.md](references/tools.md) for configuration format, validation checks, and generated outputs.

### streaming_quality_validator.py

Real-time streaming data quality monitoring with comprehensive health scoring.

**Capabilities:**
- Consumer lag monitoring with thresholds
- Data freshness validation (P50/P95/P99 latency)
- Schema drift detection
- Throughput analysis (events/sec, bytes/sec)
- Dead letter queue rate monitoring
- Overall quality scoring with recommendations
- Prometheus metrics export

**Usage:**
```bash
# Monitor consumer lag
python scripts/streaming_quality_validator.py \
    --lag --consumer-group events-processor --threshold 10000

# Monitor data freshness
python scripts/streaming_quality_validator.py \
    --freshness --topic processed-events --max-latency-ms 5000

# Full quality validation
python scripts/streaming_quality_validator.py \
    --lag --freshness --throughput --dlq \
    --output streaming-health-report.html
```

**Complete Documentation:** See [tools.md](references/tools.md) for all monitoring dimensions and integration patterns.

### kafka_config_generator.py

Production-grade Kafka configuration generator with performance and security profiles.

**Capabilities:**
- Topic configuration (partitions, replication, retention, compaction)
- Producer profiles (high-throughput, exactly-once, low-latency, ordered)
- Consumer profiles (exactly-once, high-throughput, batch)
- Kafka Streams configuration with state store tuning
- Security configuration (SASL-PLAIN, SASL-SCRAM, mTLS)
- Kafka Connect source/sink configurations
- Multiple output formats (properties, YAML, JSON)

**Usage:**
```bash
# Generate topic configuration
python scripts/kafka_config_generator.py \
    --topic user-events --partitions 12 --replication 3 --retention-hours 168

# Generate exactly-once producer
python scripts/kafka_config_generator.py \
    --producer --profile exactly-once --transactional-id producer-001

# Generate Kafka Streams config
python scripts/kafka_config_generator.py \
    --streams --application-id events-processor --exactly-once
```

**Complete Documentation:** See [tools.md](references/tools.md) for all profiles, security options, and Connect configurations.

## Reference Documentation

### Frameworks ([frameworks.md](references/frameworks.md))

Comprehensive data engineering frameworks and patterns:
- **Architecture Patterns:** Lambda, Kappa, Medallion, Microservices data architecture
- **Data Modeling:** Dimensional (Kimball), Data Vault 2.0, One Big Table
- **ETL/ELT Patterns:** Full load, incremental load, CDC, SCD, idempotent pipelines
- **Data Quality:** Complete framework covering all quality dimensions
- **DataOps:** CI/CD for data pipelines, testing strategies, monitoring
- **Orchestration:** Airflow DAG patterns, backfill strategies
- **Real-Time Streaming:** Stateful processing, stream joins, windowing strategies, exactly-once semantics, event time processing, watermarks, backpressure, Apache Flink patterns, AWS Kinesis patterns, CDC for streaming
- **Governance:** Data catalog, lineage tracking, access control

### Templates ([templates.md](references/templates.md))

Production-ready code templates and examples:
- **Airflow DAGs:** Complete ETL DAG, incremental load, dynamic task generation
- **Spark Jobs:** Batch processing, streaming, optimized configurations
- **dbt Models:** Staging, intermediate, fact tables, dimensions with SCD Type 2
- **SQL Patterns:** Incremental merge (upsert), deduplication, date spine, window functions
- **Python Pipelines:** Data quality validation class, retry decorators, error handling
- **Real-Time Streaming:** Apache Flink DataStream jobs (Java), Kafka Streams applications, PyFlink jobs, AWS Kinesis consumers, Docker Compose for streaming stack
- **Kafka Configs:** Producer/consumer properties templates, topic configurations, security configurations
- **Docker:** Dockerfiles for data pipelines, Docker Compose for local development including streaming stack (Kafka, Flink, Schema Registry)
- **Configuration:** dbt project config, Spark configuration, Airflow variables, streaming pipeline YAML
- **Testing:** pytest fixtures, integration tests, data quality tests

### Tools ([tools.md](references/tools.md))

Python automation tool documentation:
- **pipeline_orchestrator.py:** Complete usage guide, configuration format, DAG templates
- **data_quality_validator.py:** Validation rules, dimension checks, Great Expectations integration
- **etl_performance_optimizer.py:** Performance analysis, query optimization, Spark tuning
- **stream_processor.py:** Streaming pipeline configuration, validation, job scaffolding generation
- **streaming_quality_validator.py:** Consumer lag, data freshness, schema drift, throughput monitoring
- **kafka_config_generator.py:** Topic, producer, consumer, Kafka Streams, and Connect configurations
- **Integration Patterns:** Airflow, dbt, CI/CD, monitoring systems, Prometheus
- **Best Practices:** Configuration management, error handling, performance, monitoring, streaming quality

## Tech Stack

**Core Technologies:**
- **Languages:** Python 3.8+, SQL, Scala (Spark), Java (Flink)
- **Orchestration:** Apache Airflow, Prefect, Dagster
- **Batch Processing:** Apache Spark, dbt, Pandas
- **Stream Processing:** Apache Kafka, Apache Flink, Kafka Streams, Spark Structured Streaming, AWS Kinesis
- **Storage:** PostgreSQL, BigQuery, Snowflake, Redshift, S3, GCS
- **Schema Management:** Confluent Schema Registry, AWS Glue Schema Registry
- **Containerization:** Docker, Kubernetes
- **Monitoring:** Datadog, Prometheus, Grafana, Kafka UI

**Data Platforms:**
- **Cloud Data Warehouses:** Snowflake, BigQuery, Redshift
- **Data Lakes:** Delta Lake, Apache Iceberg, Apache Hudi
- **Streaming Platforms:** Apache Kafka, AWS Kinesis, Google Pub/Sub, Azure Event Hubs
- **Stream Processing Engines:** Apache Flink, Kafka Streams, Spark Structured Streaming
- **Workflow:** Airflow, Prefect, Dagster

## Integration Points

This skill integrates with:
- **Orchestration:** Airflow, Prefect, Dagster for workflow management
- **Transformation:** dbt for SQL transformations and testing
- **Quality:** Great Expectations for data validation
- **Monitoring:** Datadog, Prometheus for pipeline monitoring
- **BI Tools:** Looker, Tableau, Power BI for analytics
- **ML Platforms:** MLflow, Kubeflow for ML pipeline integration
- **Version Control:** Git for pipeline code and configuration

See [tools.md](references/tools.md) for detailed integration patterns and examples.

## Best Practices

**Pipeline Design:**
1. Idempotent operations for safe reruns
2. Incremental processing where possible
3. Clear data lineage and documentation
4. Comprehensive error handling
5. Automated recovery mechanisms

**Data Quality:**
1. Define quality rules early
2. Validate at every pipeline stage
3. Automate quality monitoring
4. Track quality trends over time
5. Block bad data from downstream

**Performance:**
1. Partition large tables by date/region
2. Use columnar formats (Parquet, ORC)
3. Leverage predicate pushdown
4. Optimize for your query patterns
5. Monitor and tune regularly

**Operations:**
1. Version control everything
2. Automate testing and deployment
3. Implement comprehensive monitoring
4. Document runbooks for incidents
5. Regular performance reviews

## Performance Targets

**Batch Pipeline Execution:**
- P50 latency: < 5 minutes (hourly pipelines)
- P95 latency: < 15 minutes
- Success rate: > 99%
- Data freshness: < 1 hour behind source

**Streaming Pipeline Execution:**
- Throughput: 10K+ events/second sustained
- End-to-end latency: P99 < 1 second
- Consumer lag: < 10K records behind
- Exactly-once delivery: Zero duplicates or losses

**Data Quality (Batch):**
- Quality score: > 95%
- Completeness: > 99%
- Timeliness: < 2 hours data lag
- Zero critical failures

**Streaming Quality:**
- Data freshness: P95 < 5 minutes from event generation
- Late data rate: < 5% outside watermark window
- Dead letter queue rate: < 1%
- Schema compatibility: 100% backward/forward compatible changes

**Cost Efficiency:**
- Cost per GB processed: < $0.10
- Cloud cost trend: Stable or decreasing
- Resource utilization: > 70%

## Resources

- **Frameworks Guide:** [references/frameworks.md](references/frameworks.md)
- **Code Templates:** [references/templates.md](references/templates.md)
- **Tool Documentation:** [references/tools.md](references/tools.md)
- **Python Scripts:** `scripts/` directory

---

**Version:** 2.0.0
**Last Updated:** December 16, 2025
**Documentation Structure:** Progressive disclosure with comprehensive references
**Streaming Enhancement:** Task #8 - Real-time streaming capabilities added

Overview

This skill delivers world-class data engineering expertise for building scalable batch and streaming data pipelines, ETL/ELT systems, and resilient data infrastructure. It provides production-ready templates, data quality frameworks, performance optimization tools, and orchestration patterns for Airflow, Spark, Kafka, Flink, dbt, and modern cloud platforms. Use it to design architectures, generate pipeline scaffolding, and enforce DataOps and governance at scale.

How this skill works

The skill inspects pipeline requirements and generates runnable artifacts: Airflow DAGs, Spark/Flink job scaffolding, Kafka topic and client configs, and dbt models from YAML definitions. It runs multi-dimensional data quality checks (batch and streaming), profiles performance, and outputs optimization recommendations and monitoring configs. For streaming, it validates configs, generates exactly-once producers, and produces health reports with lag, freshness, throughput, and schema-drift metrics.

When to use it

  • Design or evaluate data architectures (Lambda, Kappa, Medallion) for new projects
  • Generate production Airflow DAGs, dbt models, or streaming job scaffolding from config
  • Implement batch and streaming data quality validation and monitoring
  • Optimize ETL/Spark jobs for performance and cost savings
  • Build or harden real-time pipelines processing high event rates with low latency

Best practices

  • Start with a clear YAML config describing sources, transforms, and sinks to enable repeatable generation
  • Adopt medallion or Kappa patterns based on latency and reprocessing needs
  • Embed data quality checks into CI/CD so validation runs on every deployment
  • Profile workloads before tuning—use historical metrics to guide partitioning and Spark settings
  • Expose metrics to Prometheus and configure alerts for lag, freshness, and DLQ rates

Example use cases

  • Generate an Airflow DAG and automated quality checks from pipeline_config.yaml for a nightly ETL job
  • Create Kafka topics and exactly-once producer configs and scaffold a Flink job for a 10K+ events/sec stream
  • Run data_quality_validator.py against a transaction table and integrate results into deployment gates
  • Use etl_performance_optimizer.py to profile a slow DAG, get partitioning and query suggestions, and estimate cost savings
  • Deploy streaming_quality_validator.py to monitor consumer lag, schema drift, and produce a daily health report

FAQ

What outputs will I get from a YAML pipeline definition?

You get production-ready Airflow DAGs, quality-check definitions, monitoring configs, and optional dbt model templates ready for deployment.

Can this skill handle both batch and streaming quality checks?

Yes. It runs multidimensional batch validations and streaming health checks (lag, freshness, throughput, schema drift) and exports metrics for alerting.