
data-engineer skill

/skills/agents/data/data-engineer

This skill helps you design scalable ETL pipelines and analytics infrastructure with Airflow, Spark, and Kafka for reliable data processing.

npx playbooks add skill sidetoolco/org-charts --skill data-engineer

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (1.2 KB)
---
name: data-engineer
description: Build ETL pipelines, data warehouses, and streaming architectures. Implements Spark jobs, Airflow DAGs, and Kafka streams. Use PROACTIVELY for data pipeline design or analytics infrastructure.
license: Apache-2.0
metadata:
  author: edescobar
  version: "1.0"
  model-preference: sonnet
---

# Data Engineer

You are a data engineer specializing in scalable data pipelines and analytics infrastructure.

## Focus Areas
- ETL/ELT pipeline design with Airflow
- Spark job optimization and partitioning
- Streaming data with Kafka/Kinesis
- Data warehouse modeling (star/snowflake schemas)
- Data quality monitoring and validation
- Cost optimization for cloud data services

## Approach
1. Weigh schema-on-read vs schema-on-write tradeoffs
2. Prefer incremental processing over full refreshes
3. Design idempotent operations for reliability
4. Track data lineage and keep documentation current
5. Monitor data quality metrics

## Output
- Airflow DAG with error handling
- Spark job with optimization techniques
- Data warehouse schema design
- Data quality check implementations
- Monitoring and alerting configuration
- Cost estimation for data volume

Focus on scalability and maintainability. Include data governance considerations.

Overview

This skill builds scalable, maintainable data pipelines and analytics infrastructure. It implements ETL/ELT pipelines, Spark jobs, Airflow DAGs, and Kafka-based streaming with a focus on operational reliability, cost control, and data governance. Use it proactively to design production-ready pipelines and analytics platforms.

How this skill works

I inspect your data sources, volume, SLAs, and downstream analytics needs to recommend an architecture (batch, micro-batch, or streaming). I produce concrete artifacts: Airflow DAGs with retries and alerting, optimized Spark jobs with partitioning and caching strategies, Kafka stream layouts, and data warehouse schema designs. I also provide data quality checks, lineage suggestions, and cost estimates tailored to cloud services and expected data volume.
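For illustration, here is a minimal sketch of an Airflow DAG with retries and a failure callback, assuming a recent Airflow 2.x release (the `schedule` argument replaced `schedule_interval` in 2.4). The dag_id, the extract/transform/load callables, and the `notify_on_failure` hook are hypothetical placeholders, not part of this skill's fixed output.

```python
# Minimal sketch of an Airflow 2.x DAG with retries and alerting.
# The dag_id, task callables, and notify_on_failure hook are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_on_failure(context):
    # Placeholder: forward the failed task details to your alerting channel.
    print(f"Task failed: {context['task_instance'].task_id}")


def extract(**_):
    ...  # pull incremental records from the source API


def transform(**_):
    ...  # apply incremental transforms (e.g. hand off to a Spark job)


def load(**_):
    ...  # load into the warehouse, idempotently


default_args = {
    "owner": "data-eng",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_on_failure,
}

with DAG(
    dag_id="example_incremental_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```

In a real pipeline the callback would post to Slack, PagerDuty, or a similar channel, and each callable would delegate to a Spark job or SQL transform rather than a stub.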

When to use it

  • Designing a new ETL/ELT pipeline for analytics or reporting
  • Migrating batch jobs to Spark or optimizing existing Spark workloads
  • Implementing streaming ingestion with Kafka or Kinesis
  • Modeling a data warehouse using star or snowflake schemas
  • Adding data quality, lineage, and monitoring for production pipelines

Best practices

  • Prefer incremental processing and idempotent operations to minimize recompute and ensure reliability (see the sketch after this list)
  • Choose schema-on-read for flexible ingestion and schema-on-write when strict consistency is required
  • Partition and bucket data according to query patterns to reduce IO and speed up Spark jobs
  • Embed data quality checks and lineage tracking in DAGs; fail fast and surface meaningful alerts
  • Design for cost: right-size clusters, use spot instances where appropriate, and limit retention for raw layers
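
As a concrete example of the first practice, here is a minimal PySpark sketch of an idempotent, incremental daily load: each run processes only one date partition and overwrites exactly that partition, so reruns produce no duplicates. The paths, table layout, and `ds` partition column are hypothetical.

```python
# Minimal sketch: idempotent, incremental daily load with PySpark.
# Source/target paths and the ds partition column are hypothetical.
import sys

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("incremental_orders_load")
    # Overwrite only the partitions written by this run, not the whole table.
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

ds = sys.argv[1]  # e.g. "2024-06-01", typically passed in by the scheduler

# Read only the slice for this run (incremental, not a full refresh).
orders = spark.read.parquet("s3://raw/orders/").where(F.col("order_date") == ds)

cleaned = (
    orders.dropDuplicates(["order_id"])  # dedupe keeps reruns idempotent
    .withColumn("ds", F.lit(ds))         # explicit partition column
)

# Rerunning the job for the same ds replaces the same partition: no duplicates.
cleaned.write.mode("overwrite").partitionBy("ds").parquet("s3://curated/orders/")
```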

Example use cases

  • Airflow DAG that extracts from APIs, performs incremental Spark transforms, loads into a cloud warehouse, and triggers alerts on SLA breach
  • Optimized Spark job that reduces shuffle and memory pressure using partition pruning and broadcast joins (sketched below)
  • Kafka stream architecture that ingests clickstream data, applies stream enrichment, and writes to a hot store for real-time dashboards
  • Data warehouse design for retail analytics: conformed dimensions, fact tables, and aggregated summary tables
  • Data quality pipeline that validates schema, checks value ranges and null rates, records metrics, and integrates with monitoring
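
A minimal sketch of the Spark optimization use case above: filtering on the partition column enables partition pruning at planning time, and broadcasting the small dimension table avoids shuffling the large fact table. Paths and column names are hypothetical.

```python
# Minimal sketch: partition pruning + broadcast join in PySpark.
# Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("daily_sales_rollup").getOrCreate()

# Filtering on the partition column (ds) lets Spark prune unneeded partitions,
# so only one day's files are read.
facts = spark.read.parquet("s3://curated/sales/").where(col("ds") == "2024-06-01")

# The stores dimension is small; broadcasting it avoids shuffling the fact table.
stores = spark.read.parquet("s3://curated/dim_stores/")

daily_by_region = (
    facts.join(broadcast(stores), "store_id")
    .groupBy("region")
    .agg({"amount": "sum"})
)

daily_by_region.write.mode("overwrite").parquet(
    "s3://marts/daily_sales_by_region/ds=2024-06-01/"
)
```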

FAQ

How do you decide between batch and streaming?

I weigh latency requirements, data volume, and operational complexity: use streaming when sub-minute freshness is needed, and prefer batch or micro-batch when simpler, more cost-effective processing is enough.
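
When micro-batch is the right middle ground, one common pattern is a Spark Structured Streaming job reading from Kafka with a fixed trigger interval. A minimal sketch follows, with a hypothetical broker, topic, and output path; it assumes the Kafka connector package is available on the cluster.

```python
# Minimal sketch: micro-batch ingestion from Kafka with Spark Structured Streaming.
# Broker, topic, and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clickstream_microbatch").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clickstream")
    .option("startingOffsets", "latest")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
)

query = (
    events.writeStream
    .trigger(processingTime="5 minutes")  # micro-batch every 5 minutes
    .format("parquet")
    .option("path", "s3://raw/clickstream/")
    .option("checkpointLocation", "s3://checkpoints/clickstream/")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```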

What cost controls do you recommend for cloud data pipelines?

Right-size compute, use autoscaling and spot/preemptible instances, limit raw data retention, partition data to reduce query cost, and estimate monthly costs based on expected throughput.
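
For the estimation step, a back-of-envelope sketch like the one below is usually enough to start the conversation; every unit price in it is a hypothetical placeholder, so substitute your provider's actual rates and your own volume assumptions.

```python
# Back-of-envelope monthly cost estimate for a pipeline.
# All unit prices are hypothetical placeholders; plug in your provider's rates.
daily_ingest_gb = 200                 # expected raw volume per day
retention_days = 90                   # raw-layer retention
compression_ratio = 0.4               # stored size vs raw size
storage_price_per_gb_month = 0.023    # placeholder $/GB-month
compute_hours_per_day = 6             # cluster hours per daily run
compute_price_per_hour = 2.50         # placeholder $/cluster-hour

stored_gb = daily_ingest_gb * retention_days * compression_ratio
storage_cost = stored_gb * storage_price_per_gb_month
compute_cost = compute_hours_per_day * compute_price_per_hour * 30

print(f"storage: ${storage_cost:,.0f}/mo, compute: ${compute_cost:,.0f}/mo, "
      f"total: ${storage_cost + compute_cost:,.0f}/mo")
```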