
data-engineer skill

/skills/data-engineer

This skill helps you design and optimize ETL pipelines, data warehouses, and streaming architectures with best practices for reliability and scalability.

npx playbooks add skill shaul1991/shaul-agents-plugin --skill data-engineer

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md (727 B)
---
name: data-engineer
description: Data Engineer Agent. Responsible for building ETL pipelines, data warehouses, and data lakes.
allowed-tools: Read, Write, Edit, Bash, Grep, Glob
---

# Data Engineer Agent

## Role
Responsible for building data pipelines and data infrastructure.

## Responsibilities
- ETL/ELT pipelines
- Data warehouses and data lakes
- Stream processing
- Data quality management

## Tech Stack
| Category | Tools |
|----------|-------|
| Orchestration | Airflow, Dagster |
| Transformation | dbt, Spark |
| Streaming | Kafka, Flink |
| Storage | BigQuery, Snowflake |

## Output Locations
- Pipelines: `data/pipelines/`
- Schemas: `data/schemas/`

Overview

This skill provides a Data Engineer agent focused on building and maintaining reliable ETL/ELT pipelines, data warehouses, and data lakes. It covers batch and streaming processing, schema management, and data quality controls to support analytics and ML workloads, and it follows orchestration, transformation, and storage best practices to keep pipelines scalable and observable.

How this skill works

The agent inspects source systems, designs and deploys ETL/ELT flows, and configures orchestration using tools like Airflow or Dagster. It implements transformations with dbt or Spark, integrates streaming with Kafka or Flink, and provisions storage in BigQuery or Snowflake. The agent validates schemas, enforces data quality rules, and surfaces monitoring and lineage information for ongoing operations.
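To make the orchestration step concrete, here is a minimal sketch of the kind of DAG the agent might produce, assuming Airflow 2.4+ is the chosen orchestrator. The `orders_daily` DAG id and the extract/transform/load callables are hypothetical placeholders, not part of this skill; real pipeline code would live under `data/pipelines/`.

```python
# Minimal Airflow DAG sketch (assumption: Airflow 2.4+ as the orchestrator).
# The task callables are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders(**context):
    """Pull raw rows from the source system for the scheduled interval."""
    ...


def transform_orders(**context):
    """Clean and reshape the extracted rows."""
    ...


def load_orders(**context):
    """Write the transformed rows into the warehouse."""
    ...


with DAG(
    dag_id="orders_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_orders)
    transform = PythonOperator(task_id="transform", python_callable=transform_orders)
    load = PythonOperator(task_id="load", python_callable=load_orders)

    # Explicit dependency chain: retries and scheduling are handled by Airflow.
    extract >> transform >> load
```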

When to use it

  • Building or refactoring ETL/ELT pipelines for analytics or ML
  • Moving from ad-hoc scripts to orchestrated, observable workflows
  • Implementing streaming ingestion or real-time processing
  • Designing or migrating a data warehouse or data lake
  • Introducing schema governance and automated data quality checks

Best practices

  • Design idempotent, versioned pipelines that can be replayed and tested (see the sketch after this list)
  • Use orchestration for scheduling, retries, and dependency management (Airflow/Dagster)
  • Implement transformations with modular, testable units (dbt/Spark) and maintain clear schemas
  • Apply monitoring, alerts, and lineage to detect regressions and upstream failures
  • Enforce data quality rules early and automate schema evolution processes
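As a concrete illustration of the first point, below is a sketch of an idempotent, partition-scoped load step in plain Python. The `data/pipelines/orders` subpath, the JSONL format, and the `load_partition` helper are illustrative assumptions, not something the skill prescribes.

```python
# Sketch of an idempotent, partition-scoped load step. Each run owns exactly
# one date partition and fully overwrites it, so replays and retries converge
# on the same result.
import json
import shutil
from datetime import date
from pathlib import Path

PIPELINE_OUTPUT = Path("data/pipelines/orders")  # hypothetical output subpath


def load_partition(rows: list[dict], partition_date: date) -> Path:
    """Overwrite the partition for `partition_date` with `rows`."""
    partition_dir = PIPELINE_OUTPUT / f"dt={partition_date.isoformat()}"

    # Delete-then-write makes the step safe to re-run: a retry after a
    # partial failure leaves no duplicate or stale records behind.
    if partition_dir.exists():
        shutil.rmtree(partition_dir)
    partition_dir.mkdir(parents=True)

    out_file = partition_dir / "part-000.jsonl"
    with out_file.open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")
    return out_file
```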

Example use cases

  • Batch ETL that ingests operational databases into a Snowflake warehouse for BI dashboards
  • Real-time event stream processing with Kafka + Flink to power user-facing features
  • dbt-based transformation layer that standardizes metrics and enforces column-level tests
  • Migrating on-premise data lakes to BigQuery with partitioning and access controls
  • Setting up data quality pipelines that generate alerts and block downstream jobs on failures (see the sketch after this list)
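For the last use case, here is a hedged sketch of a blocking quality gate: the check raises on failure, so in an orchestrator such as Airflow or Dagster the task fails and downstream tasks are not run by default. The `order_id` column, the thresholds, and the `check_batch` helper are hypothetical.

```python
# Sketch of a data quality gate that blocks downstream jobs by raising.
class DataQualityError(Exception):
    """Raised when a batch fails its quality checks."""


def check_batch(rows: list[dict], min_rows: int = 1, max_null_rate: float = 0.05) -> None:
    """Validate a batch before it is loaded; raise to stop the pipeline."""
    if len(rows) < min_rows:
        raise DataQualityError(f"expected at least {min_rows} rows, got {len(rows)}")

    null_ids = sum(1 for row in rows if row.get("order_id") is None)
    null_rate = null_ids / len(rows)
    if null_rate > max_null_rate:
        raise DataQualityError(
            f"order_id null rate {null_rate:.1%} exceeds limit {max_null_rate:.1%}"
        )


# Usage inside a pipeline step: check first, load only if the check passes.
# check_batch(extracted_rows)
```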

FAQ

What orchestration tools does the agent prefer?

It works with Airflow and Dagster, selecting based on team needs, complexity, and existing infrastructure.

How are data quality and schema changes handled?

The agent implements automated tests, monitoring, and controlled schema migration processes to validate changes before promotion.
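As one possible illustration (not the skill's prescribed mechanism), a pre-promotion check could diff an incoming table definition against the canonical schema stored under `data/schemas/` and fail on breaking changes. The JSON schema format and the helper names below are assumptions.

```python
# Sketch of a schema compatibility check against stored canonical schemas.
# Additive changes (new columns) are allowed; removed or retyped columns
# are treated as breaking and block promotion.
import json
from pathlib import Path

SCHEMA_DIR = Path("data/schemas")


def diff_schema(table: str, incoming: dict[str, str]) -> dict[str, list[str]]:
    """Return added/removed/retyped columns versus the stored schema."""
    stored: dict[str, str] = json.loads((SCHEMA_DIR / f"{table}.json").read_text())
    return {
        "added": [c for c in incoming if c not in stored],
        "removed": [c for c in stored if c not in incoming],
        "retyped": [c for c in incoming if c in stored and incoming[c] != stored[c]],
    }


def assert_backward_compatible(table: str, incoming: dict[str, str]) -> None:
    """Fail promotion if the change would break downstream consumers."""
    diff = diff_schema(table, incoming)
    if diff["removed"] or diff["retyped"]:
        raise ValueError(f"breaking schema change for {table}: {diff}")
```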

Where are pipeline artifacts and schema definitions placed?

Pipeline configurations and code are organized under `data/pipelines/`, and canonical schema definitions are maintained under `data/schemas/`, to support discoverability and governance.