
senior-data-engineer skill

/.cursor/skills/senior-data-engineer

This skill helps design scalable data pipelines and governance with Python, SQL, Spark, and Airflow for robust, production-grade analytics.

npx playbooks add skill nilecui/skillsbase --skill senior-data-engineer

Review the files below or copy the command above to add this skill to your agents.

Files (7)
SKILL.md
---
name: senior-data-engineer
description: World-class data engineering skill for building scalable data pipelines, ETL/ELT systems, and data infrastructure. Expertise in Python, SQL, Spark, Airflow, dbt, Kafka, and modern data stack. Includes data modeling, pipeline orchestration, data quality, and DataOps. Use when designing data architectures, building data pipelines, optimizing data workflows, or implementing data governance.
---

# Senior Data Engineer

World-class senior data engineer skill for production-grade AI/ML/Data systems.

## Quick Start

### Main Capabilities

```bash
# Orchestrate a pipeline run over an input directory
python scripts/pipeline_orchestrator.py --input data/ --output results/

# Analyze data quality for a target project
python scripts/data_quality_validator.py --target project/ --analyze

# Tune ETL performance from a config file and deploy the result
python scripts/etl_performance_optimizer.py --config config.yaml --deploy
```

## Core Expertise

This skill covers world-class capabilities in:

- Advanced production patterns and architectures
- Scalable system design and implementation
- Performance optimization at scale
- MLOps and DataOps best practices
- Real-time processing and inference
- Distributed computing frameworks
- Model deployment and monitoring
- Security and compliance
- Cost optimization
- Team leadership and mentoring

## Tech Stack

**Languages:** Python, SQL, R, Scala, Go
**ML Frameworks:** PyTorch, TensorFlow, Scikit-learn, XGBoost
**Data Tools:** Spark, Airflow, dbt, Kafka, Databricks
**LLM Frameworks:** LangChain, LlamaIndex, DSPy
**Deployment:** Docker, Kubernetes, AWS/GCP/Azure
**Monitoring:** MLflow, Weights & Biases, Prometheus
**Databases:** PostgreSQL, BigQuery, Snowflake, Pinecone

## Reference Documentation

### 1. Data Pipeline Architecture

Comprehensive guide available in `references/data_pipeline_architecture.md` covering:

- Advanced patterns and best practices
- Production implementation strategies
- Performance optimization techniques
- Scalability considerations
- Security and compliance
- Real-world case studies

### 2. Data Modeling Patterns

Complete workflow documentation in `references/data_modeling_patterns.md` including:

- Step-by-step processes
- Architecture design patterns
- Tool integration guides
- Performance tuning strategies
- Troubleshooting procedures

### 3. DataOps Best Practices

Technical reference guide in `references/dataops_best_practices.md` with:

- System design principles
- Implementation examples
- Configuration best practices
- Deployment strategies
- Monitoring and observability

## Production Patterns

### Pattern 1: Scalable Data Processing

Enterprise-scale data processing with distributed computing (see the sketch after the list):

- Horizontal scaling architecture
- Fault-tolerant design
- Real-time and batch processing
- Data quality validation
- Performance monitoring
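
The following is a minimal PySpark sketch of this pattern, combining a distributed aggregation with a simple quality gate and an idempotent write. The bucket paths, column names, and threshold are illustrative assumptions, not part of the skill's scripts.

```python
# Minimal sketch: distributed batch aggregation with a simple data-quality gate
# and an idempotent, partition-overwrite write. All paths/columns are examples.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("orders-daily-batch")
    .config("spark.sql.shuffle.partitions", "200")  # size to the cluster, not the default
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")  # overwrite only touched partitions
    .getOrCreate()
)

orders = spark.read.parquet("s3://example-bucket/raw/orders/")  # hypothetical source

# Quality gate: fail fast instead of publishing bad data downstream.
total = orders.count()
bad = orders.filter(F.col("order_id").isNull()).count()
if total == 0 or bad / total > 0.01:
    raise ValueError(f"quality gate failed: {bad}/{total} rows missing order_id")

daily = (
    orders.groupBy("order_date")
    .agg(F.sum("amount").alias("gross_revenue"), F.count("*").alias("order_count"))
)

# Re-running the job for a date rewrites only that date's partition (idempotent).
daily.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://example-bucket/marts/orders_daily/"  # hypothetical sink
)
```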

### Pattern 2: ML Model Deployment

Production ML system with high availability (a drift-detection sketch follows the list):

- Model serving with low latency
- A/B testing infrastructure
- Feature store integration
- Model monitoring and drift detection
- Automated retraining pipelines
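
As a hedged illustration of the monitoring and drift-detection piece, the sketch below compares a training reference sample against recent production data with a two-sample Kolmogorov-Smirnov test; the column name and p-value threshold are assumptions made for the example.

```python
# Sketch: per-feature drift check using a two-sample Kolmogorov-Smirnov test.
# Column names and the threshold are illustrative.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def detect_drift(reference: pd.DataFrame, live: pd.DataFrame,
                 columns: list[str], p_threshold: float = 0.01) -> dict[str, bool]:
    """Return {column: True} for features whose distribution has shifted."""
    drifted = {}
    for col in columns:
        _, p_value = ks_2samp(reference[col].dropna(), live[col].dropna())
        drifted[col] = p_value < p_threshold  # small p-value -> distributions differ
    return drifted

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = pd.DataFrame({"latency_ms": rng.normal(50, 5, 5_000)})
    live = pd.DataFrame({"latency_ms": rng.normal(65, 5, 5_000)})  # shifted on purpose
    print(detect_drift(reference, live, ["latency_ms"]))  # {'latency_ms': True}
```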

### Pattern 3: Real-Time Inference

High-throughput inference system (a micro-batching sketch follows the list):

- Batching and caching strategies
- Load balancing
- Auto-scaling
- Latency optimization
- Cost optimization
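
A minimal asyncio sketch of the micro-batching idea is shown below; `predict_batch` is a hypothetical stand-in for the real model call, and a production service would sit behind a proper web framework, cache, and load balancer.

```python
# Sketch: micro-batching wrapper that trades a few milliseconds of latency for
# much higher throughput. `predict_batch` is a placeholder for a real model call.
import asyncio
from typing import Any

MAX_BATCH = 32
MAX_WAIT_S = 0.005  # 5 ms batching window

def predict_batch(payloads: list[Any]) -> list[Any]:
    # Placeholder scoring; a real service would run one model call per batch.
    return [{"score": (hash(str(p)) % 100) / 100} for p in payloads]

async def handle(queue: asyncio.Queue, payload: Any) -> Any:
    fut = asyncio.get_running_loop().create_future()
    await queue.put((payload, fut))
    return await fut  # resolved by the batcher below

async def batcher(queue: asyncio.Queue) -> None:
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]
        deadline = loop.time() + MAX_WAIT_S
        # Keep pulling requests until the batch is full or the window closes.
        while len(batch) < MAX_BATCH and (timeout := deadline - loop.time()) > 0:
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        for (_, fut), result in zip(batch, predict_batch([p for p, _ in batch])):
            fut.set_result(result)

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(batcher(queue))
    results = await asyncio.gather(*(handle(queue, {"user_id": i}) for i in range(100)))
    print(f"served {len(results)} predictions")
    worker.cancel()

if __name__ == "__main__":
    asyncio.run(main())
```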

## Best Practices

### Development

- Test-driven development
- Code reviews and pair programming
- Documentation as code
- Version control everything
- Continuous integration

### Production

- Monitor everything critical
- Automate deployments
- Feature flags for releases
- Canary deployments
- Comprehensive logging (see the structured-logging sketch below)
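
For the logging point, here is a small sketch of structured (JSON) logs written to stdout for a log aggregator to pick up; the logger name and fields are illustrative.

```python
# Sketch: structured JSON logs so pipeline events can be filtered and aggregated.
# Field names and the logger name are illustrative.
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        payload.update(getattr(record, "context", {}))  # fields passed via extra=
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("pipeline")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("load finished", extra={"context": {"table": "orders_daily", "rows": 120_345}})
```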

### Team Leadership

- Mentor junior engineers
- Drive technical decisions
- Establish coding standards
- Foster learning culture
- Cross-functional collaboration

## Performance Targets

**Latency:**
- P50: < 50ms
- P95: < 100ms
- P99: < 200ms

**Throughput:**
- Requests/second: > 1000
- Concurrent users: > 10,000

**Availability:**
- Uptime: 99.9%
- Error rate: < 0.1%

## Security & Compliance

- Authentication & authorization
- Data encryption (at rest & in transit)
- PII handling and anonymization (see the sketch below)
- GDPR/CCPA compliance
- Regular security audits
- Vulnerability management
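
A hedged sketch of the PII-handling step is shown below: deterministic pseudonymization of identifiers plus masking of emails in free text. The salt handling and regex are illustrative, not a compliance recipe.

```python
# Sketch: pseudonymize identifiers and redact emails before data lands in
# analytics tables. The salt and record are placeholders.
import hashlib
import hmac
import re

SECRET_SALT = b"rotate-me-and-store-in-a-secrets-manager"  # placeholder value

def pseudonymize(value: str) -> str:
    """Stable, non-reversible token so joins still work across tables."""
    return hmac.new(SECRET_SALT, value.encode("utf-8"), hashlib.sha256).hexdigest()

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_emails(text: str) -> str:
    """Redact raw email addresses embedded in free-text fields."""
    return EMAIL_RE.sub("[email redacted]", text)

record = {"user_id": "alice@example.com", "note": "Contact alice@example.com for refund"}
clean = {"user_id": pseudonymize(record["user_id"]), "note": mask_emails(record["note"])}
print(clean)
```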

## Common Commands

```bash
# Development
python -m pytest tests/ -v --cov
python -m black src/
python -m pylint src/

# Training
python scripts/train.py --config prod.yaml
python scripts/evaluate.py --model best.pth

# Deployment
docker build -t service:v1 .
kubectl apply -f k8s/
helm upgrade service ./charts/

# Monitoring
kubectl logs -f deployment/service
python scripts/health_check.py
```

## Resources

- Advanced Patterns: `references/data_pipeline_architecture.md`
- Implementation Guide: `references/data_modeling_patterns.md`
- Technical Reference: `references/dataops_best_practices.md`
- Automation Scripts: `scripts/` directory

## Senior-Level Responsibilities

As a world-class senior professional:

1. **Technical Leadership**
   - Drive architectural decisions
   - Mentor team members
   - Establish best practices
   - Ensure code quality

2. **Strategic Thinking**
   - Align with business goals
   - Evaluate trade-offs
   - Plan for scale
   - Manage technical debt

3. **Collaboration**
   - Work across teams
   - Communicate effectively
   - Build consensus
   - Share knowledge

4. **Innovation**
   - Stay current with research
   - Experiment with new approaches
   - Contribute to community
   - Drive continuous improvement

5. **Production Excellence**
   - Ensure high availability
   - Monitor proactively
   - Optimize performance
   - Respond to incidents

Overview

This skill delivers senior-level data engineering expertise for designing and operating production-grade data pipelines, ETL/ELT systems, and data infrastructure. It combines practical knowledge of Python, SQL, Spark, Airflow, dbt, Kafka and cloud platforms with patterns for scalability, reliability, and cost efficiency. Use it to architect systems, improve pipeline performance, and establish DataOps and governance practices.

How this skill works

The skill inspects architecture, pipeline code, orchestration, and monitoring to identify bottlenecks, failure modes, and cost drivers. It recommends design patterns for batch and real-time processing, data modeling, and feature stores, and it prescribes concrete fixes: refactoring Spark jobs, applying Airflow best practices, implementing dbt models, or tuning Kafka throughput. It also outlines the observability, testing, security, and deployment steps needed to move from prototype to production.
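
As one concrete example of the kind of fix it prescribes, the sketch below shows Kafka producer settings that typically raise throughput through batching and compression (using kafka-python; the broker address, topic, and values are placeholders).

```python
# Sketch: producer settings that usually raise Kafka throughput by batching
# and compressing messages. Broker address and topic are placeholders.
import json
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",                      # durability; drop to 1 if latency matters more
    linger_ms=20,                    # wait up to 20 ms to fill larger batches
    batch_size=64 * 1024,            # 64 KB batches instead of the 16 KB default
    compression_type="gzip",         # smaller payloads -> more records per request
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for i in range(1_000):
    producer.send("clickstream-events", {"event_id": i, "action": "page_view"})

producer.flush()  # block until all buffered records are acknowledged
```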

When to use it

  • Designing or reviewing scalable data platform architectures
  • Building or refactoring ETL/ELT pipelines for reliability and performance
  • Implementing streaming or real-time inference systems
  • Establishing DataOps, testing, and monitoring practices
  • Optimizing cloud costs, throughput, or latency for data workloads

Best practices

  • Adopt test-driven development and version control for all pipeline code
  • Use modular data models and dbt for reproducible transformations
  • Orchestrate with Airflow or equivalent, with retries, SLA alerts, and idempotency (see the DAG sketch after this list)
  • Instrument pipelines with metrics, tracing, and structured logs for observability
  • Enforce security: encryption, access control, PII handling, and regular audits
  • Automate deployments with CI/CD and use canary/feature-flag releases
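
For the orchestration bullet above, the sketch below shows what retries, an SLA alert, and an idempotent, date-parameterized task look like in an Airflow DAG (assuming Airflow 2.4+; the DAG id, schedule, and callable are placeholders).

```python
# Sketch: Airflow DAG with retries, an SLA, and an idempotent daily task.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_partition(ds: str, **_) -> None:
    # `ds` is the logical date; always (re)writing exactly that partition makes
    # reruns and backfills safe (idempotent).
    print(f"overwriting partition dt={ds}")

with DAG(
    dag_id="orders_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={
        "retries": 3,
        "retry_delay": timedelta(minutes=5),
        "sla": timedelta(hours=1),  # alert if a run exceeds one hour
    },
) as dag:
    PythonOperator(task_id="load_orders_partition", python_callable=load_partition)
```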

Example use cases

  • Refactor legacy ETL to Spark for horizontal scaling and cost reduction
  • Build a Kafka-backed real-time feature pipeline for low-latency ML inference
  • Design a golden-table architecture with dbt and incremental models
  • Implement DataOps workflows: CI tests, deployment pipelines, and monitoring
  • Tune Airflow DAGs and resource configs to meet P95 latency targets

FAQ

What languages and tools does this skill prioritize?

The primary focus is Python and SQL, plus Spark, Airflow, dbt, Kafka, and cloud-native tooling (AWS/GCP/Azure).

Can it help with both batch and real-time systems?

Yes. It covers patterns for batch ETL, streaming ingestion, and real-time inference with latency and throughput guidance.

Does it include security and compliance guidance?

Yes. Recommendations include encryption, access controls, PII handling, GDPR/CCPA considerations, and audit practices.