
senior-ml-engineer skill

/skills/senior-ml-engineer

This skill helps you productionize ML models and build scalable ML platforms, boosting deployment speed, reliability, and governance.

npx playbooks add skill questnova502/claude-skills-sync --skill senior-ml-engineer

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: senior-ml-engineer
description: World-class ML engineering skill for productionizing ML models, MLOps, and building scalable ML systems. Expertise in PyTorch, TensorFlow, model deployment, feature stores, model monitoring, and ML infrastructure. Includes LLM integration, fine-tuning, RAG systems, and agentic AI. Use when deploying ML models, building ML platforms, implementing MLOps, or integrating LLMs into production systems.
---

# Senior ML/AI Engineer

A world-class senior ML/AI engineering skill for building production-grade AI, ML, and data systems.

## Quick Start

### Main Capabilities

```bash
# Deploy a trained model through the packaged pipeline
python scripts/model_deployment_pipeline.py --input data/ --output results/

# Analyze a project and scaffold a RAG system
python scripts/rag_system_builder.py --target project/ --analyze

# Roll out the monitoring suite from a config file
python scripts/ml_monitoring_suite.py --config config.yaml --deploy
```

## Core Expertise

This skill covers world-class capabilities in:

- Advanced production patterns and architectures
- Scalable system design and implementation
- Performance optimization at scale
- MLOps and DataOps best practices
- Real-time processing and inference
- Distributed computing frameworks
- Model deployment and monitoring
- Security and compliance
- Cost optimization
- Team leadership and mentoring

## Tech Stack

**Languages:** Python, SQL, R, Scala, Go
**ML Frameworks:** PyTorch, TensorFlow, Scikit-learn, XGBoost
**Data Tools:** Spark, Airflow, dbt, Kafka, Databricks
**LLM Frameworks:** LangChain, LlamaIndex, DSPy
**Deployment:** Docker, Kubernetes, AWS/GCP/Azure
**Monitoring:** MLflow, Weights & Biases, Prometheus
**Databases:** PostgreSQL, BigQuery, Snowflake, Pinecone

## Reference Documentation

### 1. MLOps Production Patterns

Comprehensive guide available in `references/mlops_production_patterns.md` covering:

- Advanced patterns and best practices
- Production implementation strategies
- Performance optimization techniques
- Scalability considerations
- Security and compliance
- Real-world case studies

### 2. LLM Integration Guide

Complete workflow documentation in `references/llm_integration_guide.md` including:

- Step-by-step processes
- Architecture design patterns
- Tool integration guides
- Performance tuning strategies
- Troubleshooting procedures

### 3. RAG System Architecture

Technical reference guide in `references/rag_system_architecture.md` with:

- System design principles
- Implementation examples
- Configuration best practices
- Deployment strategies
- Monitoring and observability
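The retrieval step at the heart of these designs can be sketched with plain cosine similarity. This is a minimal illustration, not the reference implementation: the hard-coded three-dimensional vectors stand in for embeddings produced by a real embedding model, and the in-memory list stands in for a vector DB such as Pinecone.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, top_k=2):
    """Rank stored (doc_id, vector) pairs by similarity to the query."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]

# Toy index: in production these vectors come from an embedding model.
index = [
    ("doc_a", [1.0, 0.0, 0.0]),
    ("doc_b", [0.0, 1.0, 0.0]),
    ("doc_c", [0.9, 0.1, 0.0]),
]
print(retrieve([1.0, 0.05, 0.0], index))  # doc_a and doc_c are closest
```

The retrieved document IDs would then be resolved to text chunks and passed to the LLM as context; relevance monitoring (covered in the reference guide) watches this ranking step for quality regressions.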

## Production Patterns

### Pattern 1: Scalable Data Processing

Enterprise-scale data processing with distributed computing:

- Horizontal scaling architecture
- Fault-tolerant design
- Real-time and batch processing
- Data quality validation
- Performance monitoring
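The data quality validation step above can be sketched as a per-partition check that quarantines bad rows instead of failing the whole job. The schema format and field names here are illustrative assumptions; in a distributed setting the same function would run inside, e.g., a Spark `mapPartitions` call.

```python
def validate_batch(records, schema):
    """Split a batch into valid/invalid rows against a simple type schema.

    Invalid rows are quarantined rather than dropped, so they can be
    inspected and replayed after a fix.
    """
    valid, invalid = [], []
    for row in records:
        ok = all(
            field in row and isinstance(row[field], expected)
            for field, expected in schema.items()
        )
        (valid if ok else invalid).append(row)
    return valid, invalid

schema = {"user_id": int, "amount": float}
batch = [
    {"user_id": 1, "amount": 9.99},
    {"user_id": "oops", "amount": 9.99},  # wrong type -> quarantined
    {"user_id": 2},                       # missing field -> quarantined
]
valid, invalid = validate_batch(batch, schema)
print(len(valid), len(invalid))  # 1 2
```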

### Pattern 2: ML Model Deployment

Production ML system with high availability:

- Model serving with low latency
- A/B testing infrastructure
- Feature store integration
- Model monitoring and drift detection
- Automated retraining pipelines
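One way to sketch the drift-detection piece of this pattern is a rolling z-score check on a prediction or feature stream: compare the recent window's mean against a reference distribution captured at training time. The window size and threshold below are illustrative defaults, not recommendations.

```python
from collections import deque
import statistics

class DriftMonitor:
    """Flag drift when the recent mean of a score stream moves more than
    `threshold` standard deviations away from a reference sample."""

    def __init__(self, reference, window=50, threshold=3.0):
        self.ref_mean = statistics.mean(reference)
        self.ref_std = statistics.stdev(reference)
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        """Record one observation; return True once drift is detected."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data yet
        z = abs(statistics.mean(self.window) - self.ref_mean) / self.ref_std
        return z > self.threshold

# Reference scores cluster around 0.52; live scores jump to 0.9.
reference = [0.5 + 0.01 * (i % 5) for i in range(200)]
monitor = DriftMonitor(reference, window=20)
drifted = [monitor.observe(0.9) for _ in range(20)]
print(drifted[-1])  # True once the window fills
```

In a real deployment the `True` signal would page on-call and/or enqueue a retraining job with validation gates, rather than being read from a list.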

### Pattern 3: Real-Time Inference

High-throughput inference system:

- Batching and caching strategies
- Load balancing
- Auto-scaling
- Latency optimization
- Cost optimization
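The batching and caching strategies above can be sketched together in a few lines: group requests into micro-batches so a real model can amortize per-call overhead, and memoize repeated inputs. `predict_one` is a stand-in for an actual model invocation, and the batch size is an arbitrary placeholder.

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def predict_one(features):
    # Stand-in for a real model call; the cache serves repeated inputs
    # without recomputation.
    return sum(features) > 1.0

def predict_batch(requests, batch_size=8):
    """Process requests in micro-batches; identical inputs hit the cache."""
    results = []
    for start in range(0, len(requests), batch_size):
        chunk = requests[start:start + batch_size]
        results.extend(predict_one(tuple(f)) for f in chunk)
    return results

reqs = [[0.2, 0.3], [0.9, 0.4], [0.2, 0.3]]
print(predict_batch(reqs))  # [False, True, False]
```

A production server would additionally bound how long a partial batch may wait (a latency/throughput trade-off), which this sketch omits.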

## Best Practices

### Development

- Test-driven development
- Code reviews and pair programming
- Documentation as code
- Version control everything
- Continuous integration

### Production

- Monitor everything critical
- Automate deployments
- Feature flags for releases
- Canary deployments
- Comprehensive logging
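The canary-deployment idea above can be sketched as deterministic hash-based routing: a small, stable slice of users hits the new version, and everyone else stays on stable. The 5% split and the `user-N` IDs are illustrative assumptions.

```python
import hashlib

def route(user_id, canary_percent=5):
    """Deterministically send a small, stable slice of users to the canary.

    Hash-based bucketing pins each user to the same variant across
    requests, which keeps canary-vs-stable metrics comparable.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"

assignments = [route(f"user-{i}") for i in range(1000)]
print(assignments.count("canary"))  # roughly 50 of 1000 at a 5% split
```

The same bucketing primitive also backs feature flags and A/B tests; only the percentage and the decision taken on the result differ.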

### Team Leadership

- Mentor junior engineers
- Drive technical decisions
- Establish coding standards
- Foster learning culture
- Cross-functional collaboration

## Performance Targets

**Latency:**
- P50: < 50ms
- P95: < 100ms
- P99: < 200ms

**Throughput:**
- Requests/second: > 1000
- Concurrent users: > 10,000

**Availability:**
- Uptime: 99.9%
- Error rate: < 0.1%
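Verifying targets like these starts with computing the percentiles correctly from raw latency samples. A minimal sketch using only the standard library (the uniform sample data is synthetic):

```python
import statistics

def latency_percentiles(samples_ms):
    """Compute P50/P95/P99 from a list of latency samples in milliseconds."""
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

samples = list(range(1, 101))  # 1ms..100ms, uniformly spread
print(latency_percentiles(samples))  # p50 ~50.5, p95 ~95.05, p99 ~99.01
```

Note that tail percentiles need enough samples to be meaningful: with fewer than a few hundred observations, a P99 figure is mostly noise.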

## Security & Compliance

- Authentication & authorization
- Data encryption (at rest & in transit)
- PII handling and anonymization
- GDPR/CCPA compliance
- Regular security audits
- Vulnerability management
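One common building block for the PII handling above is keyed pseudonymization: replace identifying fields with HMAC digests so records stay joinable (same input, same token) without exposing raw values. The key, field names, and 16-character truncation below are illustrative placeholders only; a real key belongs in a secrets manager and must be rotatable.

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # placeholder; use a secrets manager in production

def pseudonymize(record, pii_fields=("email", "name")):
    """Replace PII fields with keyed HMAC digests; other fields pass through."""
    out = dict(record)
    for field in pii_fields:
        if field in out:
            out[field] = hmac.new(
                SECRET_KEY, str(out[field]).encode(), hashlib.sha256
            ).hexdigest()[:16]
    return out

row = {"email": "a@example.com", "name": "Ada", "amount": 42}
safe = pseudonymize(row)
print(safe["amount"], safe["email"] != row["email"])  # 42 True
```

Pseudonymization is not full anonymization (GDPR treats keyed tokens as personal data while the key exists), so it complements, rather than replaces, access control and encryption.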

## Common Commands

```bash
# Development
python -m pytest tests/ -v --cov
python -m black src/
python -m pylint src/

# Training
python scripts/train.py --config prod.yaml
python scripts/evaluate.py --model best.pth

# Deployment
docker build -t service:v1 .
kubectl apply -f k8s/
helm upgrade service ./charts/

# Monitoring
kubectl logs -f deployment/service
python scripts/health_check.py
```

## Resources

- Advanced Patterns: `references/mlops_production_patterns.md`
- Implementation Guide: `references/llm_integration_guide.md`
- Technical Reference: `references/rag_system_architecture.md`
- Automation Scripts: `scripts/` directory

## Senior-Level Responsibilities

As a world-class senior professional:

1. **Technical Leadership**
   - Drive architectural decisions
   - Mentor team members
   - Establish best practices
   - Ensure code quality

2. **Strategic Thinking**
   - Align with business goals
   - Evaluate trade-offs
   - Plan for scale
   - Manage technical debt

3. **Collaboration**
   - Work across teams
   - Communicate effectively
   - Build consensus
   - Share knowledge

4. **Innovation**
   - Stay current with research
   - Experiment with new approaches
   - Contribute to community
   - Drive continuous improvement

5. **Production Excellence**
   - Ensure high availability
   - Monitor proactively
   - Optimize performance
   - Respond to incidents

## Overview

This skill packages world-class senior ML engineering expertise for productionizing ML models, MLOps, and building scalable ML systems. It focuses on end-to-end delivery: architecture, deployment, monitoring, and operational excellence for both conventional ML and LLM-driven systems. Use it to design reliable, high-performance ML platforms and lead implementation across teams.

## How this skill works

The skill inspects your production requirements, infrastructure constraints, and model characteristics to recommend architecture patterns, deployment pipelines, and monitoring strategies. It translates requirements into concrete artifacts: CI/CD flows, serving/topology design, feature-store integration, retraining triggers, and observability plans. It also provides practical commands, scripts, and configuration recommendations for common stacks (Kubernetes, Docker, cloud providers, TensorFlow/PyTorch, and LLM tooling).

## When to use it

- Deploying or scaling ML models from prototype to production
- Designing MLOps pipelines, automated retraining, and feature stores
- Integrating LLMs, RAG systems, or fine-tuning workflows into products
- Implementing monitoring, drift detection, and model observability
- Optimizing inference latency, throughput, and cost for large-scale systems

## Best practices

- Design for idempotent, observable pipelines with automated tests and CI/CD
- Use feature stores and versioned data to ensure reproducible training and serving
- Implement gradual rollouts (feature flags, canary/A/B tests) and automated rollback
- Monitor both data and model signals (drift, performance, latency) and trigger retraining
- Prioritize security: auth, encryption, PII handling, and regular audits

## Example use cases

- Build a low-latency model serving infrastructure on Kubernetes with autoscaling and caching
- Create a retraining pipeline that detects data drift and triggers scheduled retrains with validation gates
- Integrate an LLM with a RAG system for product search, including vector DB and relevance monitoring
- Design cost-optimized inference for high-throughput APIs with batching and load balancing
- Establish team processes for code reviews, TDD, and mentoring to raise engineering quality

## FAQ

**What ML frameworks and deployment platforms are supported?**

Guidance covers PyTorch, TensorFlow, Scikit-learn, XGBoost, Docker, Kubernetes, and major cloud providers; it also includes LLM frameworks and vector DB integrations.

**How does this skill handle monitoring and drift detection?**

It recommends metrics to collect, tooling for model observability, alert thresholds, and automated retraining pipelines tied to drift signals and validation checks.
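As one concrete example of a drift signal that can gate retraining, here is a minimal Population Stability Index (PSI) sketch. The bin count and the thresholds in the docstring are common rules of thumb, not prescriptions; tune them per use case.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two samples of a numeric feature.

    Rough conventional reading (an assumption, not a standard):
    < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant drift.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def frac(sample):
        counts = [0] * bins
        for x in sample:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        # eps avoids log(0) for empty bins
        return [c / len(sample) + eps for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]        # uniform on [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]   # mass moved to the upper half
print(psi(baseline, baseline) < 0.1, psi(baseline, shifted) > 0.25)  # True True
```

In a pipeline, a PSI score crossing the alert threshold would raise an alarm and, combined with validation checks, trigger the automated retraining flow described above.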

**Can it help with security and compliance requirements?**

Yes. It provides patterns for authentication/authorization, encryption, PII handling, compliance considerations, and guidance for regular security reviews.