---
name: devops-expert
version: 1.0.0
description: Expert-level DevOps practices, culture, automation, and continuous delivery
category: devops
tags: [devops, ci-cd, automation, infrastructure, culture]
allowed-tools:
- Read
- Write
- Edit
- Bash(*)
---
# DevOps Expert
Expert guidance for DevOps practices, culture, CI/CD pipelines, infrastructure automation, and operational excellence.
## Core Concepts
### DevOps Culture
- Collaboration and communication
- Shared responsibility
- Continuous improvement
- Breaking down silos
- Blameless culture
- Measuring everything
### Automation
- Infrastructure as Code (IaC)
- Configuration management
- Deployment automation
- Testing automation
- Monitoring automation
- Self-service platforms
### CI/CD
- Continuous Integration
- Continuous Delivery
- Continuous Deployment
- Pipeline as Code
- Artifact management
- Release strategies
## CI/CD Pipeline
```yaml
# GitHub Actions Example
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Run linting
        run: npm run lint
      - name: Run tests
        run: npm test
      - name: Run security scan
        run: npm audit
      - name: Upload coverage
        uses: codecov/codecov-action@v3

  build:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v3
      - name: Log in to Container Registry
        uses: docker/login-action@v2
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v4
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
      - name: Build and push Docker image
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}

  deploy-staging:
    needs: build
    if: github.ref == 'refs/heads/develop'
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - name: Deploy to staging
        run: |
          kubectl set image deployment/myapp \
            myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace=staging
      - name: Wait for rollout
        run: kubectl rollout status deployment/myapp -n staging
      - name: Run smoke tests
        run: npm run test:smoke

  deploy-production:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy to production
        run: |
          kubectl set image deployment/myapp \
            myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace=production
      - name: Wait for rollout
        run: kubectl rollout status deployment/myapp -n production
```
## Infrastructure as Code
```python
# Pulumi Infrastructure as Code
import json

import pulumi
import pulumi_aws as aws

# VPC
vpc = aws.ec2.Vpc("main-vpc",
    cidr_block="10.0.0.0/16",
    enable_dns_hostnames=True,
    enable_dns_support=True,
    tags={"Name": "main-vpc"})

# Subnets
public_subnet = aws.ec2.Subnet("public-subnet",
    vpc_id=vpc.id,
    cidr_block="10.0.1.0/24",
    availability_zone="us-east-1a",
    map_public_ip_on_launch=True,
    tags={"Name": "public-subnet"})

private_subnet = aws.ec2.Subnet("private-subnet",
    vpc_id=vpc.id,
    cidr_block="10.0.2.0/24",
    availability_zone="us-east-1b",
    tags={"Name": "private-subnet"})

# Internet Gateway
igw = aws.ec2.InternetGateway("igw",
    vpc_id=vpc.id,
    tags={"Name": "main-igw"})

# Route Table
route_table = aws.ec2.RouteTable("public-rt",
    vpc_id=vpc.id,
    routes=[
        aws.ec2.RouteTableRouteArgs(
            cidr_block="0.0.0.0/0",
            gateway_id=igw.id,
        )
    ],
    tags={"Name": "public-rt"})

# Security Group
security_group = aws.ec2.SecurityGroup("web-sg",
    vpc_id=vpc.id,
    description="Allow HTTP and HTTPS traffic",
    ingress=[
        aws.ec2.SecurityGroupIngressArgs(
            protocol="tcp",
            from_port=80,
            to_port=80,
            cidr_blocks=["0.0.0.0/0"],
        ),
        aws.ec2.SecurityGroupIngressArgs(
            protocol="tcp",
            from_port=443,
            to_port=443,
            cidr_blocks=["0.0.0.0/0"],
        ),
    ],
    egress=[
        aws.ec2.SecurityGroupEgressArgs(
            protocol="-1",
            from_port=0,
            to_port=0,
            cidr_blocks=["0.0.0.0/0"],
        )
    ])

# IAM role assumed by the EKS control plane
# (defined here so cluster_role below is not left undefined)
cluster_role = aws.iam.Role("cluster-role",
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "eks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }))

aws.iam.RolePolicyAttachment("cluster-policy",
    role=cluster_role.name,
    policy_arn="arn:aws:iam::aws:policy/AmazonEKSClusterPolicy")

# EKS Cluster
cluster = aws.eks.Cluster("app-cluster",
    role_arn=cluster_role.arn,
    vpc_config=aws.eks.ClusterVpcConfigArgs(
        subnet_ids=[public_subnet.id, private_subnet.id],
        security_group_ids=[security_group.id],
    ))

# Export outputs
pulumi.export("vpc_id", vpc.id)
pulumi.export("cluster_name", cluster.name)
pulumi.export("cluster_endpoint", cluster.endpoint)
```
## Deployment Strategies
```python
from typing import List, Dict
import time
class DeploymentStrategy:
"""Implement various deployment strategies"""
def __init__(self, service_name: str):
self.service_name = service_name
def blue_green_deployment(self, blue_version: str, green_version: str):
"""Blue-Green deployment"""
# Deploy green environment
self.deploy_environment("green", green_version)
# Run tests on green
if self.run_tests("green"):
# Switch traffic to green
self.switch_traffic("green")
# Keep blue for rollback
print(f"Deployment successful. Blue ({blue_version}) kept for rollback.")
else:
# Rollback - keep blue active
print("Tests failed on green. Keeping blue active.")
def canary_deployment(self, current_version: str, new_version: str,
canary_percentage: int = 10):
"""Canary deployment"""
# Deploy canary with small percentage
self.deploy_canary(new_version, canary_percentage)
# Monitor metrics
metrics = self.monitor_canary_metrics(duration_minutes=10)
if metrics['error_rate'] < 0.1 and metrics['latency_p95'] < 500:
# Gradually increase canary traffic
for percentage in [25, 50, 75, 100]:
self.update_canary_traffic(percentage)
time.sleep(300) # 5 minutes between increases
if not self.check_health():
self.rollback(current_version)
return False
print(f"Canary deployment successful: {new_version}")
return True
else:
self.rollback(current_version)
print("Canary deployment failed - rolled back")
return False
def rolling_deployment(self, version: str, batch_size: int = 1):
"""Rolling deployment"""
instances = self.get_instances()
for i in range(0, len(instances), batch_size):
batch = instances[i:i + batch_size]
# Update batch
for instance in batch:
self.update_instance(instance, version)
self.wait_for_healthy(instance)
# Verify batch health
if not self.check_health():
print(f"Rolling deployment failed at batch {i//batch_size + 1}")
return False
print(f"Rolling deployment successful: {version}")
return True
def feature_flag_deployment(self, feature_name: str, enabled: bool,
rollout_percentage: int = 100):
"""Feature flag based deployment"""
return {
'feature': feature_name,
'enabled': enabled,
'rollout_percentage': rollout_percentage,
'targeting': {
'user_segments': ['beta_users'] if rollout_percentage < 100 else ['all']
}
}
```
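The feature-flag entry above records a rollout percentage, but deciding whether a given user actually sees the feature requires deterministic bucketing. A minimal sketch (the SHA-256 bucketing scheme is an illustrative choice, not a prescribed one):

```python
import hashlib


def is_feature_enabled(feature_name: str, user_id: str,
                       rollout_percentage: int) -> bool:
    """Deterministically bucket a user into [0, 100) and compare to the rollout."""
    digest = hashlib.sha256(f"{feature_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percentage


# The same user always lands in the same bucket, so a partial rollout is
# stable across requests instead of flickering on and off per request.
enabled_for_user = is_feature_enabled("new-checkout", "user-42", 50)
```

Hashing the feature name together with the user id also decorrelates rollouts, so the same 10% of users are not always the guinea pigs for every flag.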
## Configuration Management
```python
from typing import Any, Dict, List
import yaml
import json


class ConfigurationManager:
    """Manage application configuration"""

    def __init__(self, environment: str):
        self.environment = environment
        self.config: Dict[str, Any] = {}

    def load_config(self, config_file: str):
        """Load configuration from a YAML or JSON file"""
        with open(config_file, 'r') as f:
            if config_file.endswith('.yaml') or config_file.endswith('.yml'):
                self.config = yaml.safe_load(f)
            elif config_file.endswith('.json'):
                self.config = json.load(f)

    def get(self, key: str, default: Any = None) -> Any:
        """Get a configuration value by dotted key, e.g. 'db.host'"""
        keys = key.split('.')
        value = self.config
        for k in keys:
            if isinstance(value, dict):
                value = value.get(k)
            else:
                return default
            if value is None:
                return default
        return value

    def merge_environment_config(self, env_config: Dict):
        """Merge environment-specific configuration over the base config"""
        self.config = self._deep_merge(self.config, env_config)

    def _deep_merge(self, base: Dict, override: Dict) -> Dict:
        """Deep merge two dictionaries; the override wins for non-dict values"""
        result = base.copy()
        for key, value in override.items():
            if key in result and isinstance(result[key], dict) and isinstance(value, dict):
                result[key] = self._deep_merge(result[key], value)
            else:
                result[key] = value
        return result

    def validate_required_keys(self, required_keys: List[str]) -> List[str]:
        """Validate that required configuration keys exist"""
        missing = []
        for key in required_keys:
            if self.get(key) is None:
                missing.append(key)
        return missing
```
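The deep-merge semantics are the crux of layered configuration: nested dicts merge recursively, everything else is replaced by the override. A standalone sketch of the same logic, showing how an environment override layers onto a base config (the config values are made up for illustration):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base without mutating either input."""
    result = base.copy()
    for key, value in override.items():
        if key in result and isinstance(result[key], dict) and isinstance(value, dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result


base = {"db": {"host": "localhost", "port": 5432}, "debug": True}
prod = {"db": {"host": "db.prod.internal"}, "debug": False}

merged = deep_merge(base, prod)
# merged == {"db": {"host": "db.prod.internal", "port": 5432}, "debug": False}
# Note: db.port survives because nested dicts merge rather than replace.
```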
## Monitoring and Observability
```python
import logging
from typing import Dict

from opencensus.stats import stats as stats_module


class ObservabilityStack:
    """Implement observability best practices"""

    def __init__(self):
        self.logger = self._setup_logging()
        self.stats = stats_module.stats
        self.view_manager = self.stats.view_manager

    def _setup_logging(self) -> logging.Logger:
        """Set up structured (JSON) logging"""
        logger = logging.getLogger(__name__)
        handler = logging.StreamHandler()
        formatter = logging.Formatter(
            '{"time": "%(asctime)s", "level": "%(levelname)s", '
            '"service": "%(name)s", "message": "%(message)s"}'
        )
        handler.setFormatter(formatter)
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        return logger

    def log_with_context(self, level: str, message: str, **context):
        """Log with additional context"""
        log_func = getattr(self.logger, level)
        log_func(message, extra=context)

    def track_custom_metric(self, metric_name: str, value: float,
                            tags: Dict[str, str]):
        """Track a custom application metric"""
        # Implementation would send to a metrics backend
        pass

    def create_distributed_trace(self, operation_name: str):
        """Create a distributed trace span"""
        # Implementation would use OpenTelemetry or similar
        pass
```
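A quick way to sanity-check the structured-logging format is to point the same JSON formatter at an in-memory stream and parse what comes out (a standalone sketch, independent of the class above; the logger name is arbitrary):

```python
import io
import json
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    '{"time": "%(asctime)s", "level": "%(levelname)s", '
    '"service": "%(name)s", "message": "%(message)s"}'
))

logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order processed")

# Each emitted line is machine-parseable, so log aggregators can index fields.
record = json.loads(stream.getvalue())
```

Caveat: naive string interpolation breaks if the message itself contains quotes or newlines; a production setup would serialize via `json.dumps` (e.g. with a library like `python-json-logger`) instead of a format string.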
## Best Practices
### Culture & Process
- Foster collaboration between Dev and Ops
- Automate everything possible
- Measure and monitor continuously
- Practice blameless post-mortems
- Share knowledge and documentation
- Encourage experimentation
- Celebrate successes and learn from failures
### CI/CD
- Keep builds fast (<10 minutes)
- Run tests in parallel
- Use pipeline as code
- Implement automated rollbacks
- Require code review before merge
- Use trunk-based development
- Deploy small, frequent changes
### Infrastructure
- Use Infrastructure as Code
- Version everything (code, config, infrastructure)
- Implement disaster recovery
- Practice chaos engineering
- Use immutable infrastructure
- Automate security scanning
- Monitor cloud costs
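Configuration drift, listed under anti-patterns below, can be caught by diffing the declared state against what is actually running. A minimal sketch of the comparison logic, assuming both states have already been fetched as plain dicts (the keys and values here are illustrative):

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Return the keys whose actual value differs from the desired state."""
    drift = {}
    for key, want in desired.items():
        have = actual.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            nested = detect_drift(want, have)
            if nested:
                drift[key] = nested
        elif have != want:
            drift[key] = {"desired": want, "actual": have}
    return drift


desired = {"replicas": 3, "image": "myapp:1.4.2", "env": {"LOG_LEVEL": "info"}}
actual = {"replicas": 2, "image": "myapp:1.4.2", "env": {"LOG_LEVEL": "debug"}}

drift = detect_drift(desired, actual)
# Flags replicas and env.LOG_LEVEL; the matching image is not reported.
```

Real IaC tools (e.g. `terraform plan`, `pulumi preview`) do this comparison against live provider APIs; the value of running it on a schedule is turning silent drift into an alert.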
## Anti-Patterns
❌ Manual deployments
❌ Configuration drift
❌ No automated testing
❌ Long-lived feature branches
❌ Blame culture
❌ Siloed teams
❌ Ignoring technical debt
## Resources
- The Phoenix Project: https://itrevolution.com/the-phoenix-project/
- DevOps Handbook: https://itrevolution.com/the-devops-handbook/
- State of DevOps Report: https://www.devops-research.com/research.html
- GitLab CI/CD: https://docs.gitlab.com/ee/ci/
- GitHub Actions: https://docs.github.com/en/actions
## FAQ

**How do I choose between blue/green, canary, and rolling deployments?**

Choose based on risk tolerance and infrastructure: blue/green gives fast rollback and isolation, canary minimizes blast radius but requires good metrics, and rolling is resource-efficient for instance-by-instance updates. Use feature flags for gradual exposure independent of the deployment mechanism.

**What are the most important metrics to monitor during deployments?**

Track error rate, p95/p99 latency, throughput, resource saturation, and deployment success rate. Correlate these with traces and logs to enable fast diagnosis and automated rollback decisions.
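Those metrics become actionable when turned into an explicit gate that a pipeline can evaluate after each traffic increase. A sketch of such a gate (the threshold values are illustrative, not recommendations):

```python
def should_roll_back(metrics: dict,
                     max_error_rate: float = 0.01,
                     max_p95_ms: float = 500.0) -> bool:
    """Decide whether post-deploy metrics breach the rollback thresholds."""
    return (metrics["error_rate"] > max_error_rate
            or metrics["latency_p95_ms"] > max_p95_ms)


healthy = {"error_rate": 0.002, "latency_p95_ms": 180}
degraded = {"error_rate": 0.05, "latency_p95_ms": 950}
# should_roll_back(healthy) -> False; should_roll_back(degraded) -> True
```

Wiring this into the canary loop shown earlier (roll back as soon as the gate trips) is what makes rollbacks automated rather than a 3 a.m. judgment call.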