devops-expert skill

/stdlib/devops/devops-expert

This skill provides expert DevOps guidance on automation, CI/CD pipelines, and infrastructure for faster, more reliable deployments.

npx playbooks add skill personamanagmentlayer/pcl --skill devops-expert

Review the SKILL.md content below or copy the command above to add this skill to your agents.

---
name: devops-expert
version: 1.0.0
description: Expert-level DevOps practices, culture, automation, and continuous delivery
category: devops
tags: [devops, ci-cd, automation, infrastructure, culture]
allowed-tools:
  - Read
  - Write
  - Edit
  - Bash(*)
---

# DevOps Expert

Expert guidance for DevOps practices, culture, CI/CD pipelines, infrastructure automation, and operational excellence.

## Core Concepts

### DevOps Culture
- Collaboration and communication
- Shared responsibility
- Continuous improvement
- Breaking down silos
- Blameless culture
- Measuring everything

### Automation
- Infrastructure as Code (IaC)
- Configuration management
- Deployment automation
- Testing automation
- Monitoring automation
- Self-service platforms

### CI/CD
- Continuous Integration
- Continuous Delivery
- Continuous Deployment
- Pipeline as Code
- Artifact management
- Release strategies

## CI/CD Pipeline

```yaml
# GitHub Actions Example
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run linting
        run: npm run lint

      - name: Run tests
        run: npm test

      - name: Run security scan
        run: npm audit

      - name: Upload coverage
        uses: codecov/codecov-action@v3

  build:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
      - uses: actions/checkout@v3

      - name: Log in to Container Registry
        uses: docker/login-action@v2
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v4
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}

      - name: Build and push Docker image
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}

  deploy-staging:
    needs: build
    if: github.ref == 'refs/heads/develop'
    runs-on: ubuntu-latest
    environment: staging

    steps:
      # Assumes kubectl on the runner is already authenticated against the
      # staging cluster (e.g. via a kubeconfig secret or a cloud login step).
      - name: Deploy to staging
        run: |
          kubectl set image deployment/myapp \
            myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace=staging

      - name: Wait for rollout
        run: kubectl rollout status deployment/myapp -n staging

      # Smoke tests need the repository and dependencies on the runner.
      - uses: actions/checkout@v3

      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run smoke tests
        run: npm run test:smoke

  deploy-production:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production

    steps:
      # Assumes kubectl on the runner is already authenticated against the
      # production cluster.
      - name: Deploy to production
        run: |
          kubectl set image deployment/myapp \
            myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace=production

      - name: Wait for rollout
        run: kubectl rollout status deployment/myapp -n production
```

## Infrastructure as Code

```python
# Pulumi Infrastructure as Code
import json

import pulumi
import pulumi_aws as aws

# VPC
vpc = aws.ec2.Vpc("main-vpc",
    cidr_block="10.0.0.0/16",
    enable_dns_hostnames=True,
    enable_dns_support=True,
    tags={"Name": "main-vpc"})

# Subnets
public_subnet = aws.ec2.Subnet("public-subnet",
    vpc_id=vpc.id,
    cidr_block="10.0.1.0/24",
    availability_zone="us-east-1a",
    map_public_ip_on_launch=True,
    tags={"Name": "public-subnet"})

private_subnet = aws.ec2.Subnet("private-subnet",
    vpc_id=vpc.id,
    cidr_block="10.0.2.0/24",
    availability_zone="us-east-1b",
    tags={"Name": "private-subnet"})

# Internet Gateway
igw = aws.ec2.InternetGateway("igw",
    vpc_id=vpc.id,
    tags={"Name": "main-igw"})

# Route Table
route_table = aws.ec2.RouteTable("public-rt",
    vpc_id=vpc.id,
    routes=[
        aws.ec2.RouteTableRouteArgs(
            cidr_block="0.0.0.0/0",
            gateway_id=igw.id,
        )
    ],
    tags={"Name": "public-rt"})

# Security Group
security_group = aws.ec2.SecurityGroup("web-sg",
    vpc_id=vpc.id,
    description="Allow HTTP and HTTPS traffic",
    ingress=[
        aws.ec2.SecurityGroupIngressArgs(
            protocol="tcp",
            from_port=80,
            to_port=80,
            cidr_blocks=["0.0.0.0/0"],
        ),
        aws.ec2.SecurityGroupIngressArgs(
            protocol="tcp",
            from_port=443,
            to_port=443,
            cidr_blocks=["0.0.0.0/0"],
        ),
    ],
    egress=[
        aws.ec2.SecurityGroupEgressArgs(
            protocol="-1",
            from_port=0,
            to_port=0,
            cidr_blocks=["0.0.0.0/0"],
        )
    ])

# IAM role for the EKS control plane
cluster_role = aws.iam.Role("eks-cluster-role",
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "eks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }))

aws.iam.RolePolicyAttachment("eks-cluster-policy",
    role=cluster_role.name,
    policy_arn="arn:aws:iam::aws:policy/AmazonEKSClusterPolicy")

# EKS Cluster
cluster = aws.eks.Cluster("app-cluster",
    role_arn=cluster_role.arn,
    vpc_config=aws.eks.ClusterVpcConfigArgs(
        subnet_ids=[public_subnet.id, private_subnet.id],
        security_group_ids=[security_group.id],
    ))

# Export outputs
pulumi.export("vpc_id", vpc.id)
pulumi.export("cluster_name", cluster.name)
pulumi.export("cluster_endpoint", cluster.endpoint)
```

## Deployment Strategies

```python
from typing import List, Dict
import time

class DeploymentStrategy:
    """Sketch of common deployment strategies.

    Platform-specific helpers (deploy_environment, run_tests, switch_traffic,
    deploy_canary, monitor_canary_metrics, update_canary_traffic, check_health,
    rollback, get_instances, update_instance, wait_for_healthy) are assumed to
    be implemented against your orchestrator or deployment tooling.
    """

    def __init__(self, service_name: str):
        self.service_name = service_name

    def blue_green_deployment(self, blue_version: str, green_version: str):
        """Blue-Green deployment"""
        # Deploy green environment
        self.deploy_environment("green", green_version)

        # Run tests on green
        if self.run_tests("green"):
            # Switch traffic to green
            self.switch_traffic("green")

            # Keep blue for rollback
            print(f"Deployment successful. Blue ({blue_version}) kept for rollback.")
        else:
            # Rollback - keep blue active
            print("Tests failed on green. Keeping blue active.")

    def canary_deployment(self, current_version: str, new_version: str,
                         canary_percentage: int = 10):
        """Canary deployment"""
        # Deploy canary with small percentage
        self.deploy_canary(new_version, canary_percentage)

        # Monitor metrics
        metrics = self.monitor_canary_metrics(duration_minutes=10)

        if metrics['error_rate'] < 0.1 and metrics['latency_p95'] < 500:
            # Gradually increase canary traffic
            for percentage in [25, 50, 75, 100]:
                self.update_canary_traffic(percentage)
                time.sleep(300)  # 5 minutes between increases

                if not self.check_health():
                    self.rollback(current_version)
                    return False

            print(f"Canary deployment successful: {new_version}")
            return True
        else:
            self.rollback(current_version)
            print("Canary deployment failed - rolled back")
            return False

    def rolling_deployment(self, version: str, batch_size: int = 1):
        """Rolling deployment"""
        instances = self.get_instances()

        for i in range(0, len(instances), batch_size):
            batch = instances[i:i + batch_size]

            # Update batch
            for instance in batch:
                self.update_instance(instance, version)
                self.wait_for_healthy(instance)

            # Verify batch health
            if not self.check_health():
                print(f"Rolling deployment failed at batch {i//batch_size + 1}")
                return False

        print(f"Rolling deployment successful: {version}")
        return True

    def feature_flag_deployment(self, feature_name: str, enabled: bool,
                               rollout_percentage: int = 100):
        """Feature flag based deployment"""
        return {
            'feature': feature_name,
            'enabled': enabled,
            'rollout_percentage': rollout_percentage,
            'targeting': {
                'user_segments': ['beta_users'] if rollout_percentage < 100 else ['all']
            }
        }
```
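
A brief usage sketch follows; it assumes the platform-specific helpers noted in the class docstring are implemented against your deployment tooling, and the service name and versions are placeholders.

```python
# Hypothetical driver for the DeploymentStrategy sketch above.
strategy = DeploymentStrategy("checkout-service")

promoted = strategy.canary_deployment(
    current_version="1.4.2",   # placeholder versions
    new_version="1.5.0",
    canary_percentage=10,
)

if not promoted:
    # canary_deployment already rolled back; surface the failure to the pipeline
    raise SystemExit("Canary deployment failed and was rolled back")
```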

## Configuration Management

```python
from typing import Any, Dict, List
import yaml
import json

class ConfigurationManager:
    """Manage application configuration"""

    def __init__(self, environment: str):
        self.environment = environment
        self.config = {}

    def load_config(self, config_file: str):
        """Load configuration from file"""
        with open(config_file, 'r') as f:
            if config_file.endswith('.yaml') or config_file.endswith('.yml'):
                self.config = yaml.safe_load(f)
            elif config_file.endswith('.json'):
                self.config = json.load(f)

    def get(self, key: str, default: Any = None) -> Any:
        """Get configuration value"""
        keys = key.split('.')
        value = self.config

        for k in keys:
            if isinstance(value, dict):
                value = value.get(k)
            else:
                return default

            if value is None:
                return default

        return value

    def merge_environment_config(self, env_config: Dict):
        """Merge environment-specific configuration"""
        self.config = self._deep_merge(self.config, env_config)

    def _deep_merge(self, base: Dict, override: Dict) -> Dict:
        """Deep merge two dictionaries"""
        result = base.copy()

        for key, value in override.items():
            if key in result and isinstance(result[key], dict) and isinstance(value, dict):
                result[key] = self._deep_merge(result[key], value)
            else:
                result[key] = value

        return result

    def validate_required_keys(self, required_keys: List[str]) -> List[str]:
        """Validate that required configuration keys exist"""
        missing = []

        for key in required_keys:
            if self.get(key) is None:
                missing.append(key)

        return missing
```
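
A short usage sketch of the ConfigurationManager above; the file path, keys, and override values are assumptions for illustration.

```python
# Hypothetical usage; config/base.yaml is assumed to define at least database.host.
config = ConfigurationManager(environment="production")
config.load_config("config/base.yaml")

# Environment-specific overrides win over base values via _deep_merge.
config.merge_environment_config({"database": {"pool_size": 20}})

missing = config.validate_required_keys(["database.host", "database.pool_size"])
if missing:
    raise RuntimeError(f"Missing required configuration keys: {missing}")

print(config.get("database.pool_size", default=10))  # -> 20
```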

## Monitoring and Observability

```python
import logging
from typing import Dict

from opencensus.stats import stats as stats_module

class ObservabilityStack:
    """Implement observability best practices"""

    def __init__(self):
        self.logger = self._setup_logging()
        self.stats = stats_module.stats
        self.view_manager = self.stats.view_manager

    def _setup_logging(self) -> logging.Logger:
        """Setup structured logging"""
        logger = logging.getLogger(__name__)
        handler = logging.StreamHandler()

        formatter = logging.Formatter(
            '{"time": "%(asctime)s", "level": "%(levelname)s", '
            '"service": "%(name)s", "message": "%(message)s"}'
        )
        handler.setFormatter(formatter)
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)

        return logger

    def log_with_context(self, level: str, message: str, **context):
        """Log with additional context attached to the log record"""
        log_func = getattr(self.logger, level)
        # Extra fields land on the LogRecord; a structured formatter (e.g.
        # python-json-logger) is needed to include them in the output.
        log_func(message, extra=context)

    def track_custom_metric(self, metric_name: str, value: float,
                           tags: Dict[str, str]):
        """Track custom application metric"""
        # Implementation would send to metrics backend
        pass

    def create_distributed_trace(self, operation_name: str):
        """Create distributed trace span"""
        # Implementation would use OpenTelemetry or similar
        pass
```
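
A minimal usage sketch of the ObservabilityStack above; the field names and metric are illustrative.

```python
obs = ObservabilityStack()

# Extra fields are attached to the log record; rendering them requires a
# formatter that emits extras (the simple formatter above prints only the message).
obs.log_with_context("info", "deployment started",
                     deployment="myapp", environment="staging")

# Stub call; a real implementation would forward to your metrics backend.
obs.track_custom_metric("deployment_duration_seconds", 42.0,
                        tags={"environment": "staging"})
```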

## Best Practices

### Culture & Process
- Foster collaboration between Dev and Ops
- Automate everything possible
- Measure and monitor continuously
- Practice blameless post-mortems
- Share knowledge and documentation
- Encourage experimentation
- Celebrate successes and learn from failures

### CI/CD
- Keep builds fast (<10 minutes)
- Run tests in parallel
- Use pipeline as code
- Implement automated rollbacks
- Require code review before merge
- Use trunk-based development
- Deploy small, frequent changes

### Infrastructure
- Use Infrastructure as Code
- Version everything (code, config, infrastructure)
- Implement disaster recovery
- Practice chaos engineering
- Use immutable infrastructure
- Automate security scanning
- Monitor cloud costs

## Anti-Patterns

❌ Manual deployments
❌ Configuration drift
❌ No automated testing
❌ Long-lived feature branches
❌ Blame culture
❌ Siloed teams
❌ Ignoring technical debt

## Resources

- The Phoenix Project: https://itrevolution.com/the-phoenix-project/
- DevOps Handbook: https://itrevolution.com/the-devops-handbook/
- State of DevOps Report: https://www.devops-research.com/research.html
- GitLab CI/CD: https://docs.gitlab.com/ee/ci/
- GitHub Actions: https://docs.github.com/en/actions

Overview

This skill provides expert-level guidance on DevOps practices, culture, automation, CI/CD pipelines, infrastructure as code, and operational excellence. It focuses on practical patterns and repeatable workflows to help teams deliver reliable software faster while improving collaboration and measuring outcomes. The content covers pipeline examples, IaC snippets, deployment strategies, configuration management, and observability recommendations.

How this skill works

I draw on core DevOps domains: culture and process, automation (IaC, configuration management, CI/CD), deployment strategies (blue/green, canary, rolling, feature flags), and monitoring/observability. I provide concrete pipeline and infrastructure examples, validate configuration patterns, and recommend operational controls such as automated rollbacks, testing gates, and telemetry. The guidance is actionable: apply the snippets, adapt the strategies to your environment, and build metrics and blameless practices into your workflow.

When to use it

  • Designing or improving CI/CD pipelines for production workloads
  • Adopting Infrastructure as Code for repeatable, versioned infrastructure
  • Choosing a deployment strategy for zero-downtime releases
  • Building observability and structured logging into services
  • Establishing DevOps culture, shared ownership, and continuous improvement

Best practices

  • Automate everything: builds, tests, deployments, and monitoring
  • Use pipeline-as-code and keep builds fast (<10 minutes) with parallel tests
  • Version control infrastructure and configuration; apply IaC consistently
  • Implement automated rollbacks, smoke tests, and progressive rollouts
  • Measure key metrics (error rate, p95 latency, deployment frequency) and run blameless post-mortems

Example use cases

  • Implementing a GitHub Actions CI/CD pipeline that runs lint, tests, security scans, builds images, and deploys to staging/production
  • Provisioning cloud networks and clusters with Pulumi to keep infrastructure in source control
  • Rolling out a canary deployment that monitors error rate and latency before promoting traffic
  • Managing environment-specific configuration with a ConfigurationManager that merges and validates required keys
  • Adding structured logging and custom metrics to support SRE and incident response workflows

FAQ

How do I choose between blue/green, canary, and rolling deployments?

Choose based on risk tolerance and infrastructure: blue/green gives fast rollback and isolation; canary minimizes blast radius but requires good metrics; rolling is resource-efficient for instance-by-instance updates. Use feature flags for gradual exposure.

What are the most important metrics to monitor during deployments?

Track error rate, p95/p99 latency, throughput, resource saturation, and deployment success rate. Correlate these with traces and logs to enable fast diagnosis and automated rollback decisions.
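
As a rough illustration of an automated rollback gate driven by those metrics, a sketch follows; the thresholds and the metrics-fetching call are assumptions to adapt to your monitoring stack.

```python
# Hypothetical rollback gate; fetch_deployment_metrics() is a placeholder for
# a query against your monitoring backend (Prometheus, Datadog, etc.).
ERROR_RATE_THRESHOLD = 0.01      # 1% of requests failing
LATENCY_P95_THRESHOLD_MS = 500   # p95 latency budget in milliseconds

def should_rollback(metrics: dict) -> bool:
    """Return True when deployment metrics breach the agreed thresholds."""
    return (
        metrics.get("error_rate", 0.0) > ERROR_RATE_THRESHOLD
        or metrics.get("latency_p95_ms", 0.0) > LATENCY_P95_THRESHOLD_MS
    )

# Example check during a rollout:
# metrics = fetch_deployment_metrics(service="myapp", window_minutes=10)
# if should_rollback(metrics):
#     trigger_rollback("myapp")
```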