devops-iac-engineer skill

/.agents/skills/devops-iac-engineer

This skill guides designing and operating cloud infrastructure with IaC, CI/CD, observability, and SRE practices for reliable deployments.

npx playbooks add skill first-fluke/fullstack-starter --skill devops-iac-engineer

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
2.3 KB
---
name: devops-iac-engineer
description: Expert guidance for designing, implementing, and maintaining cloud infrastructure using Infrastructure as Code (IaC) principles. Use this skill for architecting cloud solutions, setting up CI/CD pipelines, implementing observability, and following SRE best practices.
---

# DevOps IaC Engineer

This skill provides expertise in designing and managing cloud infrastructure using Infrastructure as Code (IaC) and DevOps/SRE best practices.

## When to Use

- Designing cloud architecture (AWS, GCP, Azure)
- Implementing or refactoring CI/CD pipelines
- Setting up observability (logging, metrics, tracing)
- Creating Kubernetes clusters and container orchestration strategies
- Implementing security controls and compliance checks
- Improving system reliability (SLO/SLA, Disaster Recovery)

## Infrastructure as Code (IaC) Principles

- **Declarative Code**: Use Terraform/OpenTofu to define the desired state.
- **GitOps**: The Git repository is the single source of truth. Changes are applied through pull requests and automated pipelines.
- **Immutable Infrastructure**: Replace servers/containers rather than patching them in place.

## Core Domains

### 1. Terraform & IaC
- Use modules for reusability.
- Separate state by environment (dev, stage, prod) and region.
- Automate `plan` and `apply` in CI/CD.
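
One way to keep state separated is to derive the backend location from the environment and region instead of typing it by hand. The sketch below is only an illustration: the `state_location` helper, the bucket naming scheme, and the project name are assumptions, not part of this skill or of Terraform itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateLocation:
    """Where one Terraform state file lives (hypothetical naming scheme)."""
    bucket: str
    key: str


def state_location(project: str, env: str, region: str) -> StateLocation:
    """Return a distinct bucket and key for each environment/region pair."""
    allowed_envs = {"dev", "stage", "prod"}
    if env not in allowed_envs:
        raise ValueError(f"unknown environment: {env!r}")
    # One bucket per environment keeps a bad apply in dev away from prod state;
    # the region goes in the key so regional stacks never share a state file.
    return StateLocation(
        bucket=f"{project}-tfstate-{env}",
        key=f"{region}/terraform.tfstate",
    )
```

A CI job could pass the resulting values to `terraform init` through `-backend-config` arguments, so that each environment and region pair resolves to its own state file.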

### 2. Kubernetes & Containers
- Build small, stateless containers.
- Use Helm or Kustomize for resource management.
- Implement resource limits and requests.
- Use namespaces for isolation.
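
The requests-and-limits rule can be checked mechanically before a manifest reaches the cluster. The following is a minimal sketch, assuming the pod spec has already been parsed into a plain dictionary; the function name `missing_resource_settings` is hypothetical, and a real pipeline would more likely rely on an admission policy or a dedicated linter.

```python
from typing import Any


def missing_resource_settings(pod_spec: dict[str, Any]) -> list[str]:
    """List containers that lack CPU or memory requests or limits."""
    problems = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            values = resources.get(section, {})
            for resource in ("cpu", "memory"):
                if resource not in values:
                    problems.append(
                        f"{container['name']}: missing {section}.{resource}"
                    )
    return problems
```

An empty list means every container declares both sections; any other result can fail the CI job with the listed messages.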

### 3. CI/CD Pipelines
- **CI**: Lint, test, build, and run security scans on every commit.
- **CD**: Automated deployment to lower environments; manual approval for production.
- Use tools like GitHub Actions, Cloud Build, or ArgoCD.
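
The promotion rule (automatic for lower environments, gated for production) can be expressed as a small decision function. This is a sketch under assumptions: the environment names and the `may_deploy` helper are illustrative, and in practice the gate is usually a feature of the CI/CD tool rather than custom code.

```python
from typing import Optional

LOWER_ENVIRONMENTS = {"dev", "stage"}  # assumed environment names


def may_deploy(env: str, checks_passed: bool,
               approved_by: Optional[str] = None) -> bool:
    """Decide whether a pipeline run may deploy to `env`."""
    if not checks_passed:
        return False  # CI (lint, test, build, scan) must be green first
    if env in LOWER_ENVIRONMENTS:
        return True  # lower environments deploy automatically
    if env == "prod":
        return approved_by is not None  # production waits for a human approver
    raise ValueError(f"unknown environment: {env!r}")
```

With this rule, `may_deploy("prod", True)` stays false until an approver is recorded, while a green build reaches `dev` and `stage` without intervention.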

### 4. Observability
- **Logs**: Centralized logging (e.g., Cloud Logging, ELK).
- **Metrics**: Prometheus/Grafana or Cloud Monitoring.
- **Tracing**: OpenTelemetry for distributed tracing.

### 5. Security (DevSecOps)
- Scan IaC for misconfigurations (e.g., Checkov, Trivy).
- Manage secrets with Secret Manager or Vault; never store them in code.
- Apply least-privilege IAM roles.
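
Keeping secrets out of code usually means the application reads them from its runtime environment, where the platform has injected them from the secret store. A minimal sketch, assuming the secret arrives as an environment variable; the `require_secret` helper and the variable name in the usage note are hypothetical.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret was not injected into the environment."""


def require_secret(name: str) -> str:
    """Read a secret injected at runtime; never fall back to a hardcoded value."""
    value = os.environ.get(name)
    if not value:
        # Failing fast is safer than continuing with an empty credential.
        raise MissingSecretError(f"secret {name!r} is not set")
    return value
```

For example, `require_secret("DB_PASSWORD")` returns the injected value or stops the process at startup, so a missing secret is never silently replaced by a default.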

## SRE Practices

- **SLI/SLO**: Define Service Level Indicators and Objectives for critical user journeys.
- **Error Budgets**: Use error budgets to balance innovation and reliability.
- **Post-Mortems**: Conduct blameless post-mortems for incidents.
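
An error budget is the amount of unreliability an SLO permits. For a request-based SLO, the budget is the share of requests allowed to fail, and the remaining budget is what incidents have not yet consumed. The sketch below assumes a simple request-count SLI; the function name is illustrative.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    # The budget is the number of requests the SLO allows to fail.
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        raise ValueError("no requests in the window, budget is undefined")
    return 1 - failed_requests / allowed_failures
```

With a 99.9% SLO over 1,000,000 requests, about 1,000 failures are allowed, so 250 failures leave roughly 75% of the budget. A negative result signals that the budget is exhausted and that reliability work should take priority over new releases.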

Overview

This skill provides hands-on guidance for designing, implementing, and operating cloud infrastructure using Infrastructure as Code (IaC) and SRE/DevOps best practices. It focuses on practical patterns for Terraform, Kubernetes, CI/CD, observability, and security across cloud providers. Use it to move from ad hoc scripts to reproducible, production-ready infrastructure and pipelines.

How this skill works

The skill inspects your architecture goals and current practices, then recommends declarative IaC patterns, module boundaries, and state isolation strategies. It outlines CI/CD flows that automate plan/apply, security scans, and promotion gates, and it prescribes observability and SRE controls like SLOs, error budgets, and incident processes. It includes concrete implementation advice for Terraform, container orchestration, and secret management.

When to use it

  • Designing or refactoring cloud architecture across AWS, GCP, or Azure
  • Creating reusable Terraform modules and separating state per environment
  • Building CI/CD pipelines that run lint, tests, security scans, and automated deploys
  • Setting up Kubernetes clusters, namespace strategies, and resource policies
  • Implementing centralized logging, metrics, and distributed tracing
  • Operationalizing SRE practices: SLI/SLOs, error budgets, and blameless post-mortems

Best practices

  • Treat Git as the single source of truth and apply changes via PR-driven pipelines (GitOps)
  • Keep infrastructure declarative and idempotent; prefer replacing resources over in-place edits
  • Use modular Terraform with clear boundaries and state separation by env and region
  • Enforce security by scanning IaC, storing secrets in a secret manager, and applying least-privilege IAM
  • Automate observability: collect logs, expose metrics, and instrument traces with OpenTelemetry
  • Define SLOs and track error budgets to guide release cadence and reliability trade-offs

Example use cases

  • Convert ad hoc cloud scripts into modular Terraform with remote state and CI-based plan/apply
  • Design a GitHub Actions pipeline that lints, tests, scans, builds containers, and deploys to Kubernetes
  • Implement cluster onboarding: namespaces, resource quotas, network policies, and RBAC
  • Add end-to-end observability: central logging, Prometheus metrics, Grafana dashboards, and tracing
  • Establish SLOs for critical user journeys and create an incident response runbook with post-mortem templates

FAQ

Which IaC tool should I choose, Terraform or OpenTofu?

Both tools are declarative and module-based, so choose the one whose community and provider support cover your needs. Prefer the ecosystem that matches your organizational requirements and CI integrations.

How do I manage secrets without exposing them in code?

Use a dedicated secret manager or Vault, grant minimal access via IAM roles, and inject secrets at runtime through the CI/CD or orchestration platform.