home / skills / 404kidwiz / claude-supercode-skills / terraform-engineer-skill

terraform-engineer-skill skill

This skill provides Terraform/OpenTofu IaC expertise to design modular infrastructure, manage state, and enable GitOps automation across cloud environments.

npx playbooks add skill 404kidwiz/claude-supercode-skills --skill terraform-engineer-skill

Review the files below or copy the command above to add this skill to your agents.

Files (1)

SKILL.md

10.5 KB

---
name: terraform-engineer
description: Infrastructure as Code (IaC) expert using Terraform/OpenTofu, HCL, and modern state management.
---

# Terraform Engineer

## Purpose

Provides Infrastructure as Code expertise specializing in Terraform and OpenTofu for cloud provisioning. Designs modular, scalable infrastructure with proper state management, remote backends, and GitOps-driven automation pipelines.

## When to Use

- Provisioning new cloud infrastructure (VPCs, EKS, RDS)
- Refactoring monolithic Terraform code into reusable modules
- Implementing "GitOps" for infrastructure (Atlantis/TFC)
- Managing remote state, locking, and backend configuration
- Writing custom providers or complex HCL logic (loops, conditionals)
- Migrating/importing existing manual infrastructure into Terraform

## Examples

### Example 1: Multi-Cloud Landing Zone

**Scenario:** Building a secure, compliant multi-cloud landing zone.

**Implementation:**
1. Created reusable modules for VPC, IAM, security groups
2. Implemented remote state with S3 backend and DynamoDB locking
3. Added variable validation and preconditions
4. Implemented cost estimation and budget alerts
5. Set up Terraform Cloud for state management

**Results:**
- Infrastructure provisioning reduced from weeks to hours
- 100% consistency across environments
- Security compliance automated
- 40% reduction in cloud costs through optimization

### Example 2: Kubernetes Platform with EKS

**Scenario:** Building a production-ready Kubernetes platform.

**Implementation:**
1. Created EKS module with managed node groups
2. Implemented RBAC and service accounts
3. Added network policies and security groups
4. Configured secrets management with Vault integration
5. Set up monitoring and observability

**Results:**
- Platform deployment in under 30 minutes
- Zero configuration drift
- Built-in security controls
- Clear upgrade path for K8s versions

### Example 3: Legacy Infrastructure Migration

**Scenario:** Importing manually provisioned infrastructure into Terraform.

**Implementation:**
1. Used terraform import for existing resources
2. Created corresponding Terraform configurations
3. Implemented state mv for resource reorganization
4. Verified no changes during import
5. Established Terraform as source of truth

**Results:**
- 200+ resources migrated to Terraform
- Infrastructure now version controlled
- Enables infrastructure as code workflows
- Improved audit and compliance

## Best Practices

### State Management

- **Remote Backend**: Always use remote state (S3, GCS, Terraform Cloud)
- **State Locking**: Prevent concurrent modifications
- **State Isolation**: Separate state for environments
- **Backup**: Enable state versioning

### Module Development

- **Single Responsibility**: Each module does one thing well
- **Version Pinning**: Lock module versions
- **Documentation**: Document inputs, outputs, behavior
- **Testing**: Test modules before publishing

### Code Quality

- **Formatting**: Use terraform fmt consistently
- **Validation**: Run terraform validate
- **Linting**: Use tflint for provider-specific issues
- **Security Scanning**: Use tfsec/checkov

### Collaboration

- **Code Review**: All changes reviewed before merge
- **Workspace Strategy**: Use workspaces for environment isolation
- **Variable Management**: Use variable files, not hardcoding
- **Output Documentation**: Document important outputs

---
---

## 2. Decision Framework

### State Management Strategy

| Scale | Strategy | Backend |
|-------|----------|---------|
| **Individual** | Local State | `local` (Not recommended for prod) |
| **Small Team** | Remote State + Locking | `s3` + DynamoDB (AWS) / `azurerm` (Azure) |
| **Enterprise** | Managed State + Runs | **Terraform Cloud** / **spacelift** / **env0** |
| **GitOps** | PR-driven Runs | **Atlantis** (Self-hosted) |

### Module Architecture

```
What are you building?
│
├─ **Root Module** (The "Glue")
│  ├─ `main.tf`: Instantiates child modules
│  ├─ `providers.tf`: Provider config
│  └─ `backend.tf`: State config
│
├─ **Child Modules** (Reusable)
│  ├─ **Resource Modules**: Wraps single resource (e.g., `s3-secure-bucket`)
│  │  └─ Enforces tagging, encryption, logging defaults.
│  │
│  └─ **Infrastructure Modules**: Logical group (e.g., `vpc-with-peering`)
│     └─ Combines VPC, Subnets, Route Tables, NAT Gateways.
│
└─ **Composition** (Terragrunt/Workspaces)
   ├─ `prod/`
   ├─ `stage/`
   └─ `dev/`
```

### Terraform vs. The World

| Tool | Approach | Best For |
|------|----------|----------|
| **Terraform** | HCL (Declarative) | Industry standard, massive ecosystem. |
| **Pulumi** | General Purpose Lang (TS/Py) | Devs who hate HCL, dynamic logic. |
| **Crossplane** | K8s Custom Resources | Control planes, self-service platforms. |
| **CloudFormation** | YAML/JSON | AWS purists (drift detection is native). |

**Red Flags → Escalate to `security-engineer`:**
- Hardcoded AWS keys in `provider` block
- State files stored in git (`terraform.tfstate`)
- Security Groups allowing `0.0.0.0/0` on SSH/RDP
- S3 buckets public by default

---
---

## 3. Core Workflows

### Workflow 1: Production AWS VPC (Modular)

**Goal:** Create a 3-tier VPC network using the community module.

**Steps:**

1.  **Dependency Definition (`versions.tf`)**
    ```hcl
    terraform {
      required_version = ">= 1.5.0"
      required_providers {
        aws = {
          source  = "hashicorp/aws"
          version = "~> 5.0"
        }
      }
    }
    ```

2.  **Implementation (`main.tf`)**
    ```hcl
    module "vpc" {
      source = "terraform-aws-modules/vpc/aws"
      version = "5.5.1"

      name = "prod-vpc"
      cidr = "10.0.0.0/16"

      azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
      private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
      public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

      enable_nat_gateway = true
      single_nat_gateway = false # High Availability
      enable_vpn_gateway = false

      tags = {
        Environment = "Production"
        Terraform   = "true"
      }
    }
    ```

3.  **Outputs (`outputs.tf`)**
    ```hcl
    output "vpc_id" {
      description = "The ID of the VPC"
      value       = module.vpc.vpc_id
    }
    ```

---
---

### Workflow 3: Importing Existing Infrastructure

**Goal:** Bring a manually created EC2 instance under Terraform control.

**Steps:**

1.  **Identify Resource ID**
    -   AWS Console → EC2 → Instance ID: `i-0123456789abcdef0`

2.  **Write Terraform Code**
    ```hcl
    resource "aws_instance" "legacy_server" {
      ami           = "ami-0c55b159cbfafe1f0"
      instance_type = "t2.micro"
      # Fill in other known details...
    }
    ```

3.  **Run Import**
    ```bash
    terraform import aws_instance.legacy_server i-0123456789abcdef0
    ```
    *(Or use `import` block in TF 1.5+)*
    ```hcl
    import {
      to = aws_instance.legacy_server
      id = "i-0123456789abcdef0"
    }
    ```

4.  **Reconcile**
    -   Run `terraform plan`.
    -   Update code to match the state until "No changes" is reported.

---
---

## 5. Anti-Patterns & Gotchas

### ❌ Anti-Pattern 1: Monolithic State File

**What it looks like:**
-   One `main.tf` controlling VPC, Database, EKS, and 50 Microservices.
-   `terraform plan` takes 10 minutes.

**Why it fails:**
-   **Blast Radius:** One error breaks everything.
-   **Performance:** API rate limits (AWS Throttling).
-   **Locking:** Dev A blocks Dev B.

**Correct approach:**
-   **Split State:** Separate `network`, `data`, `app-cluster`.
-   Use `terraform_remote_state` data source to read outputs from other layers.

### ❌ Anti-Pattern 2: Hardcoding Environments

**What it looks like:**
-   `vpc-prod.tf`, `vpc-dev.tf` files with duplicated code.

**Why it fails:**
-   Drift between environments.
-   Double maintenance.

**Correct approach:**
-   **Workspaces:** Use `terraform workspace` with `var.environment`.
-   **Tfvars:** `prod.tfvars` vs `dev.tfvars`.
-   **Modules:** Reuse the same logic, pass different variables.

### ❌ Anti-Pattern 3: Ignoring `.gitignore`

**What it looks like:**
-   Committing `.terraform/` directory (plugins).
-   Committing `terraform.tfvars` (secrets).

**Why it fails:**
-   Repo bloat.
-   Security leak.

**Correct approach:**
-   Standard `.gitignore` for Terraform:
    ```
    .terraform/
    *.tfstate
    *.tfstate.backup
    *.tfvars
    .terraform.lock.hcl (Commit this one!)
    ```

---
---

## 7. Quality Checklist

**Code Quality:**
-   [ ] **Formatting:** Run `terraform fmt -recursive`.
-   [ ] **Validation:** Run `terraform validate`.
-   [ ] **Linting:** Run `tflint` for provider-specific issues.
-   [ ] **Docs:** Generate README using `terraform-docs`.

**Security:**
-   [ ] **Secrets:** No plain text secrets (Use KMS/Vault/Secrets Manager).
-   [ ] **Encryption:** `encrypted = true` on all storage (EBS, S3, RDS).
-   [ ] **Public Access:** Locked down (S3 Block Public Access).

**Reliability:**
-   [ ] **State:** Remote backend configured with locking.
-   [ ] **Versions:** Provider and Terraform versions pinned (e.g., `~> 5.0`).
-   [ ] **Cleanup:** `destroy` provisioners tested (or protection enabled for DBs).

## Anti-Patterns

### State Management Anti-Patterns

- **Local State**: Using local state files - always use remote backends
- **State Drift**: Manual changes outside Terraform - use only Terraform for changes
- **State Lock Contention**: No state locking - implement proper locking
- **State Corruption**: Editing state files manually - never manually edit state

### Module Anti-Patterns

- **Monolithic Modules**: Large, unwieldy modules - split into focused modules
- **Hardcoded Values**: Using values instead of variables - parameterize everything
- **Module Version Chaos**: No version pinning - pin module versions
- **Deep Module Nesting**: Over-nested module structures - keep module hierarchy flat

### Resource Anti-Patterns

- **Resource Spam**: Many small resources instead of patterns - use resource grouping
- **Lifecycle Lock**: Resources that can't update - avoid create_before_destroy conflicts
- **Ignored Changes**: Overusing ignore_changes - understand and manage changes
- **Sensitive Data Exposure**: Plain text secrets in state - use sensitive flag

### Code Organization Anti-Patterns

- **Flat Structure**: No directory organization - use modular structure
- **Duplication**: Repeated code blocks - use modules and for_each
- **No Formatting**: Unformatted HCL code - use terraform fmt
- **Missing Documentation**: undocumented modules - document all inputs/outputs

Overview

This skill provides Infrastructure as Code expertise for Terraform and OpenTofu, focusing on modular, secure, and scalable cloud provisioning. I help design reusable modules, manage remote state and locking, and implement GitOps-driven pipelines for consistent deployments. The skill emphasizes best practices for state management, testing, and security to reduce drift and shorten delivery times.

How this skill works

I inspect existing HCL, Terraform/OpenTofu configurations, and remote backend setups to identify risks and refactor opportunities. The skill proposes module boundaries, backend recommendations (S3/GCS/Terraform Cloud), workspace strategies, and CI/CD integration patterns (Atlantis, Terraform Cloud runs). I also provide concrete migration steps for importing resources and reconciling state to make Terraform the single source of truth.

When to use it

Provision new cloud infrastructure (VPCs, EKS, RDS) with repeatable modules
Refactor monolithic Terraform into reusable, single-responsibility modules
Implement remote state, locking, and environment isolation for teams
Set up GitOps-driven infrastructure pipelines (Atlantis, Terraform Cloud)
Migrate manual resources into Terraform using terraform import and state moves
Build secure, production-ready Kubernetes platforms or landing zones

Best practices

Always use a remote backend with state locking and versioning (S3/GCS/Terraform Cloud)
Design modules with single responsibility, clear inputs/outputs, and version pinning
Enforce code quality: terraform fmt, terraform validate, tflint, and security scanning (tfsec/checkov)
Separate state per environment and use workspaces or composition tools to limit blast radius
Keep secrets out of code and state; use KMS/Vault/Secrets Manager and .gitignore for sensitive files

Example use cases

Create a multi-cloud landing zone with reusable VPC/IAM/security modules and centralized state management
Build a production EKS platform with RBAC, network policies, Vault integration, and observability
Import legacy EC2/RDS/Networking into Terraform and reconcile state until no-drift
Split a monolithic repo into layered states (network, data, apps) and wire outputs via terraform_remote_state
Implement PR-driven runs with Atlantis or Terraform Cloud to enforce review and automated applies

FAQ

How do I avoid a single monolithic state file?

Split infrastructure into logical layers (network, data, apps), use separate remote backends per layer, and share outputs via terraform_remote_state or remote state data sources.

What backend should a small team use?

Use an S3 (or GCS) remote backend with state locking (DynamoDB for AWS) for small teams; consider Terraform Cloud or a managed service for enterprise needs.