cloud-infrastructure skill

This skill helps you design multi-cloud architectures and implement IaC with Terraform, optimizing costs and enabling resilient deployments.

npx playbooks add skill 89jobrien/steve --skill cloud-infrastructure

SKILL.md

---
name: cloud-infrastructure
description: Cloud infrastructure design and deployment patterns for AWS, Azure, and
  GCP. Use when designing cloud architectures, implementing IaC with Terraform, optimizing
  costs, or setting up multi-region deployments.
author: Joseph OBrien
status: unpublished
updated: '2025-12-23'
version: 1.0.1
tag: skill
type: skill
---

# Cloud Infrastructure

Comprehensive cloud infrastructure skill covering multi-cloud architecture, Infrastructure as Code, cost optimization, and production deployment patterns.

## When to Use This Skill

- Designing cloud architecture for new applications
- Implementing Infrastructure as Code (Terraform, CloudFormation, Pulumi)
- Cost optimization and resource right-sizing
- Multi-region and high-availability deployments
- Cloud migration planning
- Security and compliance implementation
- Auto-scaling and performance optimization

## Cloud Architecture Patterns

### Compute Patterns

| Pattern | AWS | Azure | GCP | Use Case |
|---------|-----|-------|-----|----------|
| Serverless | Lambda | Functions | Cloud Functions | Event-driven, variable load |
| Containers | ECS/EKS | AKS | GKE | Microservices, consistent env |
| VMs | EC2 | Virtual Machines | Compute Engine | Legacy apps, full control |
| Batch | Batch | Batch | Batch | Large-scale processing |

### Storage Patterns

| Type | AWS | Azure | GCP | Use Case |
|------|-----|-------|-----|----------|
| Object | S3 | Blob Storage | Cloud Storage | Static files, backups |
| Block | EBS | Managed Disks | Persistent Disk | Database storage |
| File | EFS | Azure Files | Filestore | Shared file systems |
| Archive | S3 Glacier | Blob Storage Archive tier | Cloud Storage Coldline/Archive | Long-term retention |

### Database Patterns

| Type | AWS | Azure | GCP | Use Case |
|------|-----|-------|-----|----------|
| Relational | RDS, Aurora | SQL Database | Cloud SQL | ACID transactions |
| NoSQL | DynamoDB | Cosmos DB | Firestore | Flexible schema |
| Cache | ElastiCache | Cache for Redis | Memorystore | Session, caching |
| Data Warehouse | Redshift | Synapse | BigQuery | Analytics |

## Infrastructure as Code

### Terraform Best Practices

**Project Structure:**

```
infrastructure/
├── modules/
│   ├── networking/
│   ├── compute/
│   └── database/
├── environments/
│   ├── dev/
│   ├── staging/
│   └── prod/
├── main.tf
├── variables.tf
├── outputs.tf
└── versions.tf
```

**State Management:**

- Use remote state (S3, Azure Blob, GCS)
- Enable state locking (DynamoDB with the S3 backend, blob leases on Azure; the GCS backend locks natively)
- Separate state per environment
- Never commit state files
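
One way to keep state separate per environment is to derive the backend settings from the environment name and pass them to `terraform init` as `-backend-config` arguments. The sketch below assumes an S3 backend; the bucket, region, and lock table names are hypothetical placeholders, not values from this skill.

```python
# Sketch: per-environment settings for a Terraform S3 backend.
# Bucket, region, and lock table names below are hypothetical placeholders.
ENVIRONMENTS = ("dev", "staging", "prod")


def backend_config(env: str) -> dict:
    """Return S3 backend settings that give each environment its own state object."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env}")
    return {
        "bucket": "example-terraform-state",          # hypothetical bucket
        "key": f"{env}/terraform.tfstate",            # separate state per environment
        "region": "us-east-1",                        # assumed region
        "dynamodb_table": "example-terraform-locks",  # hypothetical lock table
        "encrypt": "true",
    }


def init_args(env: str) -> list:
    """Build the `terraform init` argument list using partial backend configuration."""
    args = ["terraform", "init"]
    for name, value in backend_config(env).items():
        args.append(f"-backend-config={name}={value}")
    return args


if __name__ == "__main__":
    print(" ".join(init_args("staging")))
```

Keeping the settings out of the committed `.tf` files means the same root module can be initialized against any environment without editing code.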

**Module Design:**

- Single responsibility per module
- Expose minimal required variables
- Document inputs/outputs
- Version modules with git tags
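
Versioning with git tags pays off when callers pin a module to a specific tag. A minimal sketch of building such a pinned source string is shown below; it uses Terraform's `git::<url>//<subdir>?ref=<tag>` source form, while the repository URL, the helper name, and the `vMAJOR.MINOR.PATCH` tag convention are assumptions for illustration.

```python
import re

# Assumed tag convention: v<major>.<minor>.<patch>, e.g. v1.2.0
TAG_PATTERN = re.compile(r"^v\d+\.\d+\.\d+$")


def module_source(repo_url: str, subdir: str, tag: str) -> str:
    """Build a Terraform git module source pinned to a release tag."""
    if not TAG_PATTERN.match(tag):
        raise ValueError(f"tag does not look like a release tag: {tag}")
    return f"git::{repo_url}//{subdir}?ref={tag}"


if __name__ == "__main__":
    # example.com is a placeholder repository host
    print(module_source("https://example.com/infra.git", "modules/networking", "v1.2.0"))
```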

## Cost Optimization

**Compute Savings:**

- Reserved Instances and committed-use discounts (1- or 3-year commitment): roughly 30-60% savings over on-demand pricing
- Spot/Preemptible instances: 60-90% savings over on-demand pricing for interruptible workloads
- Right-sizing: Match instance size to actual usage
- Auto-scaling: Scale down during low usage
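
To compare these options for a given fleet, the percentages can be applied to the on-demand rate. The sketch below is plain arithmetic; the hourly rate, fleet size, and the specific discount values are made-up inputs picked from within the ranges above, not quoted prices.

```python
HOURS_PER_MONTH = 730  # common approximation of hours in a month


def monthly_cost(hourly_rate: float, instances: int, discount: float = 0.0) -> float:
    """Monthly cost of a fleet at a given on-demand rate and fractional discount."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return hourly_rate * HOURS_PER_MONTH * instances * (1.0 - discount)


if __name__ == "__main__":
    rate, fleet = 0.10, 10  # hypothetical on-demand rate (USD/hour) and instance count
    for label, discount in [("on-demand", 0.0), ("reserved", 0.40), ("spot", 0.70)]:
        print(f"{label:>10}: {monthly_cost(rate, fleet, discount):.2f} USD/month")
```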

**Storage Savings:**

- Lifecycle policies: Auto-transition to cheaper tiers
- Compression: Reduce storage footprint
- Deduplication: Eliminate redundant data
- Delete unused resources: Orphaned volumes, snapshots
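
A lifecycle policy is essentially a rule that maps object age to a storage tier. The sketch below illustrates the idea; the tier names and the 30- and 90-day thresholds are assumptions chosen for the example, and real policies are configured in the provider rather than in application code.

```python
def storage_tier(days_since_last_access: int) -> str:
    """Pick a storage tier from object age, using illustrative thresholds."""
    if days_since_last_access < 0:
        raise ValueError("age cannot be negative")
    if days_since_last_access < 30:
        return "standard"
    if days_since_last_access < 90:
        return "infrequent-access"
    return "archive"


if __name__ == "__main__":
    for age in (5, 45, 400):
        print(age, "days ->", storage_tier(age))
```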

**Network Savings:**

- Use CDN for static content
- Optimize data transfer paths
- Use private endpoints
- Compress API responses

## High Availability Patterns

### Multi-AZ Deployment

- Deploy across 2-3 availability zones
- Use load balancers for distribution
- Database replication across AZs
- Automatic failover configuration
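
The benefit of adding zones can be estimated with a simple model: if each zone is available with probability `a` and zones fail independently, at least one of `n` zones is up with probability `1 - (1 - a)^n`. The sketch below applies that formula; independence is an optimistic assumption, and the 99% per-zone figure is a made-up input.

```python
def combined_availability(per_zone: float, zones: int) -> float:
    """Probability that at least one zone is up, assuming independent failures."""
    if not 0.0 <= per_zone <= 1.0:
        raise ValueError("per_zone must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    return 1.0 - (1.0 - per_zone) ** zones


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} zone(s): {combined_availability(0.99, n):.6f}")
```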

### Multi-Region Deployment

- Active-active or active-passive
- DNS-based routing (Route 53, Traffic Manager, Cloud DNS routing policies)
- Data replication strategy
- Disaster recovery procedures
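
In an active-passive setup, routing reduces to picking the highest-priority region that passes health checks. A minimal sketch of that selection logic follows; the region names are examples, and in practice the DNS service evaluates the health checks and performs the failover.

```python
from typing import Iterable, Optional, Sequence


def pick_region(priority: Sequence[str], healthy: Iterable[str]) -> Optional[str]:
    """Return the first region in priority order that is currently healthy."""
    healthy_set = set(healthy)
    for region in priority:
        if region in healthy_set:
            return region
    return None  # no healthy region: trigger disaster recovery procedures


if __name__ == "__main__":
    order = ["us-east-1", "eu-west-1"]  # primary first, then standby
    print(pick_region(order, {"us-east-1", "eu-west-1"}))
    print(pick_region(order, {"eu-west-1"}))
```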

### Resilience Patterns

- Circuit breakers for external dependencies
- Retry with exponential backoff
- Bulkhead isolation
- Graceful degradation
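
Retry with exponential backoff can be sketched in a few lines. The version below doubles the delay ceiling after each failure, caps it, and sleeps for a random fraction of it (full jitter) so that many clients do not retry in lockstep. The function names and default values are illustrative assumptions.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 0.5, cap: float = 30.0) -> List[float]:
    """Upper bound of the wait before each retry: base * 2**n, capped at `cap`."""
    return [min(cap, base * 2 ** n) for n in range(retries)]


def retry(
    operation: Callable[[], T],
    attempts: int = 5,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `attempts` calls have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts - 1, base, cap)
    for n in range(attempts):
        try:
            return operation()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time up to the current ceiling.
            sleep(random.uniform(0, delays[n]))
    raise AssertionError("unreachable")
```

Catching every `Exception` keeps the sketch short; production code would usually retry only on errors known to be transient, such as timeouts or throttling responses.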

## Security Best Practices

### Identity & Access

- Principle of least privilege
- Use IAM roles, not long-term credentials
- Enable MFA for privileged accounts
- Regular access reviews
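
Access reviews can be partly automated by scanning policies for overly broad grants. The sketch below flags `Allow` statements that use wildcards. It assumes the AWS IAM JSON policy layout (`Statement`, `Effect`, `Action`, `Resource`); the helper name is hypothetical and the check is deliberately coarse.

```python
from typing import Any, Dict, List


def _as_list(value: Any) -> List[Any]:
    """IAM allows a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def wildcard_statements(policy: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return Allow statements that grant a wildcard action or resource."""
    flagged = []
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action"))
        resources = _as_list(stmt.get("Resource"))
        broad_action = any(a == "*" or a.endswith(":*") for a in actions)
        broad_resource = "*" in resources
        if broad_action or broad_resource:
            flagged.append(stmt)
    return flagged
```

A statement flagged here is not necessarily wrong, since some actions only accept `*` as a resource, so treat the output as a review queue rather than a verdict.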

### Network Security

- VPC/VNet isolation
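- Security groups, network security groups, or firewall rules as firewalls
- Private subnets for backend services
- VPN or dedicated links (Direct Connect, ExpressRoute, Cloud Interconnect) for hybrid connectivity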

### Data Protection

- Encryption at rest (KMS)
- Encryption in transit (TLS)
- Key rotation policies
- Backup and recovery testing

## Monitoring & Observability

### Key Metrics

- CPU, Memory, Disk utilization
- Network throughput and latency
- Error rates and types
- Cost per service/team

### Alerting Strategy

- Set thresholds based on baselines
- Alert on symptoms, not causes
- Runbooks for each alert
- Escalation paths defined
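
A threshold based on a baseline can be as simple as the historical mean plus a multiple of the standard deviation. The sketch below shows that calculation; the multiplier `k = 3` and the sample values are assumptions, and metrics with strong daily or weekly cycles usually need a baseline per time window.

```python
import statistics
from typing import Sequence


def alert_threshold(baseline: Sequence[float], k: float = 3.0) -> float:
    """Threshold = mean + k * sample standard deviation of the baseline."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    return statistics.mean(baseline) + k * statistics.stdev(baseline)


def should_alert(value: float, baseline: Sequence[float], k: float = 3.0) -> bool:
    """True when the observed value exceeds the baseline-derived threshold."""
    return value > alert_threshold(baseline, k)


if __name__ == "__main__":
    error_rates = [10, 12, 11, 13, 9]  # hypothetical errors per minute
    print(f"threshold: {alert_threshold(error_rates):.2f}")
    print("alert at 20:", should_alert(20, error_rates))
```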

## Reference Files

- **`references/terraform_patterns.md`** - IaC patterns and examples
- **`references/cost_optimization.md`** - Detailed cost reduction strategies

## Integration with Other Skills

- **security-engineering** - For security architecture
- **network-engineering** - For network design
- **performance** - For optimization strategies
- **devops-runbooks** - For operational procedures

Overview

This skill provides practical cloud infrastructure design and deployment patterns for AWS, Azure, and GCP. It focuses on multi-cloud architecture, Infrastructure as Code (Terraform), cost optimization, and production-ready high-availability patterns. Use it to plan architecture, implement IaC, reduce costs, and harden cloud deployments.

How this skill works

The skill inspects architecture goals and recommends patterns for compute, storage, databases, networking, and resilience across AWS, Azure, and GCP. It maps use cases (serverless, containers, VMs, batch) to cloud services and supplies Terraform project layout, state management, and module design guidance. It also evaluates cost-saving levers, HA/multi-region strategies, security controls, and monitoring practices.

When to use it

- Designing cloud architecture for a new application or service
- Implementing Infrastructure as Code with Terraform, CloudFormation, or Pulumi
- Planning multi-region or high-availability deployments and DR
- Optimizing cloud costs and right-sizing resources
- Defining security, compliance, and operational monitoring requirements

Best practices

- Structure IaC with reusable, single-responsibility modules and per-environment state
- Use remote state with locking and separate state files for dev/staging/prod
- Apply least privilege with IAM roles, enable MFA, and run regular access reviews
- Deploy across multiple AZs, use load balancers, and replicate databases for failover
- Automate cost controls: lifecycle policies, reserved/spot instances, right-sizing, and autoscaling

Example use cases

- Design an event-driven serverless backend using Lambda/Functions/Cloud Functions with S3/Blob/Cloud Storage for objects
- Build a containerized microservices platform on EKS/AKS/GKE with shared file storage and managed databases
- Migrate an on-premises database to managed Cloud SQL/RDS/Aurora with automated backups and multi-AZ replication
- Create a Terraform repository layout with modules for networking, compute, and databases plus remote state and locking
- Implement cost optimization: identify idle resources, switch suitable workloads to spot/preemptible instances, apply lifecycle rules for archives

FAQ

Which pattern should I pick for variable workloads with unpredictable traffic?

Choose serverless or containers with autoscaling. Serverless suits event-driven workloads where you want minimal operational overhead; containers suit long-running microservices that need fine-grained control.

How should I manage Terraform state across environments?

Use remote state storage (S3/Blob/GCS) with state locking (DynamoDB/blob lease) and keep separate state per environment. Never commit state files to source control. Version modules with git tags.