
infrastructure-cost-optimization skill


This skill helps you reduce cloud costs by rightsizing resources, leveraging reserved and spot instances, and eliminating waste across environments.

npx playbooks add skill aj-geddes/useful-ai-prompts --skill infrastructure-cost-optimization

---
name: infrastructure-cost-optimization
description: Optimize cloud infrastructure costs through resource rightsizing, reserved instances, spot instances, and waste reduction strategies.
---

# Infrastructure Cost Optimization

## Overview

Reduce infrastructure costs through intelligent resource allocation, reserved instances, spot instances, and continuous optimization without sacrificing performance.

## When to Use

- Cloud cost reduction
- Budget management and tracking
- Resource utilization optimization
- Multi-environment cost allocation
- Waste identification and elimination
- Reserved instance planning
- Spot instance integration

## Implementation Examples

### 1. **AWS Cost Optimization Configuration**

```yaml
# cost-optimization-setup.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cost-optimization-scripts
  namespace: operations
data:
  analyze-costs.sh: |
    #!/bin/bash
    set -euo pipefail

    echo "=== AWS Cost Analysis ==="

    # Get daily cost trend
    echo "Daily costs for last 7 days:"
    aws ce get-cost-and-usage \
      --time-period Start=$(date -d '7 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
      --granularity DAILY \
      --metrics "BlendedCost" \
      --group-by Type=DIMENSION,Key=SERVICE \
      --query 'ResultsByTime[*].[TimePeriod.Start,Total.BlendedCost.Amount]' \
      --output table

    # Find unattached resources
    echo -e "\n=== Unattached EBS Volumes ==="
    aws ec2 describe-volumes \
      --filters Name=status,Values=available \
      --query 'Volumes[*].[VolumeId,Size,CreateTime]' \
      --output table

    echo -e "\n=== Unattached Elastic IPs ==="
    aws ec2 describe-addresses \
      --query 'Addresses[?AssociationId==null].[PublicIp,AllocationId]' \
      --output table

    echo -e "\n=== RDS Instances (verify utilization before downsizing) ==="
    aws rds describe-db-instances \
      --query 'DBInstances[?DBInstanceStatus==`available`].[DBInstanceIdentifier,DBInstanceClass,Engine,AllocatedStorage]' \
      --output table

    # Estimate savings with Reserved Instances
    echo -e "\n=== Reserved Instance Savings Potential ==="
    aws ce get-reservation-purchase-recommendation \
      --service "Amazon Elastic Compute Cloud - Compute" \
      --lookback-period THIRTY_DAYS \
      --query 'Recommendations[0].[RecommendationSummary.TotalEstimatedMonthlySavingsAmount,RecommendationSummary.TotalEstimatedMonthlySavingsPercentage]' \
      --output table

  optimize-resources.sh: |
    #!/bin/bash
    set -euo pipefail

    echo "Starting resource optimization..."

    # Remove unattached volumes (destructive: snapshot or review before running in production)
    echo "Removing unattached volumes..."
    aws ec2 describe-volumes \
      --filters Name=status,Values=available \
      --query 'Volumes[*].VolumeId' \
      --output text | \
    while read volume_id; do
      echo "Deleting volume: $volume_id"
      aws ec2 delete-volume --volume-id "$volume_id" 2>/dev/null || true
    done

    # Release unused Elastic IPs
    echo "Releasing unused Elastic IPs..."
    aws ec2 describe-addresses \
      --query 'Addresses[?AssociationId==null].AllocationId' \
      --output text | \
    while read alloc_id; do
      echo "Releasing EIP: $alloc_id"
      aws ec2 release-address --allocation-id "$alloc_id" 2>/dev/null || true
    done

    # Modify RDS to smaller instances
    echo "Analyzing RDS for downsizing..."
    # Implement logic to check CloudWatch metrics and downsize if needed

    echo "Optimization complete"

```

```hcl
# terraform-cost-optimization.tf
resource "aws_instance" "spot" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"

  # Use spot instances for non-critical workloads
  instance_market_options {
    market_type = "spot"

    spot_options {
      max_price                      = "0.05"  # maximum hourly price (USD)
      spot_instance_type             = "one-time"
      instance_interruption_behavior = "terminate"
    }
  }

  tags = {
    Name = "spot-instance"
    CostCenter = "engineering"
  }
}

# Baseline capacity instance. Reserved Instances are a billing construct that
# automatically matches on-demand instances of the same type and region, so
# nothing is configured here beyond consistent tagging.
resource "aws_instance" "reserved" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"

  # Tag for reserved instance matching
  tags = {
    Name = "reserved-instance"
    ReservationType = "reserved"
  }
}

resource "aws_ec2_fleet" "mixed" {
  type = "maintain"

  launch_template_config {
    launch_template_specification {
      launch_template_id = aws_launch_template.app.id
      version            = "$Latest"
    }

    override {
      instance_type     = "t3.medium"
      weighted_capacity = 1
      priority          = 1 # preferred instance type
    }

    override {
      instance_type     = "t3.large"
      weighted_capacity = 2
      priority          = 2
    }

    override {
      instance_type     = "t3a.medium"
      weighted_capacity = 1
      priority          = 3
    }

    override {
      instance_type     = "t3a.large"
      weighted_capacity = 2
      priority          = 4
    }
  }

  target_capacity_specification {
    default_target_capacity_type = "on-demand"
    total_target_capacity        = 10
    on_demand_target_capacity    = 6
    spot_target_capacity         = 4
  }

  tags = {
    Name = "mixed-capacity"
  }
}
```

### 2. **Kubernetes Cost Optimization**

```yaml
# k8s-cost-optimization.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cost-optimization-policies
  namespace: kube-system
data:
  policies.yaml: |
    # Resource quotas per namespace
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: compute-quota
      namespace: production
    spec:
      hard:
        requests.cpu: "100"
        requests.memory: "200Gi"
        limits.cpu: "200"
        limits.memory: "400Gi"
        pods: "500"
      scopeSelector:
        matchExpressions:
          - operator: In
            scopeName: PriorityClass
            values: ["high", "medium"]

---
# Pod Disruption Budget for cost-effective scaling
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cost-optimized-pdb
  namespace: production
spec:
  minAvailable: 1
  selector:
    matchLabels:
      tier: backend

---
# Spot/preemptible nodes carry a taint. In practice the taint is applied by the
# node pool or provisioner; the Node object below is illustrative only.
apiVersion: v1
kind: Node
metadata:
  name: spot-node-1
spec:
  taints:
    - key: cloud.google.com/gke-preemptible
      value: "true"
      effect: NoSchedule

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cost-optimized-app
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      # Tolerate spot instances
      tolerations:
        - key: cloud.google.com/gke-preemptible
          operator: Equal
          value: "true"
          effect: NoSchedule

      # Prefer nodes with lower cost
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: karpenter.sh/capacity-type
                    operator: In
                    values: ["spot"]

      containers:
        - name: app
          image: myapp:latest
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
```
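The resource requests in the Deployment above translate directly into money, which is why rightsizing them matters. The sketch below estimates the monthly cost of a workload's requests; the per-unit prices and the `monthly_request_cost` helper are hypothetical placeholders, so substitute your provider's actual rates (from its pricing API or billing export) before relying on the numbers.

```python
# pod_cost_estimate.py - rough monthly cost of a Deployment's resource requests.
# PRICE_* values are assumed placeholder rates, not real provider pricing.

HOURS_PER_MONTH = 730
PRICE_PER_VCPU_HOUR = 0.031  # assumed on-demand rate, USD
PRICE_PER_GIB_HOUR = 0.004   # assumed on-demand rate, USD

def monthly_request_cost(replicas: int, cpu_millicores: int, memory_mib: int,
                         spot_discount: float = 0.0) -> float:
    """Estimate the monthly cost of a workload's resource requests.

    spot_discount is the fraction saved by running on spot capacity
    (e.g. 0.7 for a 70% discount).
    """
    vcpu = cpu_millicores / 1000
    gib = memory_mib / 1024
    hourly = replicas * (vcpu * PRICE_PER_VCPU_HOUR + gib * PRICE_PER_GIB_HOUR)
    return hourly * HOURS_PER_MONTH * (1 - spot_discount)

# The Deployment above: 3 replicas, 100m CPU / 128Mi memory requests
on_demand = monthly_request_cost(3, 100, 128)
on_spot = monthly_request_cost(3, 100, 128, spot_discount=0.7)
print(f"on-demand: ${on_demand:.2f}/mo, spot: ${on_spot:.2f}/mo")
```

Running the same requests through the spot discount makes the trade-off concrete before you touch any scheduling configuration.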

### 3. **Cost Monitoring Dashboard**

```python
# cost-monitoring.py
import boto3
import json
from datetime import datetime, timedelta

class CostOptimizer:
    def __init__(self):
        self.ce_client = boto3.client('ce')
        self.ec2_client = boto3.client('ec2')
        self.rds_client = boto3.client('rds')

    def get_daily_costs(self, days=30):
        """Get daily costs for past N days"""
        end_date = datetime.now().date()
        start_date = end_date - timedelta(days=days)

        response = self.ce_client.get_cost_and_usage(
            TimePeriod={
                'Start': str(start_date),
                'End': str(end_date)
            },
            Granularity='DAILY',
            Metrics=['BlendedCost'],
            GroupBy=[
                {'Type': 'DIMENSION', 'Key': 'SERVICE'}
            ]
        )

        return response

    def find_underutilized_instances(self):
        """Find EC2 instances with low CPU usage"""
        cloudwatch = boto3.client('cloudwatch')
        instances = []

        ec2_instances = self.ec2_client.describe_instances()
        for reservation in ec2_instances['Reservations']:
            for instance in reservation['Instances']:
                instance_id = instance['InstanceId']

                # Check CPU utilization
                response = cloudwatch.get_metric_statistics(
                    Namespace='AWS/EC2',
                    MetricName='CPUUtilization',
                    Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                    StartTime=datetime.now() - timedelta(days=7),
                    EndTime=datetime.now(),
                    Period=3600,
                    Statistics=['Average']
                )

                if response['Datapoints']:
                    avg_cpu = sum(d['Average'] for d in response['Datapoints']) / len(response['Datapoints'])
                    if avg_cpu < 10:  # Less than 10% average
                        instances.append({
                            'InstanceId': instance_id,
                            'Type': instance['InstanceType'],
                            'AverageCPU': avg_cpu,
                            'Recommendation': 'Downsize or terminate'
                        })

        return instances

    def estimate_reserved_instance_savings(self):
        """Estimate potential savings from reserved instances"""
        response = self.ce_client.get_reservation_purchase_recommendation(
            Service='Amazon Elastic Compute Cloud - Compute',
            LookbackPeriod='THIRTY_DAYS',
            PageSize=100
        )

        total_savings = 0
        for recommendation in response.get('Recommendations', []):
            summary = recommendation['RecommendationSummary']
            savings = float(summary['TotalEstimatedMonthlySavingsAmount'])
            total_savings += savings

        return total_savings

    def generate_report(self):
        """Generate comprehensive cost optimization report"""
        print("=== Cost Optimization Report ===\n")

        # Daily costs
        print("Daily Costs:")
        costs = self.get_daily_costs(7)
        for result in costs['ResultsByTime']:
            date = result['TimePeriod']['Start']
            total = result['Total']['BlendedCost']['Amount']
            print(f"  {date}: ${total}")

        # Underutilized instances
        print("\nUnderutilized Instances:")
        underutilized = self.find_underutilized_instances()
        for instance in underutilized:
            print(f"  {instance['InstanceId']}: {instance['AverageCPU']:.1f}% CPU - {instance['Recommendation']}")

        # Reserved instance savings
        print("\nReserved Instance Savings Potential:")
        savings = self.estimate_reserved_instance_savings()
        print(f"  Estimated Monthly Savings: ${savings:.2f}")

# Usage
if __name__ == '__main__':
    optimizer = CostOptimizer()
    optimizer.generate_report()
```

## Cost Optimization Strategies

### ✅ DO
- Use reserved instances for baseline
- Leverage spot instances
- Right-size resources
- Monitor cost trends
- Implement auto-scaling
- Compare pricing across regions
- Tag resources consistently
- Schedule non-essential resources
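Scheduling non-essential resources is easy to automate. A minimal sketch, assuming instances opt in via a `Schedule=office-hours` tag (a naming convention invented for this example, not an AWS feature) and the job runs periodically from cron or an EventBridge rule:

```python
# stop_after_hours.py - stop tagged dev/test instances outside business hours.
# The Schedule=office-hours tag is a hypothetical convention for this sketch.
from datetime import datetime

OFFICE_HOURS = range(8, 19)  # keep instances running 08:00-18:59 local time

def outside_office_hours(now: datetime) -> bool:
    """True on weekends or outside the office-hours window."""
    return now.weekday() >= 5 or now.hour not in OFFICE_HOURS

def stop_scheduled_instances(ec2=None, now=None):
    """Stop running instances that opted in via the Schedule tag."""
    now = now or datetime.now()
    if not outside_office_hours(now):
        return []
    if ec2 is None:
        import boto3  # deferred so the time logic is testable without AWS creds
        ec2 = boto3.client('ec2')
    resp = ec2.describe_instances(Filters=[
        {'Name': 'tag:Schedule', 'Values': ['office-hours']},
        {'Name': 'instance-state-name', 'Values': ['running']},
    ])
    ids = [inst['InstanceId']
           for res in resp['Reservations'] for inst in res['Instances']]
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return ids
```

Stopping a development fleet for nights and weekends removes roughly two thirds of its on-demand hours, which is often a larger saving than rightsizing it.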

### ❌ DON'T
- Over-provision resources
- Ignore unused resources
- Neglect cost monitoring
- Run all on-demand
- Forget to release EIPs
- Mix cost centers
- Ignore savings opportunities
- Deploy without budgets

## Cost Saving Opportunities

- **Reserved Instances**: 40-70% savings
- **Spot Instances**: 70-90% savings
- **Committed Use Discounts**: 25-55% savings
- **Right-sizing**: 10-30% savings
- **Resource cleanup**: 5-20% savings
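These ranges apply to different slices of your bill, so the blended saving is a weighted average rather than a sum. A back-of-envelope sketch, with spend shares that are purely illustrative:

```python
# savings_model.py - blended savings estimate from per-category discounts.
# The allocation shares below are hypothetical; use your own spend breakdown.

def blended_savings(allocation: dict, discount: dict) -> float:
    """Weighted-average discount across spend categories.

    allocation: fraction of total spend per category (must sum to 1.0).
    discount:   fractional saving applied to that category.
    """
    assert abs(sum(allocation.values()) - 1.0) < 1e-9
    return sum(share * discount[k] for k, share in allocation.items())

# Example: half the spend covered by RIs (55%, mid-range above), a fifth on
# spot (80%), and the rest left on-demand.
allocation = {'reserved': 0.5, 'spot': 0.2, 'on_demand': 0.3}
discount = {'reserved': 0.55, 'spot': 0.80, 'on_demand': 0.0}
print(f"blended saving: {blended_savings(allocation, discount):.0%}")
```

Even a conservative mix like this lands well above what any single tactic delivers on its own, which is why the strategies are meant to be combined.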

## Resources

- [AWS Cost Optimization](https://aws.amazon.com/architecture/cost-optimization/)
- [GCP Cost Optimization](https://cloud.google.com/cost-management)
- [Azure Cost Management](https://docs.microsoft.com/en-us/azure/cost-management-billing/)
- [Kubernetes Resource Management](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/)

Overview

This skill helps teams reduce cloud infrastructure spend by applying rightsizing, reserved and spot instance strategies, and waste elimination. It provides actionable checks, automation snippets, and monitoring guidance to lower costs without degrading performance. The goal is continuous cost control through detection, remediation, and capacity planning.

How this skill works

The skill inspects cloud usage, idle and unattached resources, and workload sizing to produce recommendations for downsizing, instance type changes, and purchase options (reserved/spot). It includes scripts and terraform/kubernetes patterns to automate cleanup, enforce resource quotas, and prefer lower-cost capacity types. It also shows how to estimate reserved-instance savings and generate cost optimization reports.

When to use it

  • When monthly cloud spend exceeds budget or growth is unexplained
  • During architecture reviews to select a mix of reserved, on-demand, and spot capacity
  • To identify and remove unattached volumes, idle instances, and orphaned IPs
  • When implementing Kubernetes scheduling policies to use spot nodes safely
  • For quarterly cost audits and reserved instance planning

Best practices

  • Tag resources consistently to map costs to teams and projects
  • Use reserved instances or committed use for predictable baseline capacity
  • Run rightsizing analysis regularly and act on underutilized instances
  • Schedule non-production resources to stop outside business hours
  • Combine spot instances with a stable reserved/on-demand baseline and use graceful interruption handling

Example use cases

  • Weekly script that lists unattached EBS volumes, unused Elastic IPs, and candidate RDS downsizes for manual review or automated removal
  • Terraform patterns deploying a mixed fleet: reserved capacity for baseline and spot overrides for burstable workload
  • Kubernetes deployment using taints/tolerations and node affinity to prefer spot capacity while maintaining availability with PDBs and quotas
  • Automated cost report that queries cost explorer and CloudWatch to list underutilized EC2 instances and estimated RI savings
  • Policy to auto-stop development clusters at night and release unused public IPs to reduce waste

FAQ

How much can I realistically save?

Savings depend on workload and commitment: reserved instances typically yield 40–70%, spot can cut compute costs by 70–90%, right-sizing adds 10–30%, and cleanup yields additional single-digit to low-double-digit savings.

Are spot instances safe for production?

Yes for fault-tolerant or stateless workloads when combined with strategies like mixed fleets, interruption handling, and a reserved/on-demand baseline.

How often should cost optimization run?

Run automated scans weekly, perform deeper rightsizing and reserved-instance planning monthly or quarterly, and review tagging and budgets continuously.