This skill helps you reduce cloud spend using FinOps, right-sizing, and automated cost governance across multi-cloud environments.
To add this skill to your agents, run `npx playbooks add skill ancoleman/ai-design-components --skill optimizing-costs`.
---
name: optimizing-costs
description: Optimize cloud infrastructure costs through FinOps practices, commitment discounts, right-sizing, and automated cost management. Use when reducing cloud spend, implementing budget controls, or establishing cost visibility across AWS, Azure, GCP, and Kubernetes environments.
---
# Cost Optimization
## Purpose
Cloud cost optimization transforms uncontrolled spending into strategic resource allocation through the FinOps lifecycle: Inform, Optimize, and Operate. This skill provides decision frameworks for commitment-based discounts (Reserved Instances, Savings Plans), right-sizing strategies, Kubernetes cost management, and automated cost governance across multi-cloud environments.
## When to Use This Skill
Invoke this skill when:
- Reducing cloud spend by 15-40% through systematic optimization
- Implementing cost visibility dashboards and allocation tracking
- Establishing budget alerts and anomaly detection
- Optimizing Kubernetes resource requests and cluster efficiency
- Managing Reserved Instances, Savings Plans, or Committed Use Discounts
- Automating idle resource cleanup and right-sizing recommendations
- Setting up showback/chargeback models for internal teams
- Preventing cost overruns through CI/CD cost estimation (Infracost)
- Responding to finance team requests for cloud cost reduction
## FinOps Principles
### The FinOps Lifecycle
```
┌─────────────────────────────────────────────────────┐
│  INFORM  →  OPTIMIZE  →  OPERATE  (continuous loop) │
│    ↓           ↓            ↓                       │
│Visibility    Action     Automation                  │
└─────────────────────────────────────────────────────┘
```
**Inform Phase:** Establish cost visibility
- Enable cost allocation tags (Owner, Project, Environment)
- Deploy real-time cost dashboards for engineering teams
- Integrate cloud billing data (AWS CUR, Azure Consumption API, GCP BigQuery)
- Set up Kubernetes cost monitoring (Kubecost, OpenCost)
**Optimize Phase:** Take action on cost drivers
- Purchase commitment-based discounts (40-72% savings)
- Right-size over-provisioned resources (target 60-80% utilization)
- Implement spot/preemptible instances for fault-tolerant workloads
- Clean up idle resources (unattached volumes, old snapshots)
**Operate Phase:** Automate and govern
- Budget alerts with cascading notifications (50%, 75%, 90%, 100%)
- Automated cleanup scripts for idle resources
- CI/CD cost estimation to prevent surprise increases
- Continuous monitoring with anomaly detection
### Core FinOps Principles
1. **Collaboration:** Cross-functional teams (finance, engineering, operations, product)
2. **Accountability:** Teams own the cost of their services
3. **Transparency:** All costs visible and understandable to stakeholders
4. **Optimization:** Continuous improvement of cost efficiency
For detailed FinOps maturity models and organizational structures, see `references/finops-foundations.md`.
## Cost Optimization Strategies
### 1. Commitment-Based Discounts
**Reserved Instances (RIs):** 40-72% discount for 1-3 year commitments
- **Standard RI:** Instance type locked, highest discount (60% for 3-year)
- **Convertible RI:** Flexible instance types, moderate discount (54% for 3-year)
- **Use for:** Databases (RDS, ElastiCache), stable production EC2 workloads
**Savings Plans:** Flexible compute commitments
- **Compute Savings Plans:** Apply to EC2, Fargate, and Lambda across instance families and regions (up to 66% discount)
- **EC2 Instance Savings Plans:** Tied to an instance family in a single region (up to 72% discount)
- **Use for:** Workloads that change instance types or regions
**GCP Committed Use Discounts (CUDs):** 25-70% discount
- **Resource-based CUDs:** Commit to vCPU, memory, GPUs
- **Spend-based CUDs:** Commit to dollar amount (flexible)
- **Sustained Use Discounts:** Automatic 20-30% discount for sustained usage (no commitment)
**Decision Framework:**
```
Reserve when:
├─ Workload is production-critical (24/7 uptime required)
├─ Usage is predictable (stable baseline over 6+ months)
├─ Architecture is stable (unlikely to change instance types)
└─ Financial commitment acceptable (1-3 year lock-in)
Use On-Demand when:
├─ Development/testing environments
├─ Unpredictable spiky workloads
├─ Short-term projects (<6 months)
└─ Evaluating new instance types
```
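The framework above can be sketched as a small Python check. This is an illustrative sketch, not part of the skill's scripts: the `Workload` fields and `recommend_pricing` are hypothetical names, and treating the four "Reserve when" criteria as all-required is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload profile; field names are illustrative."""
    production_critical: bool     # 24/7 uptime required
    stable_baseline_months: int   # months of predictable baseline usage
    architecture_stable: bool     # instance types unlikely to change
    lock_in_acceptable: bool      # finance accepts a 1-3 year commitment


def recommend_pricing(w: Workload) -> str:
    """Return 'reserve' only when every reservation criterion holds (assumed AND)."""
    if (
        w.production_critical
        and w.stable_baseline_months >= 6
        and w.architecture_stable
        and w.lock_in_acceptable
    ):
        return "reserve"
    # Dev/test, spiky, short-term, or experimental workloads stay on-demand.
    return "on-demand"
```

A stricter or looser rule (for example, scoring the criteria instead of requiring all four) may suit some organizations better.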
For detailed commitment strategies and RI coverage analysis, see `references/commitment-strategies.md`.
### 2. Spot and Preemptible Instances
**Discount:** 70-90% off on-demand pricing (interruptible on short notice: 2 minutes on AWS, roughly 30 seconds on Azure and GCP)
**Use Spot For:** CI/CD workers, batch jobs, ML training (with checkpointing), Kubernetes workers, data analytics
**Avoid Spot For:** Stateful databases, real-time services, long-running jobs without checkpointing
**Best Practices:**
- Diversify instance types and spread across Availability Zones
- Implement graceful shutdown handlers
- Auto-fallback to on-demand when capacity unavailable
- Kubernetes: Mix 70% spot + 30% on-demand nodes with taints/tolerations
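To get a rough feel for what a 70/30 spot and on-demand mix saves, the arithmetic can be sketched as below. The function names and the default 80% spot discount are assumptions chosen from within the 70-90% range quoted above; real spot prices vary by instance type, zone, and time.

```python
def blended_hourly_cost(on_demand_rate, node_count,
                        spot_fraction=0.70, spot_discount=0.80):
    """Hourly cost of a mixed fleet (hypothetical helper).

    on_demand_rate: on-demand price per node-hour
    spot_fraction:  share of nodes running on spot (0.70 = 70%)
    spot_discount:  assumed spot discount vs. on-demand (0.80 = 80% off)
    """
    spot_nodes = node_count * spot_fraction
    on_demand_nodes = node_count - spot_nodes
    spot_rate = on_demand_rate * (1 - spot_discount)
    return spot_nodes * spot_rate + on_demand_nodes * on_demand_rate


def savings_fraction(on_demand_rate, node_count, **mix):
    """Fraction saved compared with running every node on-demand."""
    full_price = on_demand_rate * node_count
    return 1 - blended_hourly_cost(on_demand_rate, node_count, **mix) / full_price
```

With these assumed defaults, ten nodes at $0.10/hour cost about $0.44/hour instead of $1.00/hour, a saving of about 56%.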
### 3. Right-Sizing Strategies
**Target Utilization:** 60-80% average (leave headroom for spikes)
**Compute Right-Sizing:**
- Analyze actual CPU/memory utilization over 30+ days
- Downsize instances with <40% average utilization
- Consolidate underutilized workloads
- Switch instance families (compute-optimized vs. memory-optimized)
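The utilization thresholds can be turned into a simple triage function. This is a minimal sketch: `classify_utilization` is a hypothetical name, the input is assumed to be CPU or memory utilization percentages sampled over 30+ days, and the "review" and "add-headroom" bands are assumptions that fill the gaps between the thresholds stated above.

```python
from statistics import mean


def classify_utilization(samples):
    """Classify an instance from utilization percentages (0-100).

    Thresholds: <40% average is a downsize candidate; 60-80% is the target band.
    The 40-60% and >80% labels are illustrative assumptions.
    """
    avg = mean(samples)
    if avg < 40:
        return "downsize"
    if avg < 60:
        return "review"        # below target, but not clearly over-provisioned
    if avg <= 80:
        return "ok"            # inside the 60-80% target band
    return "add-headroom"      # little room left for spikes
```

Averages hide spikes, so a production version would likely also look at peak or p95 utilization before downsizing.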
**Database Right-Sizing:**
- Analyze connection pool usage (max connections vs. allocated)
- Reduce provisioned storage IOPS if utilization <50%
- Evaluate read replica necessity (can caching replace it?)
- Consider serverless options (Aurora Serverless, Azure SQL Serverless)
**Kubernetes Right-Sizing:**
- Set requests = average usage (not peak)
- Set limits = 2-3x requests (allow bursting)
- Use Vertical Pod Autoscaler (VPA) for automated recommendations
- Identify pods with 0% CPU usage (candidates for consolidation)
**Storage Right-Sizing:**
- Delete unattached volumes (EBS, Azure Disks, GCP Persistent Disks)
- Delete old snapshots (>90 days old and not required by a retention policy)
- Implement lifecycle policies (S3 Intelligent-Tiering, Azure Blob Lifecycle)
- Compress/deduplicate data
**Right-Sizing Tools:**
- **AWS Compute Optimizer:** ML-based EC2, Lambda, EBS recommendations
- **Azure Advisor:** VM rightsizing, reserved instance advice
- **GCP Recommender:** VM, disk, commitment recommendations
- **VPA (Vertical Pod Autoscaler):** Automated container resource requests
### 4. Kubernetes Cost Management
**Resource Requests and Limits:**
```yaml
# Set requests = average usage (enables efficient bin-packing)
resources:
  requests:
    cpu: 500m      # 0.5 CPU cores (average usage)
    memory: 1Gi    # 1 GiB memory (average usage)
  limits:
    cpu: 1500m     # 1.5 CPU cores (3x requests, allows bursting)
    memory: 3Gi    # 3 GiB memory (3x requests)
```
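The values in a block like this can be derived from observed usage. The sketch below assumes usage samples are already available as CPU millicores and memory MiB (for example, exported from a metrics system); `suggest_resources` is a hypothetical helper, not a VPA or Kubernetes API.

```python
from statistics import mean


def suggest_resources(cpu_millicores, memory_mib, limit_multiplier=3):
    """Suggest requests = average usage and limits = multiplier x requests.

    cpu_millicores / memory_mib: lists of observed usage samples.
    limit_multiplier: integer burst factor (2-3x per the guidance above).
    """
    cpu_request = round(mean(cpu_millicores))
    memory_request = round(mean(memory_mib))
    return {
        "requests": {"cpu": f"{cpu_request}m", "memory": f"{memory_request}Mi"},
        "limits": {
            "cpu": f"{cpu_request * limit_multiplier}m",
            "memory": f"{memory_request * limit_multiplier}Mi",
        },
    }
```

For samples averaging 500m CPU and 1024Mi memory, this yields the same shape as the YAML above: 500m/1024Mi requests and 1500m/3072Mi limits.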
**Namespace Quotas:** Prevent runaway resource consumption
- ResourceQuota: Limit total CPU/memory per namespace
- LimitRange: Default/max requests per pod
- PriorityClass: Ensure critical pods get resources
**Cluster Autoscaling:**
- Scale down idle nodes to reduce costs
- Scale-to-zero for dev clusters during off-hours
- Use multiple node pools (spot + on-demand mix)
- Set max node limits to prevent overspend
**Cost Visibility:**
- Deploy Kubecost or OpenCost for namespace-level cost tracking
- Allocate costs by labels (team, project, environment)
- Track idle cost (cluster capacity not allocated to workloads)
- Generate showback/chargeback reports
For detailed Kubernetes cost optimization patterns, see `references/kubernetes-cost-optimization.md`.
## Cost Visibility and Monitoring
### Tagging for Cost Allocation
**Required Tags:**
- `Owner` or `Team` - Responsible team/department
- `Project` or `Application` - Business unit or application name
- `Environment` - prod, staging, dev, test
- `CostCenter` - Finance cost center code
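A tag policy like this is easy to check in code before resources reach the bill. The following is a minimal sketch that validates a plain dictionary of tags; `missing_tags` and the constants are hypothetical names, and fetching tags from a cloud provider is left out.

```python
# Each tuple lists acceptable alternatives for one required tag.
REQUIRED_TAG_GROUPS = [
    ("Owner", "Team"),
    ("Project", "Application"),
    ("Environment",),
    ("CostCenter",),
]
ALLOWED_ENVIRONMENTS = {"prod", "staging", "dev", "test"}


def missing_tags(tags):
    """Return a list of tag-policy problems for one resource's tags."""
    problems = []
    for group in REQUIRED_TAG_GROUPS:
        # A group is satisfied when any of its alternatives has a non-empty value.
        if not any(tags.get(key) for key in group):
            problems.append(" or ".join(group))
    environment = tags.get("Environment")
    if environment and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"Environment (invalid value: {environment})")
    return problems
```

An empty list means the resource satisfies the policy; anything else can be reported to the owning team or used to block a deployment.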
**Enable Cost Allocation Tags:**
- **AWS:** Activate tags in Cost Allocation Tags console
- **Azure:** Apply tags via Azure Policy enforcement
- **GCP:** Use labels on all resources, export to BigQuery
For comprehensive tagging strategies, see `references/tagging-for-cost-allocation.md`.
### Monitoring and Dashboards
**Native Cloud Tools:**
- **AWS Cost Explorer:** Analyze spending patterns, forecast costs
- **Azure Cost Management + Billing:** Budget tracking, cost analysis
- **GCP Cloud Billing:** BigQuery export for custom analysis
**Third-Party Platforms:**
- **Kubecost:** Kubernetes cost visibility and optimization
- **CloudZero:** Unit cost economics, anomaly detection
- **CloudHealth:** Multi-cloud cost management
- **Infracost:** Terraform cost estimation in CI/CD
**Key Metrics to Track:**
- Total monthly cloud spend (trend over time)
- Cost per service/team/project (allocation accuracy)
- Unit cost metrics (cost per customer, cost per transaction)
- Reserved Instance/Savings Plan utilization (target >95%)
- Idle resource waste (target <5% of total spend)
- Budget variance (forecasted vs. actual)
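Several of these metrics are simple ratios. The sketch below shows one plausible way to compute them; the function names are hypothetical, and the exact definitions (for instance, measuring commitment utilization in hours versus dollars) are assumptions that should match whatever your billing tool reports.

```python
def commitment_utilization_pct(used_hours, purchased_hours):
    """Share of purchased RI/Savings Plan hours actually used (target >95%)."""
    return 100 * used_hours / purchased_hours


def idle_waste_pct(idle_spend, total_spend):
    """Spend on idle resources as a share of total spend (target <5%)."""
    return 100 * idle_spend / total_spend


def budget_variance_pct(forecast, actual):
    """Positive values mean actual spend exceeded the forecast."""
    return 100 * (actual - forecast) / forecast


def unit_cost(total_spend, units):
    """Cost per customer, per transaction, or per any other business unit."""
    return total_spend / units
```

For example, using 60 of 100 purchased RI hours gives 60% utilization, well under the 95% target.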
### Budget Alerts and Anomaly Detection
**Cascading Budget Alerts:**
```
50% of budget → Email to team lead (informational)
75% of budget → Email + Slack to team (warning)
90% of budget → Email + Slack + PagerDuty (urgent)
100% of budget → Automated shutdown (non-prod only) or escalation
```
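The cascade above maps naturally onto a lookup from budget percentage to notification channels. This is an illustrative sketch only: the channel names are placeholder strings, and wiring them to real email, Slack, or PagerDuty integrations is out of scope.

```python
# Highest threshold first so the first match is the most severe tier.
ALERT_TIERS = [
    (100, ["shutdown-nonprod-or-escalate"]),
    (90, ["email", "slack", "pagerduty"]),
    (75, ["email", "slack"]),
    (50, ["email"]),
]


def alert_channels(spend, budget):
    """Return the channels to notify for the current spend, or [] below 50%."""
    percent_used = 100 * spend / budget
    for threshold, channels in ALERT_TIERS:
        if percent_used >= threshold:
            return channels
    return []
```

In practice each tier should fire once per budget period rather than on every evaluation, which requires remembering the last tier that was notified.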
**Anomaly Detection:** Alert on unexpected cost spikes
- >20% cost increase week-over-week
- >$500 unexpected daily cost spike
- New resource types (unusual spend patterns)
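The first two rules can be sketched against a list of daily cost totals. `detect_anomalies` is a hypothetical helper, and interpreting the "$500 daily spike" as a day-over-day jump is an assumption; managed services such as AWS Cost Anomaly Detection use more sophisticated models.

```python
def detect_anomalies(daily_costs, wow_threshold=0.20, daily_spike_usd=500.0):
    """Flag simple cost anomalies.

    daily_costs: daily totals in USD, oldest first.
    Returns a list of finding labels (empty when nothing looks unusual).
    """
    findings = []
    if len(daily_costs) >= 14:
        previous_week = sum(daily_costs[-14:-7])
        current_week = sum(daily_costs[-7:])
        if previous_week > 0:
            growth = (current_week - previous_week) / previous_week
            if growth > wow_threshold:
                findings.append("week-over-week increase")
    # Assumed reading of the daily rule: today's cost jumped versus yesterday's.
    if len(daily_costs) >= 2 and daily_costs[-1] - daily_costs[-2] > daily_spike_usd:
        findings.append("daily spike")
    return findings
```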
**Budget Granularity:**
- Organization-level (total cloud spend)
- Department-level (engineering, data, marketing)
- Project-level (per application/service)
- Environment-level (prod vs. dev/staging)
## Decision Frameworks
### Framework 1: Commitment Discount Decision Tree
```
Should we purchase Reserved Instances / Savings Plans?
STEP 1: Analyze Historical Usage (6-12 months)
├─ Identify steady-state baseline (minimum usage)
├─ Exclude spiky/seasonal workloads
└─ Calculate: (baseline usage) / (total usage) = commitment %
STEP 2: Choose Commitment Type
├─ RESERVED INSTANCES
│ ├─ Pros: Highest discount (up to 72%)
│ ├─ Cons: Instance type locked (unless convertible)
│ └─ Use for: Databases, stable production workloads
│
├─ SAVINGS PLANS
│ ├─ Pros: Flexible (across instance types, regions)
│ ├─ Cons: Slightly lower discount than RI
│ └─ Use for: Compute workloads, Lambda, Fargate
│
└─ COMMITTED USE DISCOUNTS (GCP)
    ├─ Resource-based: vCPU/memory commitments
    └─ Spend-based: Dollar amount commitments
STEP 3: Determine Commitment Period
├─ 1-year commitment
│ ├─ Lower discount (40-50%)
│ └─ Less risk if architecture changes
│
└─ 3-year commitment
    ├─ Higher discount (60-72%)
    └─ Only for mature, stable workloads
STEP 4: Monitor and Optimize
├─ Target >95% RI/Savings Plan utilization
├─ Sell unused Standard RIs on AWS Reserved Instance Marketplace
└─ Adjust commitments quarterly based on usage trends
```
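Step 1's calculation can be sketched as follows, assuming usage is available as a series of equal-length intervals (for example, instance-hours per hour) with spiky or seasonal workloads already excluded. `commitment_percentage` is a hypothetical helper name.

```python
def commitment_percentage(usage_per_interval):
    """Share of total usage covered by committing to the steady-state baseline.

    The baseline is the minimum observed usage, sustained across every interval.
    """
    baseline = min(usage_per_interval)
    baseline_usage = baseline * len(usage_per_interval)
    total_usage = sum(usage_per_interval)
    return 100 * baseline_usage / total_usage
```

For usage of `[10, 12, 10, 20, 28]`, the baseline is 10 per interval, so committing to it covers 50 of 80 units, or 62.5% of total usage; the rest stays on-demand or spot.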
### Framework 2: Right-Sizing Priority Matrix
**Cost Impact vs. Effort:**
**High Impact, Low Effort (DO FIRST):**
- Idle resources (100% waste): Stopped instances, unattached volumes, old snapshots
- Unused NAT Gateways ($32/month each)
- Over-provisioned databases (<20% CPU for 30 days)
- Kubernetes pods with no resource requests set
**High Impact, Medium Effort (DO SECOND):**
- Over-provisioned compute (<40% CPU/memory for 30 days)
- Lambda functions with configured memory >2x peak used memory
- Storage optimization (S3 Intelligent-Tiering, gp3 vs. gp2)
**Low Impact, High Effort (DO LAST):**
- Application code optimization (requires profiling, refactoring)
- Architecture redesign (serverless migration, multi-region optimization)
**Weekly Optimization Routine:**
1. Delete idle resources (automated script)
2. Review top 10 cost drivers (manual analysis)
3. Right-size 3-5 instances/week (incremental approach)
4. Monitor impact (cost trend over 4 weeks)
### Framework 3: Spot vs. On-Demand Decision
```
Should this workload use Spot/Preemptible instances?
├─ Is the workload fault-tolerant?
│ ├─ NO → Use On-Demand
│ └─ YES → Continue
│
├─ Is the workload stateless (or has checkpointing)?
│ ├─ NO → Use On-Demand (data loss risk)
│ └─ YES → Continue
│
├─ Can the workload handle interruptions gracefully?
│ ├─ NO → Use On-Demand
│ └─ YES → Continue
│
└─ Workload Type Assessment:
    ├─ Batch Jobs / CI/CD → ✅ Use Spot (70-90% savings)
    ├─ ML Training → ✅ Use Spot (with checkpointing)
    ├─ Kubernetes Workers → ✅ Use Spot (mixed with on-demand)
    ├─ Production API Servers → ⚠️ Mixed fleet (70% spot, 30% on-demand)
    ├─ Databases → ❌ Use On-Demand (or Reserved)
    └─ Real-time Services → ❌ Use On-Demand (or Reserved)
```
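The three gating questions in this tree translate directly into code. The sketch below covers only those gates, not the workload-type assessment; `spot_decision` and its return format are hypothetical.

```python
from typing import Tuple


def spot_decision(fault_tolerant: bool,
                  stateless_or_checkpointed: bool,
                  handles_interruptions: bool) -> Tuple[str, str]:
    """Walk the three gates in order and return (choice, reason)."""
    if not fault_tolerant:
        return ("on-demand", "workload is not fault-tolerant")
    if not stateless_or_checkpointed:
        return ("on-demand", "data loss risk without checkpointing")
    if not handles_interruptions:
        return ("on-demand", "cannot handle interruptions gracefully")
    return ("spot", "passes all gates; assess by workload type next")
```

Returning the reason alongside the choice makes the outcome easier to explain in a review or a cost report.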
## Tool Selection Guide
### By Platform
| Platform | Cost Visibility | Right-Sizing | Automation |
|----------|----------------|--------------|------------|
| **AWS** | Cost Explorer, CUR | Compute Optimizer | AWS Budgets, Lambda cleanup |
| **Azure** | Cost Management | Azure Advisor | Azure Policy, Automation |
| **GCP** | Cloud Billing | Recommender | Budget Alerts, Cloud Functions |
| **Kubernetes** | Kubecost, OpenCost | VPA | Cluster Autoscaler |
| **Multi-Cloud** | CloudZero, CloudHealth | Densify | ParkMyCloud |
### By Use Case
| Use Case | Recommended Tool | Key Feature |
|----------|------------------|-------------|
| K8s cost visibility | Kubecost | Real-time namespace cost allocation |
| Terraform cost estimation | Infracost | PR comments with cost diffs |
| Multi-cloud aggregation | CloudHealth | Unified cost view across AWS/Azure/GCP |
| Automated optimization | nOps (AWS), CAST AI (K8s) | ML-based automation |
| Unit cost economics | CloudZero | Cost per customer/transaction tracking |
| Spot instance management | Spot.io | Automated spot orchestration |
For detailed tool comparisons and selection criteria, see `references/tools-comparison.md`.
## Cloud-Specific Tactics
### AWS Optimization Tactics
1. **Enable Cost & Usage Reports (CUR):** Export detailed billing to S3
2. **Use AWS Compute Optimizer:** ML-based EC2 rightsizing recommendations
3. **Implement Savings Plans:** More flexible than Reserved Instances
4. **S3 Intelligent-Tiering:** Automatic storage class optimization
5. **Lambda Right-Sizing:** Adjust memory allocation (CPU scales proportionally)
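6. **EBS gp3 Migration:** Up to 20% cheaper per GB than gp2, with a 3,000 IOPS baseline regardless of volume size (check large or high-throughput gp2 volumes before migrating)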
### Azure Optimization Tactics
1. **Enable Azure Advisor:** VM rightsizing and reserved instance recommendations
2. **Azure Hybrid Benefit:** Bring Windows Server licenses for discounts
3. **Dev/Test Pricing:** Reduced rates for non-production workloads
4. **Azure Spot VMs:** Up to 90% discount for interruptible workloads
5. **Storage Lifecycle Management:** Auto-tier blobs to cool/archive tiers
### GCP Optimization Tactics
1. **Export Billing to BigQuery:** Custom cost analysis with SQL
2. **Sustained Use Discounts:** Automatic 20-30% discount (no commitment)
3. **Committed Use Discounts:** 52-70% savings for 3-year commitments
4. **Preemptible VMs:** Up to 91% discount for batch workloads
5. **GCP Recommender:** Idle VM detection and rightsizing advice
For cloud-specific deep dives, see `references/cloud-specific-tactics.md`.
## Implementation Checklist
### Phase 1: Establish Visibility (Week 1-2)
- [ ] Define and apply cost allocation tags (Owner, Project, Environment)
- [ ] Activate cost allocation tags in cloud billing console
- [ ] Deploy Kubecost for Kubernetes cost visibility (if using K8s)
- [ ] Create cost dashboards (Grafana, CloudWatch, Azure Monitor, GCP)
- [ ] Set up weekly cost reports (emailed to team leads)
### Phase 2: Set Up Governance (Week 2-3)
- [ ] Create budget alerts (50%, 75%, 90%, 100% thresholds)
- [ ] Enable anomaly detection (>20% WoW increase)
- [ ] Implement tagging policy enforcement (Azure Policy, AWS Config, GCP Org Policy)
- [ ] Establish showback reports (cost by team/project)
- [ ] Document cost ownership (who owns which services)
### Phase 3: Quick Wins (Week 3-4)
- [ ] Delete idle resources (unattached volumes, old snapshots)
- [ ] Stop/terminate unused development instances
- [ ] Right-size top 10 over-provisioned instances (<40% utilization)
- [ ] Implement S3 Intelligent-Tiering or lifecycle policies
- [ ] Evaluate Reserved Instance/Savings Plan coverage
### Phase 4: Commitment Discounts (Month 2)
- [ ] Analyze 6-12 months usage history
- [ ] Calculate baseline usage for commitment sizing
- [ ] Purchase Reserved Instances for databases
- [ ] Purchase Savings Plans for compute workloads
- [ ] Monitor RI/SP utilization (target >95%)
### Phase 5: Automation (Month 2-3)
- [ ] Deploy automated cleanup scripts (weekly schedule)
- [ ] Integrate Infracost into CI/CD pipelines
- [ ] Implement auto-shutdown for dev/test environments (off-hours)
- [ ] Enable Vertical Pod Autoscaler (VPA) for K8s rightsizing
- [ ] Set up Spot instance automation (Spot.io, CAST AI, or native)
### Phase 6: Continuous Optimization (Ongoing)
- [ ] Weekly cost reviews with engineering teams
- [ ] Monthly optimization sprints (top cost drivers)
- [ ] Quarterly commitment adjustments (RI/SP coverage)
- [ ] Annual FinOps maturity assessment
## Common Pitfalls
### Pitfall 1: No Cost Visibility
❌ **Problem:** Finance team sees cloud bill at end of month, surprises everywhere
✅ **Solution:** Deploy real-time cost dashboards, daily Slack reports to engineering teams
### Pitfall 2: Reserved Instance Underutilization
❌ **Problem:** Purchased 100 RIs, only using 60 (40% wasted commitment)
✅ **Solution:** Monitor RI utilization weekly (target >95%), sell unused Standard RIs on the AWS Reserved Instance Marketplace
### Pitfall 3: Missing Kubernetes Resource Requests
❌ **Problem:** Pods with no requests set → inefficient bin-packing → wasted nodes
✅ **Solution:** Use VPA to auto-generate recommendations, enforce via admission control
### Pitfall 4: Idle Resources Not Cleaned Up
❌ **Problem:** 50 stopped EC2 instances (still paying for EBS), 200 unattached volumes
✅ **Solution:** Weekly automated cleanup of idle resources >7 days old
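The ">7 days idle" rule can be sketched as a filter over resource records. This is not the contents of `cleanup_idle_resources.py`: the record shape (`id`, `idle_since`) and the function name are assumptions, and discovering idle resources through each cloud's API is left out.

```python
from datetime import datetime, timedelta, timezone


def cleanup_candidates(resources, now, min_idle_days=7):
    """Return IDs of resources idle for longer than min_idle_days.

    resources: dicts with 'id' and a timezone-aware 'idle_since' datetime
               (hypothetical shape, e.g. when a volume was detached).
    now:       passed in explicitly so the function is easy to test.
    """
    cutoff = now - timedelta(days=min_idle_days)
    return [r["id"] for r in resources if r["idle_since"] < cutoff]


if __name__ == "__main__":
    inventory = [
        {"id": "vol-old", "idle_since": datetime(2024, 1, 1, tzinfo=timezone.utc)},
        {"id": "vol-new", "idle_since": datetime(2024, 1, 12, tzinfo=timezone.utc)},
    ]
    print(cleanup_candidates(inventory, now=datetime(2024, 1, 15, tzinfo=timezone.utc)))
```

A cautious rollout would log or tag the candidates first and delete only after a review period.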
### Pitfall 5: No Budget Alerts
❌ **Problem:** Accidentally left test cluster running, $10K bill surprise
✅ **Solution:** Budget alerts at 50%, 75%, 90%, 100% with Slack/PagerDuty notifications
## Related Skills
- **resource-tagging:** Cost allocation tags enable showback/chargeback models
- **kubernetes-operations:** K8s rightsizing, VPA, cluster autoscaling for cost optimization
- **infrastructure-as-code:** Infracost for Terraform cost estimation and policy-as-code
- **aws-patterns:** AWS-specific cost optimization tactics (EC2, RDS, S3, Lambda)
- **gcp-patterns:** GCP-specific optimizations (Compute Engine, BigQuery, Cloud Storage)
- **azure-patterns:** Azure-specific optimizations (VMs, Storage, App Service, Functions)
- **platform-engineering:** Internal FinOps platforms and self-service cost dashboards
- **disaster-recovery:** Balance cost vs. RTO/RPO (warm standby vs. cold standby)
## Examples
See `examples/` directory for:
- **terraform/**: AWS, Azure, GCP cost optimization infrastructure (budgets, alerts)
- **kubernetes/**: Kubecost deployment, resource quotas, VPA configurations
- **ci-cd/**: Infracost GitHub Actions, cost approval workflows
- **dashboards/**: Grafana cost dashboards, CloudWatch alarms
## Scripts
See `scripts/` directory for:
- **cleanup_idle_resources.py:** Automated AWS/Azure/GCP idle resource cleanup
- **ri_coverage_report.py:** Reserved Instance coverage analysis
- **cost_allocation_report.py:** Generate showback/chargeback reports
- **spot_savings_calculator.py:** Estimate savings from spot instances
- **k8s_rightsizing_audit.py:** Find K8s pods with missing resource requests
## Key Takeaways
1. **FinOps is a Culture:** Collaboration between finance, engineering, and operations
2. **Visibility First:** You can't optimize what you can't measure (tags + dashboards mandatory)
3. **Commitment = Savings:** Reserved Instances/Savings Plans provide 40-72% discounts
4. **Right-Size Continuously:** Target 60-80% utilization (leave headroom for spikes)
5. **Automate Cleanup:** Idle resources are 100% waste (weekly automated deletion)
6. **Kubernetes Costs Are Hidden:** Use Kubecost/OpenCost for namespace-level visibility
7. **Shift-Left Cost Awareness:** Infracost in CI/CD prevents surprise cost increases
8. **Budget Alerts Prevent Overspend:** Cascading notifications at 50%, 75%, 90%, 100%
9. **Spot for Fault-Tolerant Workloads:** 70-90% discount (CI/CD, batch jobs, ML training)
10. **Unit Cost Metrics Drive Value:** Track cost per customer, cost per transaction
This skill helps reduce cloud infrastructure costs using FinOps practices, commitment discounts, right-sizing, Kubernetes cost management, and automated governance across AWS, Azure, GCP, and Kubernetes. It delivers decision frameworks, tactical playbooks, and repeatable routines to convert uncontrolled spend into predictable, optimized cloud budgets. Use it to establish cost visibility, enforce budgets, and automate cleanup and recommendations.
The skill inspects billing exports, resource telemetry, and Kubernetes metrics to identify waste and optimization opportunities. It recommends commitment purchases (RIs, Savings Plans, CUDs), right-sizing actions for compute, databases, and storage, and spot/preemptible strategies for fault-tolerant workloads. It also defines automation patterns for budget alerts, idle-resource cleanup, CI/CD cost estimates, and namespace-level cost allocation in Kubernetes.
## FAQ
**When should I buy commitment discounts vs. use on-demand?**
Buy commitments for stable production workloads with a predictable baseline and a mature architecture; use on-demand for dev/test environments, spiky workloads, and short-term projects.
**How do I safely use spot/preemptible instances?**
Use them for fault-tolerant jobs that are stateless or checkpointed, diversify instance types and Availability Zones, implement graceful shutdown handlers, and fall back automatically to on-demand.