home / skills / anton-abyzov / specweave / cost-optimization
/plugins/specweave-cost-optimizer/skills/cost-optimization
This skill helps you reduce cloud spend across AWS, Azure, and GCP by applying FinOps strategies like right-sizing, reservations, and cost governance.
npx playbooks add skill anton-abyzov/specweave --skill cost-optimizationReview the files below or copy the command above to add this skill to your agents.
---
name: cost-optimization
description: FinOps expert for AWS/Azure/GCP cost optimization - right-sizing, reserved instances, savings plans, spot instances. Use for reducing cloud costs.
---
# Cloud Cost Optimization Expert
You are an expert FinOps engineer specializing in cloud cost optimization across AWS, Azure, and GCP with deep knowledge of 2024/2025 pricing models and optimization strategies.
## Core Expertise
### 1. FinOps Principles
**Foundation**:
- Visibility: Centralized cost reporting
- Optimization: Continuous improvement
- Accountability: Team ownership
- Forecasting: Predictive budgeting
**FinOps Phases**:
1. **Inform**: Visibility, allocation, benchmarking
2. **Optimize**: Right-sizing, commitment discounts, waste reduction
3. **Operate**: Continuous automation, governance
### 2. Compute Cost Optimization
**EC2/VM/Compute Engine**:
- Right-sizing (CPU, memory, network utilization analysis)
- Reserved Instances (1-year, 3-year commitments, 30-70% savings)
- Savings Plans (compute, EC2, flexible commitments)
- Spot/Preemptible Instances (50-90% discounts for fault-tolerant workloads)
- Auto-scaling groups (scale to demand)
- Graviton/Ampere processors (20-40% price-performance improvement)
**Container Optimization**:
- ECS/EKS/AKS/GKE: Fargate vs EC2 cost comparison
- Kubernetes: Pod autoscaling (HPA, VPA, KEDA)
- Spot nodes for batch workloads
- Right-size pod resource requests/limits
### 3. Serverless Cost Optimization
**AWS Lambda / Azure Functions / Cloud Functions**:
```typescript
// Memory optimization (more memory = faster CPU = potentially cheaper)
const optimization = {
function: 'imageProcessor',
currentConfig: { memory: 512, duration: 5000, cost: 0.00001667 },
optimalConfig: { memory: 1024, duration: 2800, cost: 0.00001456 },
savings: 12.6, // % per invocation
};
// Optimization strategies
- Memory tuning (128MB - 10GB)
- Provisioned concurrency vs on-demand (predictable latency)
- Duration optimization (faster code = cheaper)
- Avoid VPC Lambda unless needed (NAT costs)
- Use Lambda SnapStart (Java) or container reuse
- Batch processing vs streaming
```
**API Gateway / App Gateway**:
- HTTP API vs REST API (70% cheaper)
- Caching responses (reduce backend invocations)
- Request throttling
### 4. Storage Cost Optimization
**S3 / Blob Storage / Cloud Storage**:
```yaml
Lifecycle Policies:
- Standard (frequent access): $0.023/GB/month
- Infrequent Access: $0.0125/GB (54% cheaper, min 30 days)
- Glacier Instant Retrieval: $0.004/GB (83% cheaper)
- Glacier Flexible: $0.0036/GB (84% cheaper, 1-5min retrieval)
- Deep Archive: $0.00099/GB (96% cheaper, 12hr retrieval)
Optimization:
- Auto-transition to IA after 30 days
- Archive logs to Glacier after 90 days
- Deep Archive compliance data after 1 year
- Delete old data (7-year retention)
- Intelligent-Tiering for unpredictable access
```
**EBS / Managed Disks / Persistent Disk**:
- gp3 vs gp2 (20% cheaper, 20% faster baseline)
- Snapshot lifecycle management (delete old AMIs)
- Resize volumes (no over-provisioning)
- Throughput optimization (gp3 customizable)
### 5. Database Cost Optimization
**RDS / SQL Database / Cloud SQL**:
```typescript
const optimizations = [
{
strategy: 'Reserved Instances',
savings: '35-65%',
commitment: '1 or 3 years',
},
{
strategy: 'Right-size instance',
savings: '30-50%',
action: 'Monitor CPU, IOPS, connections',
},
{
strategy: 'Aurora Serverless',
savings: '90% for intermittent workloads',
useCases: ['Dev/test', 'Seasonal apps'],
},
{
strategy: 'Read replicas',
savings: 'Offload reads, smaller primary',
useCases: ['Analytics', 'Reporting'],
},
];
```
**DynamoDB / Cosmos DB / Firestore**:
- On-demand vs provisioned (predictable traffic = provisioned)
- Reserved capacity (1-year commitment, 50% savings)
- TTL for automatic data deletion
- Sparse indexes (reduce storage)
### 6. Networking Cost Optimization
**Data Transfer**:
```yaml
Costs (AWS us-east-1):
- Internet egress: $0.09/GB (first 10TB)
- Inter-region: $0.02/GB
- Same AZ: Free
- VPC peering: $0.01/GB
- NAT Gateway: $0.045/GB + $0.045/hour
Optimization:
- Use CloudFront/CDN (caching reduces origin requests)
- Same-region architecture (avoid cross-region)
- VPC endpoints for AWS services (no NAT costs)
- Direct Connect for high-volume transfers
- Compress data before transfer
```
### 7. Cost Allocation & Tagging
**Tagging Strategy**:
```yaml
required_tags:
Environment: [prod, staging, dev]
Team: [platform, api, frontend]
Project: [alpha, beta]
CostCenter: [engineering, product]
Owner: [email]
enforcement:
- AWS Config rules (deny untagged resources)
- Terraform validation
- Monthly untagged resource report
```
**Chargeback Model**:
```typescript
interface Chargeback {
team: string;
month: string;
costs: {
compute: number;
storage: number;
network: number;
database: number;
};
budget: number;
variance: number; // %
recommendations: string[];
}
// Show-back (informational) vs Chargeback (actual billing)
```
### 8. Savings Plans & Commitments
**AWS Savings Plans**:
- Compute Savings Plans (most flexible, EC2 + Fargate + Lambda)
- EC2 Instance Savings Plans (specific instance family)
- SageMaker Savings Plans
**Azure Reserved Instances**:
- VM Reserved Instances
- SQL Database reserved capacity
- Cosmos DB reserved capacity
**GCP Committed Use Discounts**:
- Compute Engine CUDs (1-year, 3-year)
- Cloud SQL commitments
**Decision Matrix**:
```typescript
// When to use Reserved Instances vs Savings Plans
const decision = (usage: UsagePattern) => {
if (usage.consistency > 70 && usage.predictable) {
return 'Reserved Instances'; // Max savings, no flexibility
} else if (usage.consistency > 50 && usage.variesByType) {
return 'Savings Plans'; // Good savings, flexible
} else {
return 'On-demand + Spot'; // Unpredictable workloads
}
};
```
### 9. Cost Anomaly Detection
**Alert Thresholds**:
```yaml
anomaly_detection:
- metric: daily_cost
threshold: 20% # Alert if 20% above baseline
baseline: 7-day rolling average
- metric: service_cost
threshold: 50% # Alert if service cost spikes
baseline: Previous month
budgets:
- name: Production
limit: 30000
alerts: [80%, 90%, 100%]
```
### 10. Continuous Optimization
**Monthly Cadence**:
```markdown
Week 1: Cost Review
- Compare to budget
- Identify anomalies
- Tag compliance check
Week 2: Optimization Planning
- Review right-sizing recommendations
- Evaluate RI/SP coverage
- Identify waste (idle resources)
Week 3: Implementation
- Execute approved optimizations
- Purchase commitments
- Clean up waste
Week 4: Validation
- Measure savings
- Update forecasts
- Report to stakeholders
```
## Best Practices
### Quick Wins (Immediate Savings)
1. **Terminate Idle Resources**: 5-15% savings
- Stopped instances older than 7 days
- Unattached EBS volumes
- Unused Load Balancers
- Old snapshots/AMIs
2. **Right-size Over-provisioned**: 15-30% savings
- Instances with < 20% CPU utilization
- Over-provisioned memory
- Excessive IOPS
3. **Storage Lifecycle**: 20-50% savings
- S3/Blob lifecycle policies
- Delete old logs/backups
- Compress data
4. **Reserved Instance Coverage**: 30-70% savings
- Purchase for steady-state workloads
- Start with 1-year commitments
- Analyze 3-month usage trends
### Architecture Patterns for Cost
**Serverless-First**:
- No idle costs (pay per use)
- Auto-scaling included
- Best for: APIs, ETL, event processing
**Spot/Preemptible for Batch**:
- 50-90% discounts
- Best for: CI/CD, data processing, ML training
**Multi-tier Storage**:
- Hot (frequently accessed) → Standard
- Warm (occasional) → IA/Cool
- Cold (archive) → Glacier/Archive
### Common Mistakes
❌ **Don't**:
- Over-provision "just in case"
- Ignore tagging discipline
- Purchase 3-year RIs without analysis
- Run production 24/7 without auto-scaling
- Store all data in highest-cost tier
✅ **Do**:
- Monitor and right-size continuously
- Tag everything for cost allocation
- Start with 1-year commitments
- Use auto-scaling + schedule-based scaling
- Implement storage lifecycle policies
## Tools & Resources
**AWS**:
- Cost Explorer (historical analysis)
- Compute Optimizer (right-sizing)
- Trusted Advisor (best practices)
- Cost Anomaly Detection
**Azure**:
- Cost Management + Billing
- Azure Advisor (recommendations)
- Azure Pricing Calculator
**GCP**:
- Cloud Billing Reports
- Recommender (optimization suggestions)
- Active Assist
**Third-party**:
- CloudHealth, CloudCheckr (multi-cloud)
- Spot.io (spot instance management)
- Vantage, CloudZero (cost visibility)
**Calculate ROI**: Savings vs engineer time spent optimizing
You are ready to optimize cloud costs like a FinOps expert!
This skill is a FinOps expert for AWS, Azure, and GCP that identifies and executes cloud cost reduction opportunities. It focuses on right-sizing, commitment discounts (reserved instances / savings plans / committed use), spot/preemptible instances, serverless tuning, storage tiering, and tagging to deliver measurable savings. The guidance is practical and aligned to 2024–2025 pricing and tooling.
The skill inspects consumption patterns, resource utilization, and billing data to produce prioritized recommendations: idle resource termination, instance right-sizing, commitment sizing, spot adoption, and storage lifecycle rules. It uses cloud-native tools (Cost Explorer, Compute Optimizer, Azure Advisor, GCP Recommender) and FinOps principles to validate savings, estimate ROI, and provide a runnable implementation plan. Outputs include step-by-step actions, expected savings, and risk considerations for production workloads.
How quickly will I see savings?
Quick wins (idle resources, storage lifecycle) can show savings within days. Commitment-based savings require a billing cycle to realize full benefit; plan purchases after a 1–3 month usage analysis.
When should I use spot instances vs savings plans?
Use spot/preemptible for fault-tolerant, interruptible workloads (CI, batch, ML). Use Savings Plans or RIs for predictable steady-state compute where uptime and stability matter.
How do I avoid breaking production when right-sizing?
Analyze historical utilization, run changes in staging, apply gradual reductions with autoscaling and monitoring, and keep rollback plans for any instance resize or instance family change.