cost-observability skill

This skill helps you monitor cloud spending and attribute costs to teams, enabling proactive optimization and financial accountability.

npx playbooks add skill amnadtaowsoam/cerebraskills --skill cost-observability

---
name: Cost Observability and Monitoring
description: Techniques for gaining visibility into cloud spending, attributing costs to business units, and detecting financial anomalies.
---

# Cost Observability and Monitoring

## Overview

Cost Observability is the practice of extending traditional system observability (logs, metrics, traces) to include **financial** data. It lets engineering teams answer not just "Is the system healthy?" but also "Is the system cost-effective?"

**Core Principle**: "Total spend is a vanity metric; cost per unit of work is a performance metric."

---

## 1. Key Cost Metrics to Track

The goal is to move from **Macro** visibility (the bill) to **Micro** visibility (the request).

| Metric | Level | Purpose |
| :--- | :--- | :--- |
| **Total Monthly Spend** | Executive | General budget health. |
| **Cost per Service** | Engineering | Identify inefficient microservices. |
| **Cost per Customer (Unit Cost)** | Product | Calculate per-account profitability. |
| **Cost per Request** | Engineering | Measure efficiency of application code. |
| **COGS (Cost of Goods Sold)** | Financial | The base cost to deliver the service. |
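
As a rough illustration of the micro-level metrics in the table, the sketch below derives cost per request and cost per customer from one service's monthly figures. The `ServiceCostSummary` shape, the function names, and the sample numbers are assumptions made for this example, not part of any billing API.

```typescript
// Hypothetical monthly summary for one service (illustrative shape only).
interface ServiceCostSummary {
  service: string;
  monthlyCostUsd: number;
  requestCount: number;
  activeCustomers: number;
}

// Cost per request: service cost divided by requests served.
// Returns null when there is no traffic, so a zero is never mistaken for "free".
function costPerRequest(s: ServiceCostSummary): number | null {
  return s.requestCount > 0 ? s.monthlyCostUsd / s.requestCount : null;
}

// Cost per customer (unit cost): service cost divided by active customers.
function costPerCustomer(s: ServiceCostSummary): number | null {
  return s.activeCustomers > 0 ? s.monthlyCostUsd / s.activeCustomers : null;
}

// Sample figures, invented for the example.
const authApi: ServiceCostSummary = {
  service: 'auth-api',
  monthlyCostUsd: 1200,
  requestCount: 4000000,
  activeCustomers: 300,
};

console.log(costPerRequest(authApi));  // 0.0003
console.log(costPerCustomer(authApi)); // 4
```

Tracking these ratios over time, rather than the raw totals, is what makes a cost increase caused by traffic growth distinguishable from one caused by inefficiency.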

---

## 2. Cost Attribution and Tagging Strategy

Attribution is impossible without consistent metadata.

### The Standard Tagging Schema
Every resource should have the following "FinOps Tags":

1.  **`Environment`**: (e.g., `prod`, `staging`, `dev`)
2.  **`Service`**: (e.g., `auth-api`, `image-processor`)
3.  **`Owner`**: (e.g., `team-alpha`)
4.  **`Project`**: (e.g., `project-phoenix`)
5.  **`TenantID`**: (If using siloed resources per customer)
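
A schema like this is easiest to keep consistent when it is checked by code. The following is a minimal sketch of such a check, assuming tags arrive as a plain key/value object. The function name `findTagViolations` is hypothetical, and the allowed `Environment` values are taken from the examples above rather than from any fixed standard. `TenantID` is left optional because it only applies to siloed resources.

```typescript
// Tags every resource must carry; TenantID is intentionally optional.
const REQUIRED_TAGS = ['Environment', 'Service', 'Owner', 'Project'];

// Assumed set of valid environments (based on the examples in the schema).
const ALLOWED_ENVIRONMENTS = ['prod', 'staging', 'dev'];

// Returns a list of human-readable violations; an empty list means compliant.
function findTagViolations(tags: Record<string, string>): string[] {
  const violations: string[] = [];

  for (const key of REQUIRED_TAGS) {
    const value = tags[key];
    if (value === undefined || value.trim() === '') {
      violations.push(`missing tag: ${key}`);
    }
  }

  const env = tags['Environment'];
  if (env !== undefined && env.trim() !== '' && ALLOWED_ENVIRONMENTS.indexOf(env) === -1) {
    violations.push(`invalid Environment: ${env}`);
  }

  return violations;
}

// Example: a resource missing Owner and Project, with an unknown environment.
console.log(findTagViolations({ Environment: 'qa', Service: 'auth-api' }));
```

A check of this kind could run in CI against planned resources or periodically against a resource inventory; which of the two fits better depends on how resources are provisioned.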

### Enforcement Policy (Terraform/OpenTofu)
```hcl
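# Declare the environment and reject values outside the schema.
variable "environment" {
  type = string
  validation {
    condition     = contains(["prod", "staging", "dev"], var.environment)
    error_message = "Environment must be one of: prod, staging, dev."
  }
}

# Define the mandatory tags once and reuse them on every resource.
locals {
  mandatory_tags = {
    Environment = var.environment
    Service     = "payment-gateway"
    Owner       = "finance-team"
    Project     = "project-phoenix"
    CostCenter  = "9921" # optional extra beyond the standard schema
  }
}

resource "aws_instance" "app" {
  ami           = "ami-12345" # placeholder AMI ID
  instance_type = "t3.medium"
  tags          = local.mandatory_tags
}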
```

---

## 3. Cost Anomaly Detection

A financial anomaly is an unexpected deviation from historical spend patterns.

### Types of Anomalies
1.  **Sudden Spikes**: A developer spins up a massive GPU instance and forgets to delete it.
2.  **Gradual Drift**: A memory leak causes auto-scaling to add a new server every day.
3.  **Cyclical Variation**: Spend increases during weekends when it should be lower.

### Anomaly Alert Example (Slack/PagerDuty)
*   **Alert**: "AWS Spend Spike Detected"
*   **Metric**: `S3 Egress`
*   **Deviation**: +450% over the last 24 hours.
*   **Likely Cause**: Possible data exfiltration or misconfigured backup script.
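
The deviation figure in an alert like this can be computed as a percent change against a trailing baseline. The sketch below assumes a simple mean of prior daily values as the baseline and a configurable threshold. The function names and sample numbers are illustrative, and managed detectors such as AWS Cost Anomaly Detection use their own models rather than this rule.

```typescript
// Percent deviation of the latest value from the mean of prior values.
// Returns null when no usable baseline exists (no history or a zero mean).
function percentDeviation(history: number[], latest: number): number | null {
  if (history.length === 0) return null;
  const baseline = history.reduce((sum, v) => sum + v, 0) / history.length;
  if (baseline === 0) return null;
  return ((latest - baseline) / baseline) * 100;
}

// Flags spend that exceeds the baseline by more than thresholdPct percent.
function isSpendAnomaly(history: number[], latest: number, thresholdPct = 20): boolean {
  const deviation = percentDeviation(history, latest);
  return deviation !== null && deviation > thresholdPct;
}

// Seven days of flat egress cost followed by a spike (invented numbers).
const egressHistoryUsd = [40, 40, 40, 40, 40, 40, 40];
console.log(percentDeviation(egressHistoryUsd, 220)); // 450
console.log(isSpendAnomaly(egressHistoryUsd, 220));   // true
```

A short trailing window like this catches sudden spikes but tends to absorb gradual drift into the baseline. Comparing against a longer window, or against the same weekday in earlier weeks, is one way to cover the other two anomaly types.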

---

## 4. Application-Level Cost Tracking

Sometimes cloud tags aren't granular enough (e.g., when multiple customers share one database).

### OpenTelemetry for Cost
You can inject "cost attributes" into your traces to calculate the price of a specific API endpoint.

```typescript
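// Example: Tracking LLM cost in a trace
import { trace } from '@opentelemetry/api';

// Illustrative per-token prices in USD; substitute your provider's actual rates.
const INPUT_PRICE_PER_TOKEN = 0.00001;
const OUTPUT_PRICE_PER_TOKEN = 0.00003;

const span = trace.getTracer('llm-tracer').startSpan('generate_text');
try {
  // ... perform LLM call and read the token counts from its response
  const inputTokens = 1200; // placeholder value
  const outputTokens = 350; // placeholder value
  const cost = inputTokens * INPUT_PRICE_PER_TOKEN + outputTokens * OUTPUT_PRICE_PER_TOKEN;

  span.setAttribute('app.cost.usd', cost);
  span.setAttribute('app.tokens.input', inputTokens);
  span.setAttribute('app.tokens.output', outputTokens);
} finally {
  span.end(); // always end the span, even if the LLM call throws
}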
```
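
Once spans carry a cost attribute, the price of an endpoint or tenant is a matter of summing that attribute over the relevant spans. The sketch below assumes span data has already been exported into plain records. The `CostSpanRecord` shape and the `sumCostBy` helper are hypothetical and do not correspond to an OpenTelemetry API; in practice this aggregation is often done in the tracing backend's query language instead.

```typescript
// Hypothetical flattened span record, one per finished span.
interface CostSpanRecord {
  endpoint: string; // e.g. the span name or route
  tenantId: string; // customer the request belonged to
  costUsd: number;  // value of the 'app.cost.usd' attribute
}

// Sums span cost grouped by endpoint or by tenant.
function sumCostBy(
  records: CostSpanRecord[],
  key: 'endpoint' | 'tenantId',
): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const r of records) {
    const group = r[key];
    totals[group] = (totals[group] ?? 0) + r.costUsd;
  }
  return totals;
}

// Invented sample data.
const spanRecords: CostSpanRecord[] = [
  { endpoint: '/generate', tenantId: 'tenant-a', costUsd: 0.25 },
  { endpoint: '/generate', tenantId: 'tenant-b', costUsd: 0.5 },
  { endpoint: '/summarize', tenantId: 'tenant-a', costUsd: 0.25 },
];

console.log(sumCostBy(spanRecords, 'endpoint')); // { '/generate': 0.75, '/summarize': 0.25 }
console.log(sumCostBy(spanRecords, 'tenantId')); // { 'tenant-a': 0.5, 'tenant-b': 0.5 }
```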

---

## 5. Dashboard Templates

### Engineering Dashboard (Grafana)
*   **Top 5 Costliest Microservices** (Bar chart)
*   **Idle Resource Count** (Single stat)
*   **Compute Efficiency** (CPU utilization vs. Cost)
*   **Data Egress by Region** (Pie chart)

### Product/Executive Dashboard
*   **Revenue vs. Infrastructure Cost** (Area chart)
*   **Margin per Feature** (Heatmap)
*   **Cost per Daily Active User (DAU)** (Line chart)

---

## 6. Tools Ecosystem

### Native Cloud Tools
*   **AWS Cost Explorer**: Best for monthly trends and filtered views.
*   **AWS Cost Anomaly Detection**: Uses ML to flag unusual spend automatically.
*   **GCP Recommender**: Suggests specific sizing changes to save money.

### Specialized Tools
*   **CloudHealth / Cloudability**: Enterprise-grade cost allocation and multi-cloud reporting.
*   **Kubecost**: A widely used cost tool for Kubernetes. It allocates cluster costs to namespaces, workloads, and pods based on their resource requests and measured usage.
*   **Infracost**: A CLI tool that runs in CI/CD to tell you how much a Pull Request will cost before it's merged.

---

## 7. Chargeback vs. Showback

How do you hold teams accountable?

| Model | Description | Pros | Cons |
| :--- | :--- | :--- | :--- |
| **Showback** | Reporting costs to teams without actually billing their budgets. | Low friction, creates awareness. | No "teeth"; teams can ignore. |
| **Chargeback** | Directly deducting cloud costs from a department's real budget. | Forces accountability, drives optimization. | High administrative overhead. |
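
Both models start from the same per-team totals; they differ only in whether the totals are reported or billed. A minimal sketch of producing those totals from tagged cost line items follows. The `CostLineItem` shape and the `untagged` bucket name are assumptions for illustration.

```typescript
// Hypothetical billing line item with its resource tags attached.
interface CostLineItem {
  resourceId: string;
  costUsd: number;
  tags: Record<string, string>;
}

// Groups spend by the Owner tag; untagged spend is kept visible in its own bucket.
function totalsByOwner(items: CostLineItem[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const item of items) {
    const owner = item.tags['Owner'] || 'untagged';
    totals[owner] = (totals[owner] ?? 0) + item.costUsd;
  }
  return totals;
}

// Invented sample data.
const lineItems: CostLineItem[] = [
  { resourceId: 'i-001', costUsd: 10, tags: { Owner: 'team-alpha' } },
  { resourceId: 'i-002', costUsd: 5, tags: { Owner: 'team-alpha' } },
  { resourceId: 'vol-003', costUsd: 2.5, tags: {} },
];

console.log(totalsByOwner(lineItems)); // { 'team-alpha': 15, untagged: 2.5 }
```

Keeping an explicit `untagged` bucket, rather than dropping those rows, makes gaps in tagging coverage show up in the same report that teams already read.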

---

## 8. Cost Forecasting

Forecasting helps avoid end-of-quarter budget surprises.

1.  **Linear Projection**: `NextMonth = ThisMonth * (1 + GrowthRate)`, where `GrowthRate` is the month-over-month change as a fraction (e.g., `0.05` for 5%).
2.  **Seasonality-Aware**: Accounting for peak periods like Black Friday or holiday sales.
3.  **Scenario Planning**: "If we double our user base, what happens to our NAT Gateway costs?"
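
The three approaches can be sketched as small functions. This example assumes the growth rate is a fractional month-over-month change (`0.05` for 5%), that seasonality is expressed as a multiplier on the projected value, and that a cost line can be split into a fixed part and a usage-driven part. All function and parameter names are illustrative.

```typescript
// Linear projection: grow this month's spend by a month-over-month rate.
function projectNextMonth(thisMonthUsd: number, growthRate: number): number {
  return thisMonthUsd * (1 + growthRate);
}

// Seasonality-aware: scale the projection by a factor for the target month,
// e.g. 1.4 for a month that historically runs 40% above trend.
function projectWithSeasonality(
  thisMonthUsd: number,
  growthRate: number,
  seasonalFactor: number,
): number {
  return projectNextMonth(thisMonthUsd, growthRate) * seasonalFactor;
}

// Scenario planning: only the usage-driven part of a cost scales with users.
function scenarioCost(fixedUsd: number, variableUsd: number, usageMultiplier: number): number {
  return fixedUsd + variableUsd * usageMultiplier;
}

// "If we double our user base": a cost line with $100 fixed and $200 variable.
console.log(scenarioCost(100, 200, 2)); // 500
```

The split between fixed and variable cost is the hard part in practice; it usually has to be estimated from how the cost line has tracked traffic in the past.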

---

## 9. Common Optimization Targets

*   **S3 Storage Class Analysis**: Finding buckets that could move to Infrequent Access.
*   **Database Query Analysis**: Finding a single query that causes high CPU/IOPS across thousands of DB connections.
*   **Zombie Snapshots**: Deleting EBS snapshots older than 90 days that no AMI or retention policy still needs.
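
As one illustration, the age filter behind the snapshot cleanup can be expressed as below. The `SnapshotInfo` shape is a hypothetical, simplified view of snapshot metadata. The sketch only selects candidates by age and deliberately does not delete anything, since a snapshot may still back an AMI or be needed for compliance.

```typescript
// Hypothetical, simplified snapshot metadata.
interface SnapshotInfo {
  id: string;
  createdAt: Date;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns snapshots created more than maxAgeDays before `now`.
function findStaleSnapshots(
  snapshots: SnapshotInfo[],
  now: Date,
  maxAgeDays = 90,
): SnapshotInfo[] {
  const cutoff = now.getTime() - maxAgeDays * MS_PER_DAY;
  return snapshots.filter((s) => s.createdAt.getTime() < cutoff);
}

// Invented sample data, evaluated against a fixed reference date.
const sampleSnapshots: SnapshotInfo[] = [
  { id: 'snap-old', createdAt: new Date('2024-01-01T00:00:00Z') },
  { id: 'snap-recent', createdAt: new Date('2024-05-01T00:00:00Z') },
];
const referenceDate = new Date('2024-06-01T00:00:00Z');

console.log(findStaleSnapshots(sampleSnapshots, referenceDate).map((s) => s.id)); // [ 'snap-old' ]
```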

---

## 10. Implementation Checklist

- [ ] **Tagging Enforcement**: Do resources without tags trigger an alert or auto-deletion?
- [ ] **Accountability**: Does every `Owner` team have a dashboard showing their spend?
- [ ] **Thresholds**: Are there daily spending alerts set at 20% above "normal"?
- [ ] **Unit Economics**: Do we know the infrastructure cost of a single user transaction?
- [ ] **Forecasting**: Are we predicting next month's bill with < 10% error?
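
The forecasting item can be checked mechanically once the actual bill arrives. A minimal sketch, assuming error is measured as the absolute difference relative to the actual bill (the function name and the 10% default are taken from the checklist wording, not from a standard):

```typescript
// Absolute forecast error as a percentage of the actual bill.
// Returns null when the actual bill is zero, where the ratio is undefined.
function forecastErrorPct(forecastUsd: number, actualUsd: number): number | null {
  if (actualUsd === 0) return null;
  return (Math.abs(forecastUsd - actualUsd) / actualUsd) * 100;
}

// True when the forecast landed within the allowed error band.
function forecastWithinTarget(forecastUsd: number, actualUsd: number, maxErrorPct = 10): boolean {
  const error = forecastErrorPct(forecastUsd, actualUsd);
  return error !== null && error < maxErrorPct;
}

console.log(forecastErrorPct(750, 1000));     // 25
console.log(forecastWithinTarget(750, 1000)); // false
```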

---

## Related Skills
- `42-cost-engineering/cloud-cost-models`
- `42-cost-engineering/budget-guardrails`
- `40-system-resilience/chaos-engineering` (using chaos to test cost stability)

Overview

This skill teaches practical techniques for gaining visibility into cloud spending, attributing costs to teams and products, and detecting financial anomalies. It focuses on moving from bill-level views to per-request and per-customer cost metrics so engineering, product, and finance can make data-driven decisions. The guidance covers tagging, anomaly detection, application-level cost tracking, dashboards, and tooling choices.

How this skill works

The skill inspects cloud billing data, resource metadata (tags), telemetry (metrics/traces), and application events to map spend to business entities. It combines cost allocation rules with application-level attributes (for example, OpenTelemetry cost tags) to compute cost per service, customer, or request. Anomaly detection models flag deviations from historical patterns and trigger alerts for investigation.

When to use it

*   When you need to allocate cloud spend to teams, products, or customers for accountability.
*   When bills are growing and you want to find the services or queries driving the increase.
*   When you want to measure unit economics like cost per request, cost per customer, or COGS.
*   When you need automated alerts for sudden spend spikes or gradual cost drift.
*   When preparing executive dashboards that link revenue to infrastructure cost.

Best practices

*   Enforce a mandatory tagging schema (Environment, Service, Owner, Project, TenantID) at provisioning time via IaC.
*   Instrument application traces with cost attributes so shared resources can be attributed per request.
*   Track both macro metrics (total spend) and micro metrics (cost per request, per customer) to avoid misleading conclusions.
*   Use anomaly detection tuned to your patterns (hourly, daily, weekly) and route alerts to Slack/PagerDuty with context.
*   Combine native cloud tools (Cost Explorer, Anomaly Detection) with specialized tools (Kubecost, Infracost) for coverage.

Example use cases

*   Detecting a runaway GPU instance that caused a sudden $XK increase and alerting the owning team for remediation.
*   Calculating cost per active user to evaluate feature-level profitability before a product launch.
*   Using OpenTelemetry spans to charge multi-tenant database queries to individual TenantIDs.
*   Running CI checks with Infracost to reject PRs that introduce large recurring infra costs.
*   Building an engineering dashboard showing top 5 costliest microservices and idle resources to prioritize optimizations.

FAQ

How granular should tagging be?

Tag at the service and owner level as a minimum; add Project and TenantID for product or customer attribution. Keep tags consistent and enforced via IaC.

Which tools should I start with?

Start with native cloud cost explorer and anomaly detection for trend visibility, add Kubecost for Kubernetes, and use Infracost in CI for PR-level cost review.