
ai-cost-optimizer skill


This skill helps you reduce AI operational costs through dynamic routing, context caching, and token engineering across Gemini models, for faster and cheaper results.

npx playbooks add skill yuniorglez/gemini-elite-core --skill ai-cost-optimizer

---
name: ai-cost-optimizer
id: ai-cost-optimizer
version: 1.1.0
description: "Master of LLM Economic Orchestration, specialized in Google GenAI (Gemini 3), Context Caching, and High-Fidelity Token Engineering."
last_updated: "2026-01-22"
---

# Skill: AI Cost Optimizer (Standard 2026)

**Role:** The AI Cost Optimizer is a specialized "Token Economist" responsible for maximizing the reasoning output of AI agents while minimizing operational expense. In 2026, this role requires mastery of the pricing tiers of the Gemini 3 Flash and Lite models, and uses "Thinking-Level" routing and multi-layered caching to achieve up to 90% cost reduction on high-volume apps.

## 🎯 Primary Objectives
1.  **Economic Orchestration:** Dynamically routing prompts between Gemini 3 Pro, Flash, and Lite based on complexity.
2.  **Context Caching Mastery:** Implementing implicit and explicit caching for system instructions and long documents (v1.35.0+).
3.  **Token Engineering:** Reducing "noise tokens" through XML tagging and strict response schemas.
4.  **Usage Governance:** Implementing granular quotas and attribution to prevent runaway API billing.

---

## 🏗️ The 2026 Economic Stack

### 1. Target Models
- **Gemini 3 Pro:** Reserved for "Mission Critical" reasoning and deep architecture mapping.
- **Gemini 3 Flash-Preview:** The "Workhorse" for most coding and extraction tasks ($0.50/1M input).
- **Gemini Flash-Lite-Latest:** The "Utility" agent for real-time validation and short-burst responses.
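
To make the per-million pricing concrete, here is a minimal sketch of the input-cost arithmetic. Only the $0.50 per 1M input tokens figure for Flash-Preview comes from the list above; the function name, the constant name, and the 200,000-token sample are illustrative assumptions.

```typescript
// Hypothetical helper: input-side cost for a given token count.
// Prices are expressed in USD per 1M input tokens, as in the list above.
function estimateInputCostUSD(inputTokens: number, usdPerMillionTokens: number): number {
  if (inputTokens < 0 || usdPerMillionTokens < 0) {
    throw new RangeError("token count and price must be non-negative");
  }
  return (inputTokens / 1_000_000) * usdPerMillionTokens;
}

// The only price stated in this document: Gemini 3 Flash-Preview input.
const FLASH_PREVIEW_INPUT_USD_PER_MILLION = 0.5;

// A 200,000-token prompt (sample value) costs about $0.10 in input fees.
const sampleInputCost = estimateInputCostUSD(200_000, FLASH_PREVIEW_INPUT_USD_PER_MILLION);
console.log(sampleInputCost.toFixed(2)); // "0.10"
```

Output tokens and cached tokens are billed separately and are not covered here, since this document does not state those rates.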

### 2. Optimization Tools
- **Google GenAI Context Caching:** Reducing input fees for stable context blocks.
- **Thinking Level Param:** Controlling reasoning depth for cost/latency trade-offs.
- **Prompt Registry:** Deduplicating and optimizing recurring system instructions.

---

## 🛠️ Implementation Patterns

### 1. The "Thinking Level" Router
Adjusting the model's internal reasoning effort based on the task type.

```typescript
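// 2026 Pattern: Cost-Aware Generation
// Assumes `genAI` is an initialized Google GenAI client and `taskComplexity`
// was computed earlier; neither is defined in this snippet.
const model = genAI.getGenerativeModel({
  model: "gemini-3-flash",
  generationConfig: {
    // Spend more reasoning effort only on high-complexity tasks.
    // Accepted values depend on the SDK version; verify before use.
    thinkingLevel: taskComplexity === 'high' ? 'standard' : 'low',
    // Request JSON so responses stay compact and machine-parseable.
    responseMimeType: "application/json",
  }
});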
```

### 2. Explicit Context Caching (v1.35.0+)
Crucial for large codebases or stable documentation.

```typescript
// Squaads Standard: 1M+ token repository caching
// Assumes `cacheManager`, `model`, and `fullRepoData` are initialized elsewhere.
const codebaseCache = await cacheManager.create({
  model: "gemini-flash-lite-latest",
  contents: [{ role: "user", parts: [{ text: fullRepoData }] }],
  ttlSeconds: 86400, // Cache for 24 hours
});

// Subsequent calls use cachedContent to avoid full re-billing.
// `model` must be the same model the cache was created for.
const result = await model.generateContent({
  cachedContent: codebaseCache.name,
  contents: [{ role: "user", parts: [{ text: "Explain the auth flow." }] }],
});
```

### 3. XML System Instruction Packing
Using XML tags to reduce instruction drift and token wastage in multi-turn chats.

```xml
<system_instruction>
  <role>Senior Architect</role>
  <constraints>No legacy PHP, use Property Hooks</constraints>
</system_instruction>
```
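
If these blocks are assembled programmatically, a small builder keeps the layout consistent across agents. The sketch below reproduces the tag layout shown above; the `SystemInstruction` type and both function names are hypothetical, not part of any SDK.

```typescript
// Hypothetical shape for the fields used in the XML block above.
interface SystemInstruction {
  role: string;
  constraints: string[];
}

// Escape the characters that would otherwise break the XML structure.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Pack the instruction into the same tag layout as the example above.
function packSystemInstruction(instruction: SystemInstruction): string {
  const constraints = instruction.constraints.map(escapeXml).join(", ");
  return [
    "<system_instruction>",
    `  <role>${escapeXml(instruction.role)}</role>`,
    `  <constraints>${constraints}</constraints>`,
    "</system_instruction>",
  ].join("\n");
}

const packedInstruction = packSystemInstruction({
  role: "Senior Architect",
  constraints: ["No legacy PHP", "use Property Hooks"],
});
console.log(packedInstruction);
```

Escaping matters when constraints contain user-supplied text, since a stray `<` would otherwise open an unintended tag.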

---

## 🚫 The "Do Not List" (Anti-Patterns)
1.  **NEVER** send a full codebase in every prompt. Use **Repomix** for pruning and **Context Caching** for reuse.
2.  **NEVER** use high-resolution video frames (280 tokens per frame) for tasks that only need low resolution (70 tokens per frame).
3.  **NEVER** default to Gemini 3 Pro. Always start with Flash-Lite and escalate only if validation fails.
4.  **NEVER** allow agents to run in an infinite loop without a "Kill Switch" based on token accumulation.
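
The token-based kill switch from item 4 can be sketched as a small budget guard. This is a minimal illustration under stated assumptions: the class name, the `runAgent` loop, and the per-step token counts are hypothetical, and a real agent would take the counts from the usage metadata of each model response.

```typescript
// Hypothetical guard: tracks cumulative tokens against a fixed budget.
class TokenKillSwitch {
  private used = 0;

  constructor(private readonly maxTokens: number) {}

  // Add one step's token usage; returns false once the budget is exceeded.
  record(tokens: number): boolean {
    this.used += tokens;
    return this.used <= this.maxTokens;
  }
}

// Simulated agent loop: each entry is the token count reported by one step.
// Returns the number of steps that completed within budget.
function runAgent(stepTokenCounts: number[], budget: number): number {
  const guard = new TokenKillSwitch(budget);
  let completedSteps = 0;
  for (const tokens of stepTokenCounts) {
    if (!guard.record(tokens)) {
      break; // Kill switch: stop issuing further model calls.
    }
    completedSteps++;
  }
  return completedSteps;
}

console.log(runAgent([400, 400, 400], 1000)); // 2
```

The step that crosses the budget has already been billed by the time the guard trips, so the budget should be set somewhat below the hard spending limit.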

---

## 🛠️ Troubleshooting & Usage Audit

| Issue | Likely Cause | 2026 Corrective Action |
| :--- | :--- | :--- |
| **Billing Spikes** | Unoptimized multimodal input | Downsample images/video before sending to the model. |
| **Low Quality (Lite)** | Insufficient reasoning depth | Switch `thinkingLevel` to standard or route to Flash-Preview. |
| **Cache Misses** | Context drift in dynamic files | Isolate stable imports/types from volatile business logic. |
| **Hallucination** | Instruction drift in long context | Use `<system_instruction>` tags and explicit "Do Not" lists. |
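
For the cache-miss row, one way to isolate stable context is to derive the cache key only from files that rarely change. The sketch below assumes stable files can be recognized by path; the patterns, type names, and the djb2 fingerprint are illustrative choices, and a production system would more likely use a cryptographic hash such as SHA-256.

```typescript
interface SourceFile {
  path: string;
  content: string;
}

// Assumption: files matching these patterns change rarely (type definitions).
const STABLE_PATTERNS: RegExp[] = [/\.d\.ts$/, /\/types\//];

function isStable(file: SourceFile): boolean {
  return STABLE_PATTERNS.some((pattern) => pattern.test(file.path));
}

// Non-cryptographic djb2 fingerprint, for illustration only.
function fingerprint(text: string): string {
  let hash = 5381;
  for (let i = 0; i < text.length; i++) {
    hash = ((hash << 5) + hash + text.charCodeAt(i)) | 0;
  }
  return (hash >>> 0).toString(16);
}

// The key depends only on stable files, sorted so input order does not matter.
function stableCacheKey(files: SourceFile[]): string {
  const stable = files
    .filter(isStable)
    .sort((a, b) => (a.path < b.path ? -1 : a.path > b.path ? 1 : 0));
  return fingerprint(stable.map((f) => f.path + "\n" + f.content).join("\n"));
}

const keyBefore = stableCacheKey([
  { path: "src/types/user.ts", content: "export interface User {}" },
  { path: "src/billing/invoice.ts", content: "export const rate = 1;" },
]);
const keyAfter = stableCacheKey([
  { path: "src/types/user.ts", content: "export interface User {}" },
  { path: "src/billing/invoice.ts", content: "export const rate = 2;" },
]);
console.log(keyBefore === keyAfter); // true: only volatile code changed
```

An unchanged key means the existing cache can be reused; the volatile files are then sent as ordinary, uncached prompt content.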

---

## 📚 Reference Library
- **[Model Selection Matrix](./references/1-model-selection-matrix.md):** Choosing the right model for the job.
- **[Advanced Caching](./references/2-advanced-caching.md):** Mastering TTL and cache warming.
- **[Monitoring & Governance](./references/3-monitoring-and-governance.md):** Tools for tracking ROI.

---

## 📊 Economic Metrics
- **Cost per Feature:** < $0.05 (Target for Squaads agents).
- **Token Efficiency:** > 80% (share of tokens carrying knowledge rather than boilerplate).
- **Cache Hit Rate:** > 75% for codebase queries.
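
These three targets can be checked from per-request usage logs. The sketch below is one possible reading of the metrics, not a definition from this standard: it assumes cost per feature is total cost divided by the number of distinct features, token efficiency is knowledge tokens over all tokens, and hit rate is cache hits over requests. All type names, field names, and sample records are hypothetical.

```typescript
// Hypothetical per-request usage record; field names are illustrative.
interface UsageRecord {
  feature: string;
  costUSD: number;
  knowledgeTokens: number;   // tokens carrying task-relevant content
  boilerplateTokens: number; // wrappers, repeated instructions, filler
  cacheHit: boolean;
}

interface EconomicMetrics {
  costPerFeatureUSD: number;
  tokenEfficiency: number;
  cacheHitRate: number;
}

function computeMetrics(records: UsageRecord[]): EconomicMetrics {
  if (records.length === 0) {
    throw new Error("at least one usage record is required");
  }
  const features: { [name: string]: boolean } = {};
  let totalCost = 0;
  let knowledge = 0;
  let boilerplate = 0;
  let hits = 0;
  for (const r of records) {
    features[r.feature] = true;
    totalCost += r.costUSD;
    knowledge += r.knowledgeTokens;
    boilerplate += r.boilerplateTokens;
    if (r.cacheHit) hits++;
  }
  const totalTokens = knowledge + boilerplate;
  return {
    costPerFeatureUSD: totalCost / Object.keys(features).length,
    tokenEfficiency: totalTokens === 0 ? 0 : knowledge / totalTokens,
    cacheHitRate: hits / records.length,
  };
}

// Thresholds copied from the list above.
function meetsTargets(m: EconomicMetrics): boolean {
  return m.costPerFeatureUSD < 0.05 && m.tokenEfficiency > 0.8 && m.cacheHitRate > 0.75;
}

const sampleRecords: UsageRecord[] = [
  { feature: "search", costUSD: 0.01, knowledgeTokens: 900, boilerplateTokens: 100, cacheHit: true },
  { feature: "search", costUSD: 0.02, knowledgeTokens: 850, boilerplateTokens: 150, cacheHit: true },
  { feature: "summary", costUSD: 0.03, knowledgeTokens: 800, boilerplateTokens: 200, cacheHit: true },
  { feature: "summary", costUSD: 0.02, knowledgeTokens: 850, boilerplateTokens: 150, cacheHit: false },
  { feature: "summary", costUSD: 0.01, knowledgeTokens: 900, boilerplateTokens: 100, cacheHit: true },
];
console.log(meetsTargets(computeMetrics(sampleRecords))); // true
```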

---

## 🔄 Evolution of AI Pricing
- **2023:** Fixed per-token pricing (Prohibitive for large context).
- **2024:** First-gen Context Caching (Pro-only).
- **2025-2026:** Ubiquitous Caching and "Reasoning-on-Demand" (Thinking Level parameters).

---

**End of AI Cost Optimizer Standard (v1.1.0)**

*Updated: January 22, 2026 - 23:45*

Overview

This skill is an AI Cost Optimizer that orchestrates model routing, context caching, and token engineering to minimize GenAI operational costs while preserving reasoning quality. It focuses on Gemini 3 family models, multi-layer caching, and structured prompts to reduce waste and control billing. The goal is measurable cost reductions for high-volume agentic environments.

How this skill works

The optimizer inspects task complexity and routes work between Gemini 3 Pro, Flash-Preview, and Flash-Lite with a Thinking-Level parameter to balance cost and depth. It creates and reuses explicit cached context blocks for stable documents and system instructions, and applies XML-style tagging and strict response schemas to cut noise tokens. It also enforces quotas, attribution, and kill-switches to prevent runaway consumption.

When to use it

  • High-volume agent deployments where API costs are a major operating expense
  • Large codebases or documentation sets that are repeatedly queried
  • Pipelines that mix short validation tasks and deep reasoning tasks
  • Systems that need governance over token consumption and billing
  • Use cases requiring predictable latency/cost trade-offs

Best practices

  • Start with Flash-Lite/Flash-Preview for standard tasks and escalate to Pro only when validation fails
  • Create explicit context caches for stable assets and set sensible TTLs (e.g., 24h) to maximize cache hit rate
  • Segment stable imports/types from volatile business logic to avoid cache churn
  • Use Thinking-Level to throttle reasoning depth rather than swapping models frequently
  • Pack system instructions with XML-like tags and require strict response mime types (JSON) to reduce token drift
  • Implement per-agent quotas and a token-based kill-switch to guard against infinite loops

Example use cases

  • Autonomous code assistants that summarize or map large repositories using a cached code index
  • Customer-support agents that validate short responses via Flash-Lite and escalate complex policy questions to Flash-Preview
  • Batch extraction pipelines that reuse cached context blocks to extract entities from stable documents
  • Monitoring agents enforcing per-run token caps and routing economic fallbacks when thresholds are hit
  • Feature-cost dashboards computing cost-per-feature and token-efficiency metrics

FAQ

How much cost reduction can I expect?

On high-volume flows, combining multi-layer caching, thinking-level routing, and token engineering can reduce costs by 70–90%; actual results depend on workload characteristics.

When should I use Gemini 3 Pro?

Reserve Pro for mission-critical, deep-reasoning tasks that fail validation on Flash-Preview or Flash-Lite; default to lower-tier models and escalate only after verification.
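
The escalate-after-validation policy can be sketched as a loop over tiers. This is a minimal illustration, not the skill's actual router: `generate` and `validate` are hypothetical callbacks supplied by the caller, the tier labels are shorthand rather than real model IDs, and the calls are synchronous only to keep the sketch short (a real model call would be asynchronous).

```typescript
type ModelTier = "flash-lite" | "flash-preview" | "pro";

// Cheapest tier first; Pro is tried only after both lower tiers fail validation.
const ESCALATION_ORDER: ModelTier[] = ["flash-lite", "flash-preview", "pro"];

interface EscalationResult {
  tier: ModelTier;
  output: string;
  attempts: number;
  validated: boolean;
}

function runWithEscalation(
  prompt: string,
  generate: (tier: ModelTier, prompt: string) => string,
  validate: (output: string) => boolean
): EscalationResult {
  let lastOutput = "";
  for (let i = 0; i < ESCALATION_ORDER.length; i++) {
    const tier = ESCALATION_ORDER[i];
    lastOutput = generate(tier, prompt);
    if (validate(lastOutput)) {
      return { tier, output: lastOutput, attempts: i + 1, validated: true };
    }
  }
  // Every tier failed validation: return the Pro output, flagged as unvalidated.
  return { tier: "pro", output: lastOutput, attempts: ESCALATION_ORDER.length, validated: false };
}

// Stub generator: the cheapest tier returns malformed output, the next one valid JSON.
function stubGenerate(tier: ModelTier, _prompt: string): string {
  return tier === "flash-lite" ? "not json" : '{"answer":"ok"}';
}

// Validation here means "parses as JSON and has a string `answer` field".
function hasAnswer(output: string): boolean {
  try {
    return typeof JSON.parse(output).answer === "string";
  } catch (e) {
    return false;
  }
}

const escalated = runWithEscalation("Explain the refund policy.", stubGenerate, hasAnswer);
console.log(escalated.tier, escalated.attempts); // flash-preview 2
```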