
context-optimization skill

/plugins/ltk-core/skills/context-optimization

This skill helps optimize agent context by applying compression, masking, caching, and partitioning to reduce tokens while preserving essential information.

npx playbooks add skill eyadsibai/ltk --skill context-optimization

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: context-optimization
description: Use when optimizing agent context, reducing token costs, implementing KV-cache optimization, or asking about "context optimization", "token reduction", "context limits", "observation masking", "context budgeting", "context partitioning"
version: 1.0.0
---

# Context Optimization Techniques

Extend effective context capacity through compression, masking, caching, and partitioning. Applied well, these techniques can raise usable capacity 2-3x without moving to a larger model.

## Optimization Strategies

| Strategy | Typical Gain | Use Case |
|----------|-----------------|----------|
| Compaction | 50-70% | Message history dominates |
| Observation Masking | 60-80% | Tool outputs dominate |
| KV-Cache Optimization | 70%+ cache hits | Stable workloads |
| Context Partitioning | Variable | Complex multi-task |

## Compaction

Summarize context when approaching limits:

```python
# Trigger compaction once utilization crosses the threshold (80% here)
COMPACTION_THRESHOLD = 0.8

if context_tokens / context_limit > COMPACTION_THRESHOLD:
    context = compact_context(context)
```

**Priority for compression**:

1. Tool outputs → replace with summaries
2. Old turns → summarize early conversation
3. Retrieved docs → summarize if recent versions exist
4. **Never compress** the system prompt

**Summary generation by type**:

- **Tool outputs**: Preserve findings, metrics, conclusions
- **Conversational**: Preserve decisions, commitments, context shifts
- **Documents**: Preserve key facts, remove supporting evidence
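
A minimal sketch of what `compact_context` might look like, following the priority order above. The `summarize` helper and the `age_turns` field are hypothetical stand-ins for your summarization call and turn-age tracking:

```python
def compact_context(context: list[dict], summarize) -> list[dict]:
    """Compress context in priority order; the system prompt is never touched."""
    compacted = []
    for msg in context:
        if msg["role"] == "system":
            compacted.append(msg)  # never compress the system prompt
        elif msg["role"] == "tool":
            # Priority 1: replace tool outputs with summaries of findings/metrics
            compacted.append({**msg, "content": summarize(msg["content"], kind="tool")})
        elif msg.get("age_turns", 0) >= 3:
            # Priority 2: summarize old turns, preserving decisions and commitments
            compacted.append({**msg, "content": summarize(msg["content"], kind="turn")})
        else:
            compacted.append(msg)  # recent turns stay verbatim
    return compacted
```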

## Observation Masking

Tool outputs can be 80%+ of tokens. Replace verbose outputs with references:

```python
# Replace a verbose observation with a compact reference; the full
# text stays retrievable via the stored reference id
if len(observation) > max_length:
    ref_id = store_observation(observation)
    return f"[Obs:{ref_id} elided. Key: {extract_key(observation)}]"
return observation
```

**Masking rules**:

- **Never mask**: Current task critical, most recent turn, active reasoning
- **Consider**: 3+ turns old, key points extractable, purpose served
- **Always mask**: Repeated outputs, boilerplate, already summarized
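
The store behind the references can be as simple as an in-memory map keyed by a content hash, so an elided observation stays recoverable if the agent later needs the detail (a sketch; swap in persistent storage as needed):

```python
import hashlib

_observation_store: dict[str, str] = {}

def store_observation(observation: str) -> str:
    """Persist the full observation and return a short reference id."""
    ref_id = hashlib.sha256(observation.encode()).hexdigest()[:8]
    _observation_store[ref_id] = observation
    return ref_id

def expand_observation(ref_id: str) -> str:
    """Recover the full text when the elided detail is needed again."""
    return _observation_store[ref_id]
```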

## KV-Cache Optimization

Reuse cached Key/Value tensors across requests that share an identical prefix:

```python
# Cache-friendly ordering: stable content first
context = [
    system_prompt,      # Cacheable
    tool_definitions,   # Cacheable
    reused_templates,   # Reusable
    unique_content      # Unique
]
```

**Design for cache stability**:

- Avoid dynamic content (timestamps)
- Use consistent formatting
- Keep structure stable across sessions
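
For example, keeping the timestamp out of the shared prefix preserves byte-identical prefixes across requests (a sketch assuming a simple message-list API; `tools_block` is a hypothetical pre-serialized tool description):

```python
from datetime import datetime, timezone

def build_messages(system_prompt: str, tools_block: str, user_input: str) -> list[dict]:
    # Stable, byte-identical prefix: eligible for KV-cache reuse
    messages = [{"role": "system", "content": f"{system_prompt}\n\n{tools_block}"}]
    # Dynamic content (timestamp, user input) comes after the cacheable prefix
    now = datetime.now(timezone.utc).isoformat()
    messages.append({"role": "user", "content": f"[{now}] {user_input}"})
    return messages
```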

## Context Partitioning

Split work across sub-agents with isolated contexts:

```python
import asyncio

# Each sub-agent has a clean, focused context
results = await asyncio.gather(
    research_agent.search("topic A"),
    research_agent.search("topic B"),
    research_agent.search("topic C"),
)
# Coordinator synthesizes without carrying the full sub-agent contexts
synthesized = await coordinator.synthesize(results)
```
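
Each sub-agent runs with its own fresh context window; only the compact `results` cross the boundary, so the coordinator never accumulates the sub-agents' full tool outputs and retrieved documents.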

## Budget Management

```python
context_budget = {
    "system_prompt": 2000,
    "tool_definitions": 3000,
    "retrieved_docs": 10000,
    "message_history": 15000,
    "reserved_buffer": 2000
}
# Monitor and trigger optimization at 70-80%
```
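
A sketch of the monitoring side, assuming you track per-bucket token usage with your tokenizer; it reports which buckets crossed the trigger threshold so the matching optimization can be applied:

```python
def over_budget(usage: dict[str, int], budget: dict[str, int],
                threshold: float = 0.8) -> list[str]:
    """Return the buckets whose utilization exceeds the trigger threshold."""
    return [bucket for bucket, limit in budget.items()
            if usage.get(bucket, 0) / limit > threshold]

# e.g. if "message_history" is over budget, apply compaction;
# if "retrieved_docs" dominates, summarize or partition instead
```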

## When to Optimize

| Signal | Action |
|--------|--------|
| Utilization >70% | Start monitoring |
| Utilization >80% | Apply compaction |
| Quality degradation | Investigate cause |
| Tool outputs dominate | Observation masking |
| Docs dominate | Summarization/partitioning |

## Performance Targets

- Compaction: 50-70% reduction, <5% quality loss
- Masking: 60-80% reduction in masked observations
- Cache: 70%+ hit rate for stable workloads

## Best Practices

1. Measure before optimizing
2. Apply compaction before masking
3. Design for cache stability
4. Partition before context becomes problematic
5. Monitor effectiveness over time
6. Balance token savings vs quality
7. Test at production scale
8. Implement graceful degradation

Overview

This skill shows practical techniques to extend effective agent context capacity by compressing, masking, caching, and partitioning context. It focuses on reducing token costs and preserving quality so agents can handle larger workflows without switching to bigger models. The guidance targets engineers optimizing context budgets, KV-cache behavior, and observation handling.

How this skill works

The skill inspects runtime context usage and applies targeted optimizations: compaction summarizes low-value content, observation masking replaces large tool outputs with compact references, KV-cache optimization arranges stable content for high cache hits, and context partitioning splits work across sub-agents. It monitors utilization thresholds and triggers actions when usage crosses configured budgets to avoid quality loss while maximizing effective capacity.

When to use it

  • When token utilization approaches 70–80% of model context limits
  • When tool outputs or retrieved documents dominate token counts
  • To reduce per-call costs without changing model size
  • When you need stable KV-cache hit rates for repeated prefixes
  • When a task can be split across focused sub-agents to avoid carrying all context

Best practices

  • Measure baseline token usage and quality impact before applying changes
  • Apply compaction first (summarize history), then mask repetitive large outputs
  • Design prompts and content ordering for KV-cache stability (static first)
  • Avoid compressing the system prompt or very recent critical turns
  • Partition tasks proactively rather than waiting for severe context pressure
  • Monitor effectiveness and quality; tune thresholds and summaries over time

Example use cases

  • A chat agent that summarizes older turns when history exceeds 80% of context
  • A pipeline that stores verbose tool outputs and returns short observation references
  • A retrieval system that summarizes documents when retrieved-doc tokens dominate
  • A stable workload using cached key/value tensors by placing system and tool definitions first
  • A multitask coordinator that runs sub-agents for each research thread and synthesizes results

FAQ

Will compaction reduce answer quality?

Aggressive compaction can lose some detail; aim for 50–70% token reduction and validate that critical facts and decisions survive.

When should I mask observations versus summarize them?

Mask repeated or boilerplate outputs and use summaries for unique but verbose outputs; never mask the most recent or task-critical observations.