inference-optimization skill

/plugin/skills/ai-engineering/inference-optimization

This skill helps optimize AI inference latency and cost by applying quantization, caching, batching, and speculative decoding techniques.

npx playbooks add skill doanchienthangdev/omgkit --skill inference-optimization

---
name: inference-optimization
description: Optimizing AI inference - quantization, speculative decoding, KV cache, batching, caching strategies. Use when reducing latency, lowering costs, or scaling AI serving.
---

# Inference Optimization Skill

Making AI inference faster and cheaper.

## Performance Metrics

```python
from dataclasses import dataclass

@dataclass
class InferenceMetrics:
    ttft: float        # Time to First Token (seconds)
    tpot: float        # Time Per Output Token (seconds)
    throughput: float  # Output tokens per second
    latency: float     # Total request time (seconds)
```
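
As a rough illustration of how these four numbers relate, the sketch below derives them from a request start time and the arrival time of each output token. The helper name `metrics_from_timestamps` and the sample timestamps are assumptions made for this example, and TPOT is taken here as the mean gap after the first token, which is one common convention rather than the only one.

```python
from dataclasses import dataclass


@dataclass
class InferenceMetrics:
    ttft: float        # seconds from request start to first token
    tpot: float        # mean seconds per output token after the first
    throughput: float  # output tokens per second over the whole request
    latency: float     # seconds from request start to last token


def metrics_from_timestamps(start: float, token_times: list[float]) -> InferenceMetrics:
    """Hypothetical helper: derive metrics from per-token arrival times."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    n = len(token_times)
    ttft = token_times[0] - start
    latency = token_times[-1] - start
    # Decode time is everything after the first token, spread over n - 1 tokens
    tpot = (latency - ttft) / (n - 1) if n > 1 else 0.0
    return InferenceMetrics(ttft=ttft, tpot=tpot, throughput=n / latency, latency=latency)


# Made-up timestamps: first token at 0.5 s, then one token every 0.25 s
m = metrics_from_timestamps(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(m)
```

Under this definition, latency is roughly `ttft + tpot * (n - 1)`, so a long prompt mostly shows up in TTFT while a long completion mostly shows up in the TPOT term.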

## Model Optimization

### Quantization

```python
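import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"  # example; any causal LM checkpoint

# 8-bit (bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# 4-bit (bitsandbytes); the compute dtype belongs in the config object
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)

# GPTQ (pre-quantized 4-bit checkpoint)
from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-GPTQ"
)

# AWQ (pre-quantized 4-bit checkpoint; fused layers speed up inference)
from awq import AutoAWQForCausalLM
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-AWQ",
    fuse_layers=True
)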
```
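
To get a feel for what each precision buys, the following back-of-the-envelope sketch estimates weight memory from parameter count and bits per weight. It assumes a nominal 7 billion parameters and counts weights only; real quantized checkpoints also store scales and zero points, and serving needs extra memory for activations and the KV cache. The function name `weight_memory_gib` is made up for this example.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


# Nominal 7B-parameter model (assumption) at three precisions
for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(7e9, bits):.1f} GiB")
```

Halving the bits per weight halves the weight footprint, which is why 4-bit formats can fit a model on a GPU where fp16 would not.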

### Speculative Decoding

```python
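import torch

@torch.no_grad()
def speculative_decode(target, draft, tokenizer, prompt, k=4, max_new_tokens=128):
    """Small model drafts, large model verifies (greedy variant, batch size 1).

    Assumes both models share `tokenizer` and live on the same device.
    """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(target.device)
    prompt_len = input_ids.shape[1]

    while input_ids.shape[1] - prompt_len < max_new_tokens:
        n = input_ids.shape[1]
        # Draft up to k tokens (output is the context plus the drafted tokens)
        draft_ids = draft.generate(input_ids, max_new_tokens=k, do_sample=False)
        drafted = draft_ids[0, n:]

        # Verify with target (single forward!)
        logits = target(draft_ids).logits
        target_next = logits[0, n - 1:].argmax(dim=-1)  # target's pick at each drafted position, plus one

        # Accept the longest matching prefix, then append one token from the target
        matches = (drafted == target_next[:-1]).long()
        n_accept = int(matches.cumprod(dim=0).sum())
        new_tokens = torch.cat([drafted[:n_accept], target_next[n_accept:n_accept + 1]])
        input_ids = torch.cat([input_ids, new_tokens.unsqueeze(0)], dim=-1)
        if tokenizer.eos_token_id in new_tokens.tolist():
            break

    return tokenizer.decode(input_ids[0, prompt_len:], skip_special_tokens=True)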
```
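
How much speculative decoding helps depends on how often the target agrees with the draft. As a simplified model, assume each drafted token is accepted independently with probability `alpha` (real acceptance rates vary by position and prompt). Each target pass then yields the accepted prefix plus one token from the target itself, which gives the geometric sum below. The function name is chosen for this example.

```python
def expected_tokens_per_target_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    Simplifying assumption: each of the k drafted tokens is accepted
    independently with probability alpha. The result is
    1 + alpha + alpha**2 + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


for alpha in (0.5, 0.8, 0.95):
    print(alpha, round(expected_tokens_per_target_pass(alpha, k=4), 2))
```

With a poor draft model the value approaches 1, meaning the draft passes are pure overhead, so the draft model's quality and cost both matter when choosing `k`.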

## Service Optimization

### KV Cache (vLLM)
```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    gpu_memory_utilization=0.9,
    max_model_len=4096,
    enable_prefix_caching=True  # Reuse common prefixes
)
```
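
The KV cache grows linearly with sequence length, so it helps to estimate its size when picking `max_model_len`. The sketch below assumes the publicly documented Llama-2-7B shape (32 layers, 32 key/value heads, head dimension 128) and fp16 cache entries; these numbers are assumptions for the example and should be checked against the model's config. The function name is also made up here.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per head per layer
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Assumed Llama-2-7B shape, fp16 cache
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
print(per_token / 2**20, "MiB per token")
print(per_token * 4096 / 2**30, "GiB for one 4096-token sequence")
```

Prefix caching lets requests that share a prefix reuse those cache entries instead of recomputing them, which is where the TTFT savings come from.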

### Batching
```python
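import asyncio

# Continuous batching (vLLM, TGI): requests join and leave the running batch at each decode step.

# Dynamic batching: flush when the batch is full or after max_wait_ms
class DynamicBatcher:
    def __init__(self, run_batch, max_batch=8, max_wait_ms=100):
        self.run_batch = run_batch  # async callable: list of requests -> list of results
        self.queue = []
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000  # seconds
        self.timer = None

    async def add(self, request):
        future = asyncio.get_running_loop().create_future()
        self.queue.append((request, future))
        if len(self.queue) >= self.max_batch:
            await self.process_batch()
        elif self.timer is None:
            self.timer = asyncio.create_task(self.flush_after_wait())
        return await future

    async def flush_after_wait(self):
        await asyncio.sleep(self.max_wait)
        self.timer = None
        await self.process_batch()

    async def process_batch(self):
        if self.timer is not None:
            self.timer.cancel()
            self.timer = None
        batch, self.queue = self.queue, []
        if batch:
            results = await self.run_batch([req for req, _ in batch])
            for (_, future), result in zip(batch, results):
                future.set_result(result)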
```

## Caching

### Exact Cache
```python
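import hashlib
import json

class PromptCache:
    def __init__(self, redis_client, ttl_seconds=3600):
        self.redis = redis_client
        self.ttl = ttl_seconds

    def get_or_generate(self, prompt, model):
        # hash() is salted per process; use a stable digest for a shared cache
        key = "prompt:" + hashlib.sha256(prompt.encode("utf-8")).hexdigest()

        cached = self.redis.get(key)
        if cached is not None:
            return json.loads(cached)

        response = model.generate(prompt)
        self.redis.setex(key, self.ttl, json.dumps(response))
        return response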
```
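
The cache above keys on the prompt alone. If the same cache serves several models or sampling settings, the key probably needs to cover those too, or one configuration's response may be returned for another. A minimal sketch of such a key, assuming a hypothetical `cache_key` helper and a JSON-serializable `params` dict:

```python
import hashlib
import json


def cache_key(model_name: str, prompt: str, params: dict) -> str:
    """Hypothetical helper: stable key over model, prompt, and sampling params."""
    payload = json.dumps(
        {"model": model_name, "prompt": prompt, "params": params},
        sort_keys=True,        # same params in any order give the same key
        ensure_ascii=False,
    )
    return "llm:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


k1 = cache_key("llama-2-7b", "Hello", {"temperature": 0.0, "max_tokens": 64})
k2 = cache_key("llama-2-7b", "Hello", {"max_tokens": 64, "temperature": 0.0})
k3 = cache_key("llama-2-7b", "Hello", {"temperature": 0.7, "max_tokens": 64})
print(k1 == k2, k1 == k3)
```

Exact caching is most useful when sampling is deterministic (for example temperature 0); with sampling enabled, a cached response pins one of many possible outputs.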

### Semantic Cache
```python
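import numpy as np

class SemanticCache:
    def __init__(self, embed):
        self.embed = embed      # callable: str -> 1-D numeric vector
        self.embeddings = {}    # prompt -> unit-normalized embedding
        self.responses = {}     # prompt -> cached response

    def get_or_generate(self, prompt, model, threshold=0.95):
        emb = np.asarray(self.embed(prompt), dtype=float)
        emb = emb / np.linalg.norm(emb)

        # Linear scan for the most similar cached prompt (fine for small caches)
        best, best_sim = None, threshold
        for cached, cached_emb in self.embeddings.items():
            sim = float(emb @ cached_emb)  # cosine similarity of unit vectors
            if sim > best_sim:
                best, best_sim = cached, sim
        if best is not None:
            return self.responses[best]

        response = model.generate(prompt)
        self.embeddings[prompt] = emb
        self.responses[prompt] = response
        return response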
```

## Best Practices

1. Start with quantization (easy win)
2. Use vLLM/TGI for serving
3. Enable prefix caching
4. Add semantic caching for common queries
5. Monitor TTFT and throughput

Overview

This skill explains practical techniques to optimize AI model inference for lower latency and cost. It covers model-level strategies (quantization, including GPTQ and AWQ, and speculative decoding) and service-level tactics (KV caching, batching, semantic and exact caches). The goal is to reduce Time to First Token (TTFT), lower Time Per Output Token (TPOT), and increase throughput for production serving.

How this skill works

It applies optimizations at two layers: model runtime and serving architecture. Model runtime changes use lower-precision weights (8-bit and 4-bit via bitsandbytes, GPTQ, or AWQ) and speculative decoding, where a small draft model proposes tokens that a larger target model verifies in a single forward pass. Serving optimizations reuse computation via KV/prefix caches, group requests with dynamic batching, and avoid repeated work with exact or semantic response caches.

When to use it

- When TTFT or per-token latency is harming user experience
- When GPU/CPU cost must be reduced without major accuracy loss
- When scaling to many concurrent requests or multitenant workloads
- When many requests share similar prompts or prefixes
- When deploying large models that exceed memory or throughput targets

Best practices

- Start with quantization (8-bit, then evaluate 4-bit/GPTQ/AWQ) to get quick cost and memory wins
- Measure TTFT, time per output token, throughput, and tail latency before and after each change
- Use vLLM or TGI for efficient serving, and enable prefix/KV caching when requests share prefixes
- Use speculative decoding to reduce the number of sequential target-model passes for long outputs
- Add semantic caching for high-frequency similar prompts and exact caching for deterministic calls
- Implement dynamic batching with a configurable max batch size and max wait to balance latency against throughput

Example use cases

- Chat agents where TTFT must be under 200–500 ms for user interactivity
- High-volume API serving to reduce per-request GPU cost via quantization and caching
- Multitenant deployments where many tenants reuse common system prompts or context
- Batch document generation workloads that benefit from higher throughput via dynamic batching
- Edge or resource-constrained inference where 4-bit/GPTQ/AWQ reduces memory footprint

FAQ

Will quantization always reduce accuracy?

Not always measurably. Quantization typically has a small accuracy impact, and the impact tends to grow as precision drops. Start with 8-bit, then profile 4-bit/GPTQ/AWQ and validate on representative inputs before deploying.
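
To see where the accuracy impact comes from, here is a toy round trip through symmetric absmax int8 quantization. This is a generic textbook scheme used purely for illustration; it is not a description of what bitsandbytes, GPTQ, or AWQ do internally, and the weight values are made up.

```python
def quantize_int8(values):
    """Symmetric absmax quantization of floats to integer codes in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid a zero scale
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # made-up example values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, max_err)
```

Each restored weight is off by at most half a quantization step (`scale / 2`). Fewer bits mean a larger step, which is why lower precisions warrant more careful validation.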

When should I add semantic cache versus exact cache?

Use exact cache for repeat identical prompts and deterministic outputs. Add semantic cache when many prompts are paraphrases or share intent to reuse similar responses.