
This skill helps you compare and optimize LLM inference engines for production, local, and edge deployments to maximize throughput and efficiency.

npx playbooks add skill eyadsibai/ltk --skill llm-inference

---
name: llm-inference
description: Use when a task involves "LLM inference", "serving LLM", "vLLM", "llama.cpp", "GGUF", "text generation", "model serving", "inference optimization", "KV cache", "continuous batching", "speculative decoding", "local LLM", or "CPU inference"
version: 1.0.0
---

# LLM Inference

High-performance inference engines for serving large language models.

---

## Engine Comparison

| Engine | Best For | Hardware | Throughput | Setup |
|--------|----------|----------|------------|-------|
| **vLLM** | Production serving | GPU | Highest | Medium |
| **llama.cpp** | Local/edge, CPU | CPU/GPU | Good | Easy |
| **TGI** | HuggingFace models | GPU | High | Easy |
| **Ollama** | Local desktop | CPU/GPU | Good | Easiest |
| **TensorRT-LLM** | NVIDIA production | NVIDIA GPU | Highest | Complex |

---

## Decision Guide

| Scenario | Recommendation |
|----------|----------------|
| Production API server | vLLM or TGI |
| Maximum throughput | vLLM |
| Local development | Ollama or llama.cpp |
| CPU-only deployment | llama.cpp |
| Edge/embedded | llama.cpp |
| Apple Silicon | llama.cpp with Metal |
| Quick experimentation | Ollama |
| Privacy-sensitive (no cloud) | llama.cpp |

---

## vLLM

Production-grade serving with PagedAttention for efficient GPU memory usage.

### Key Innovations

| Feature | What It Does |
|---------|--------------|
| **PagedAttention** | Non-contiguous KV cache, better memory utilization |
| **Continuous batching** | Dynamic request grouping for throughput |
| **Speculative decoding** | Small model drafts, large model verifies |

**Strengths**: Highest throughput, OpenAI-compatible API, multi-GPU
**Limitations**: GPU required, more complex setup
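
A minimal sketch of offline batch generation with vLLM's Python API, assuming a CUDA GPU and an illustrative model name:

```python
# Offline batch inference with vLLM (GPU required).
from vllm import LLM, SamplingParams

# Illustrative model; any HF causal LM supported by vLLM works.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

sampling = SamplingParams(temperature=0.7, max_tokens=256)
prompts = [
    "Explain continuous batching in one sentence.",
    "What is a KV cache?",
]

# vLLM batches these prompts internally for throughput.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```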

**Key concept**: Serves OpenAI-compatible endpoints—drop-in replacement for OpenAI API.
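
As a sketch of that drop-in usage, assuming a vLLM server is already running locally (e.g. started with `vllm serve <model>` on port 8000) and using an illustrative model name:

```python
# Query a locally running vLLM server through the standard OpenAI client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # vLLM ignores the key unless auth is configured
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```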

---

## llama.cpp

C++ inference for running models anywhere—laptops, phones, Raspberry Pi.

### Quantization Formats (GGUF)

| Format | Size (7B) | Quality | Use Case |
|--------|-----------|---------|----------|
| **Q8_0** | ~7 GB | Highest | When you have RAM |
| **Q6_K** | ~6 GB | High | Good balance |
| **Q5_K_M** | ~5 GB | Good | Balanced |
| **Q4_K_M** | ~4 GB | OK | Memory constrained |
| **Q2_K** | ~2.5 GB | Low | Minimum viable |

**Recommendation**: Q4_K_M is the usual default for a good quality/size balance; step up to Q5_K_M or Q6_K if you have the RAM and want higher quality.
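
A minimal sketch of loading a Q4_K_M GGUF with the llama-cpp-python bindings (the model path is a placeholder for whichever quantized file you downloaded):

```python
# Load a quantized GGUF model with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,        # context window; lower this if you hit out-of-memory errors
    n_gpu_layers=-1,   # offload all layers if a GPU/Metal build is available
)

out = llm("Q: What is quantization? A:", max_tokens=128, temperature=0.2)
print(out["choices"][0]["text"])
```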

### Memory Requirements

| Model Size | Q4_K_M | RAM Needed |
|------------|--------|------------|
| 7B | ~4 GB | 8 GB |
| 13B | ~7 GB | 16 GB |
| 30B | ~17 GB | 32 GB |
| 70B | ~38 GB | 64 GB |

### Platform Optimization

| Platform | Key Setting |
|----------|-------------|
| **Apple Silicon** | `n_gpu_layers=-1` (Metal offload) |
| **CUDA GPU** | `n_gpu_layers=-1` + `offload_kqv=True` |
| **CPU only** | `n_gpu_layers=0` + set `n_threads` to core count |
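
A brief sketch of how those settings map to llama-cpp-python keyword arguments (the model path and thread handling are illustrative; pick whichever configuration matches your platform):

```python
import os
from llama_cpp import Llama

MODEL = "./models/model.Q4_K_M.gguf"  # placeholder path

# Apple Silicon (Metal) or CUDA build: offload all layers to the GPU.
gpu_llm = Llama(model_path=MODEL, n_gpu_layers=-1, offload_kqv=True)

# CPU-only build: keep every layer on the CPU and size the thread pool
# to your core count (os.cpu_count() reports logical cores).
cpu_llm = Llama(model_path=MODEL, n_gpu_layers=0, n_threads=os.cpu_count())
```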

**Strengths**: Runs anywhere, GGUF format, Metal/CUDA support
**Limitations**: Lower throughput than vLLM, single-user focused

**Key concept**: GGUF format + quantization = run large models on consumer hardware.

---

## Key Optimization Concepts

| Technique | What It Does | When to Use |
|-----------|--------------|-------------|
| **KV Cache** | Reuse attention computations | Always (automatic) |
| **Continuous Batching** | Group requests dynamically | High-throughput serving |
| **Tensor Parallelism** | Split model across GPUs | Large models |
| **Quantization** | Reduce precision (fp16→int4) | Memory constrained |
| **Speculative Decoding** | Small model drafts, large verifies | Latency sensitive |
| **GPU Offloading** | Move layers to GPU | When GPU available |
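
As a rough sketch, several of these techniques surface as vLLM engine arguments (values and the AWQ checkpoint name are illustrative; `quantization="awq"` assumes you point at an AWQ-quantized model):

```python
from vllm import LLM

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # illustrative AWQ checkpoint
    quantization="awq",          # quantization: run a reduced-precision checkpoint
    tensor_parallel_size=2,      # tensor parallelism: split the model across 2 GPUs
    gpu_memory_utilization=0.90, # fraction of GPU memory for weights + KV cache
    max_model_len=8192,          # cap context length to bound KV-cache memory
)
```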

---

## Common Parameters

| Parameter | Purpose | Typical Value |
|-----------|---------|---------------|
| **n_ctx** | Context window size | 2048-8192 |
| **n_gpu_layers** | Layers to offload | -1 (all) or 0 (none) |
| **temperature** | Randomness | 0.0-1.0 |
| **max_tokens** | Output limit | 100-2000 |
| **n_threads** | CPU threads | Match core count |
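
In llama-cpp-python terms, the split between load-time and call-time parameters looks roughly like this (path and values are illustrative):

```python
from llama_cpp import Llama

# Load-time parameters.
llm = Llama(
    model_path="./models/model.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,       # context window size
    n_gpu_layers=-1,  # -1 = offload all layers, 0 = CPU only
    n_threads=8,      # match your core count
)

# Call-time (sampling) parameters.
out = llm("Write a haiku about KV caches.", temperature=0.7, max_tokens=100)
print(out["choices"][0]["text"])
```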

---

## Troubleshooting

| Issue | Solution |
|-------|----------|
| Out of memory | Reduce n_ctx, use smaller quant |
| Slow inference | Enable GPU offload, use faster quant |
| Model won't load | Check GGUF integrity, check RAM |
| Metal not working | Reinstall/rebuild with Metal enabled: `-DGGML_METAL=on` (`-DLLAMA_METAL=on` on older versions) |
| Poor quality | Use higher quant (Q5_K_M, Q6_K) |

## Resources

- vLLM: <https://docs.vllm.ai>
- llama.cpp: <https://github.com/ggerganov/llama.cpp>
- TGI: <https://huggingface.co/docs/text-generation-inference>
- Ollama: <https://ollama.ai>
- GGUF Models: <https://huggingface.co/TheBloke>

## Overview

This skill provides practical guidance for selecting and operating high-performance LLM inference engines for GPU and CPU deployments. It summarizes trade-offs between engines like vLLM and llama.cpp, quantization strategies (GGUF), and key optimizations for throughput and memory. Use it to design serving stacks for production APIs, local desktop inference, or edge devices.

## How this skill works

The skill compares engine strengths and limitations, maps scenarios to recommendations (e.g., production vs. local), and explains core techniques such as KV caching, continuous batching, speculative decoding, quantization, and GPU offloading. It translates those concepts into configuration tips (n_ctx, n_gpu_layers, n_threads) and platform-specific settings for Metal and CUDA, and it provides troubleshooting steps for common out-of-memory and performance issues.

## When to use it

- Building a production LLM API with maximum throughput and OpenAI-compatible endpoints
- Running local or edge inference on laptops, phones, or Raspberry Pi
- Deploying CPU-only inference or Apple Silicon-optimized workloads
- Experimenting quickly on a desktop with minimal setup
- Optimizing latency-sensitive or memory-constrained deployments

## Best practices

- Prefer vLLM for GPU production serving when you need the highest throughput and OpenAI API compatibility
- Use llama.cpp with GGUF and Q4_K_M quantization for a good quality/size balance on consumer hardware
- Always enable KV cache reuse and tune n_ctx to fit memory constraints
- Use continuous batching for high request volumes and speculative decoding for latency-sensitive workloads
- Match n_threads to CPU cores and set n_gpu_layers/offload settings per platform (Metal or CUDA)

## Example use cases

- High-throughput chat API: deploy vLLM across multiple GPUs with continuous batching and PagedAttention
- Local developer workflow: run GGUF-quantized models with llama.cpp on a laptop for privacy-focused testing
- Edge/embedded inference: deploy Q4_K_M or Q2_K quantized models on Raspberry Pi or mobile devices
- Apple Silicon deployment: enable Metal offload with n_gpu_layers=-1 to accelerate inference
- Cost-conscious CPU server: use llama.cpp with appropriate n_threads and smaller quant formats

## FAQ

**Which engine is best for maximum throughput?**

vLLM generally delivers the highest throughput on GPUs thanks to PagedAttention and continuous batching.

**How do I run large models on limited RAM?**

Use GGUF quantization (e.g., Q4_K_M) to reduce model size, lower n_ctx, and enable GPU offload or layer offloading when available.
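
For example, with llama-cpp-python you can offload only part of the model when VRAM is tight and keep the rest in system RAM (the layer count and path are illustrative; tune `n_gpu_layers` to whatever fits):

```python
from llama_cpp import Llama

# Partial offload: keep some layers on the GPU, the rest in system RAM,
# and shrink the context window to reduce KV-cache memory.
llm = Llama(
    model_path="./models/model.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,  # offload only as many layers as fit in VRAM
    n_ctx=2048,       # smaller context = smaller KV cache
)
```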