
This skill optimizes LLM inference on NVIDIA GPUs with TensorRT for maximum throughput and lowest latency in production.

npx playbooks add skill orchestra-research/ai-research-skills --skill tensorrt-llm

---
name: tensorrt-llm
description: Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Inference Serving, TensorRT-LLM, NVIDIA, Inference Optimization, High Throughput, Low Latency, Production, FP8, INT4, In-Flight Batching, Multi-GPU]
dependencies: [tensorrt-llm, torch]
---

# TensorRT-LLM

NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.

## When to use TensorRT-LLM

**Use TensorRT-LLM when:**
- Deploying on NVIDIA GPUs (A100, H100, GB200)
- Need maximum throughput (24,000+ tokens/sec on Llama 3)
- Require low latency for real-time applications
- Working with quantized models (FP8, INT4, FP4)
- Scaling across multiple GPUs or nodes

**Use vLLM instead when:**
- Need simpler setup and Python-first API
- Want PagedAttention without TensorRT compilation
- Working with AMD GPUs or non-NVIDIA hardware

**Use llama.cpp instead when:**
- Deploying on CPU or Apple Silicon
- Need edge deployment without NVIDIA GPUs
- Want simpler GGUF quantization format

## Quick start

### Installation

```bash
# Docker (recommended) -- official images are published on NVIDIA NGC;
# pick the tag that matches the release you want
docker pull nvcr.io/nvidia/tensorrt-llm/release:latest

# pip install
pip install tensorrt_llm==1.2.0rc3

# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12
```
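
To confirm the install resolved to a working build, a quick import check (assuming the package exposes `__version__`, as recent releases do):

```python
# Sanity check: the import should succeed and report the installed release
import tensorrt_llm

print(tensorrt_llm.__version__)
```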

### Basic inference

```python
from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9
)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

### Serving with trtllm-serve

```bash
# Start server (automatic model download and compilation)
# with tensor parallelism across 4 GPUs (--tp_size 4)
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --tp_size 4 \
    --max_batch_size 256 \
    --max_num_tokens 4096

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
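
Since trtllm-serve exposes an OpenAI-compatible endpoint (as the curl call above shows), any OpenAI-style client can talk to it. A minimal sketch with the `openai` Python package, assuming the server started above is listening on localhost:8000:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local trtllm-serve endpoint;
# the API key is not validated by default, so a placeholder typically works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)
```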

## Key features

### Performance optimizations
- **In-flight batching**: Dynamic batching during generation
- **Paged KV cache**: Efficient memory management (tuning sketch below)
- **Flash Attention**: Optimized attention kernels
- **Quantization**: FP8, INT4, FP4 for 2-4× faster inference
- **CUDA graphs**: Reduced kernel launch overhead
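
As a rough sketch of how the KV-cache and batching knobs above surface in the LLM API (field names assume a recent release; check the docs for your version):

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Give the paged KV cache most of the free GPU memory and reuse
# cache blocks across requests that share a common prefix.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    kv_cache_config=KvCacheConfig(
        free_gpu_memory_fraction=0.9,
        enable_block_reuse=True,
    ),
    max_num_tokens=8192,  # upper bound on tokens scheduled per iteration
)
```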

### Parallelism
- **Tensor parallelism (TP)**: Split model across GPUs
- **Pipeline parallelism (PP)**: Layer-wise distribution (combined TP/PP sketch below)
- **Expert parallelism**: For Mixture-of-Experts models
- **Multi-node**: Scale beyond single machine
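
For a concrete picture of combining TP and PP (referenced above), here is a sketch that splits a 70B model across 8 GPUs as 4-way tensor × 2-way pipeline parallelism; `pipeline_parallel_size` is assumed to be available alongside `tensor_parallel_size` in your release:

```python
from tensorrt_llm import LLM

# 8 GPUs total: each layer is sharded across 4 GPUs (TP),
# and the layer stack is split into 2 sequential stages (PP).
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,
    pipeline_parallel_size=2,
)
```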

### Advanced features
- **Speculative decoding**: Faster generation with draft models
- **LoRA serving**: Efficient multi-adapter deployment
- **Disaggregated serving**: Separate prefill and generation

## Common patterns

### Quantized model (FP8)

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

# Quantize to FP8 at load time (~2× faster, ~50% of the FP16 memory)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),
    max_num_tokens=8192
)

# Inference same as before
outputs = llm.generate(["Summarize this article..."])
```

### Multi-GPU deployment

```python
# Tensor parallelism across 8 GPUs, with FP8 quantization to fit the weights
# (QuantConfig / QuantAlgo imported as in the FP8 example above)
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8)
)
```

### Batch inference

```python
# Process 100 prompts efficiently
prompts = [f"Question {i}: ..." for i in range(100)]

outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200)
)

# Automatic in-flight batching for maximum throughput
```

## Performance benchmarks

**Meta Llama 3-8B** (H100 GPU):
- Throughput: 24,000 tokens/sec
- Latency: ~10ms per token
- vs PyTorch: **100× faster**

**Llama 3-70B** (8× A100 80GB):
- FP8 quantization: 2× faster than FP16
- Memory: 50% reduction with FP8

## Supported models

- **LLaMA family**: Llama 2, Llama 3, CodeLlama
- **GPT family**: GPT-2, GPT-J, GPT-NeoX
- **Qwen**: Qwen, Qwen2, QwQ
- **DeepSeek**: DeepSeek-V2, DeepSeek-V3
- **Mixtral**: Mixtral-8x7B, Mixtral-8x22B
- **Vision**: LLaVA, Phi-3-vision
- **100+ models** on HuggingFace

## References

- **[Optimization Guide](references/optimization.md)** - Quantization, batching, KV cache tuning
- **[Multi-GPU Setup](references/multi-gpu.md)** - Tensor/pipeline parallelism, multi-node
- **[Serving Guide](references/serving.md)** - Production deployment, monitoring, autoscaling

## Resources

- **Docs**: https://nvidia.github.io/TensorRT-LLM/
- **GitHub**: https://github.com/NVIDIA/TensorRT-LLM
- **Models**: https://huggingface.co/models?library=tensorrt_llm


Overview

This skill optimizes large language model inference using NVIDIA TensorRT to deliver maximum throughput and minimal latency on NVIDIA GPUs. It targets production deployments on A100/H100/GB200-class hardware and supports quantization, multi-GPU scaling, and in-flight batching for real-time and high-throughput use cases.

How this skill works

The skill compiles and runs LLMs with TensorRT kernels, leveraging optimized attention, paged KV cache, CUDA graphs, and hardware-aware quantization (FP8/INT4/FP4) to reduce compute and memory overhead. It exposes APIs for model loading, sampling, and serving, and supports tensor/pipeline parallelism, speculative decoding, and automatic in-flight batching to maximize utilization across GPUs.

When to use it

  • Deploying LLMs on NVIDIA GPUs (A100, H100, GB200) for production inference
  • Needing 10–100× faster inference than standard PyTorch runtimes
  • Serving quantized models (FP8, INT4, FP4) to reduce memory and cost
  • Scaling inference across multiple GPUs or nodes with tensor/pipeline parallelism
  • Low-latency or real-time generation where per-token latency matters

Best practices

  • Use Docker images or pinned pip releases matched to your CUDA and TensorRT versions to avoid compatibility issues
  • Precompile and warm caches for production models to eliminate first-request stalls
  • Choose quantization (FP8/INT4) after validating accuracy trade-offs on representative inputs
  • Tune tensor/pipeline parallelism to match GPU memory and interconnect topology
  • Enable in-flight batching and set max_batch_size based on typical traffic to maximize throughput

Example use cases

  • Real-time chat assistants requiring sub-20ms token latency on H100
  • High-volume API serving with 10k+ tokens/sec throughput using in-flight batching
  • Cost-efficient hosting of large models using FP8 quantization to halve memory footprint
  • Multi-GPU deployment of Llama-3-70B across 8 A100s for high-concurrency inference
  • Edge server farms using trtllm-serve with automatic model download and compilation

FAQ

Which GPUs and software versions are required?

TensorRT-LLM targets NVIDIA GPUs (A100/H100/GB200) and requires compatible CUDA and TensorRT versions; use the official Docker image or check release notes for exact CUDA/TensorRT/Python compatibility.

When should I prefer vLLM or llama.cpp instead?

Use vLLM for simpler Python-first setups or PagedAttention without compilation; use llama.cpp for CPU or Apple Silicon edge deployments where NVIDIA GPUs are not available.