
This skill optimizes LLM inference on NVIDIA GPUs with TensorRT for maximum throughput and lowest latency in production.

npx playbooks add skill orchestra-research/ai-research-skills --skill tensorrt-llm

---
name: tensorrt-llm
description: Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Inference Serving, TensorRT-LLM, NVIDIA, Inference Optimization, High Throughput, Low Latency, Production, FP8, INT4, In-Flight Batching, Multi-GPU]
dependencies: [tensorrt-llm, torch]
---

# TensorRT-LLM

NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.

## When to use TensorRT-LLM

**Use TensorRT-LLM when:**
- Deploying on NVIDIA GPUs (A100, H100, GB200)
- Need maximum throughput (24,000+ tokens/sec on Llama 3)
- Require low latency for real-time applications
- Working with quantized models (FP8, INT4, FP4)
- Scaling across multiple GPUs or nodes

**Use vLLM instead when:**
- Need simpler setup and Python-first API
- Want PagedAttention without TensorRT compilation
- Working with AMD GPUs or non-NVIDIA hardware

**Use llama.cpp instead when:**
- Deploying on CPU or Apple Silicon
- Need edge deployment without NVIDIA GPUs
- Want simpler GGUF quantization format

## Quick start

### Installation

```bash
# Docker (recommended) -- official images are published on NVIDIA NGC;
# pick the tag that matches the release you want
docker pull nvcr.io/nvidia/tensorrt-llm/release:latest

# pip install
pip install tensorrt_llm==1.2.0rc3

# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12
```
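
To confirm the install resolved to a working build, a quick import check (assuming the package exposes `__version__`, as recent releases do):

```python
# Sanity check: the import should succeed and report the installed release
import tensorrt_llm

print(tensorrt_llm.__version__)
```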

### Basic inference

```python
from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9
)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

### Serving with trtllm-serve

```bash
# Start server (automatic model download and compilation)
# with tensor parallelism across 4 GPUs (--tp_size 4)
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --tp_size 4 \
    --max_batch_size 256 \
    --max_num_tokens 4096

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
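
Since trtllm-serve exposes an OpenAI-compatible endpoint (as the curl call above shows), any OpenAI-style client can talk to it. A minimal sketch with the `openai` Python package, assuming the server started above is listening on localhost:8000:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local trtllm-serve endpoint;
# the API key is not validated by default, so a placeholder typically works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)
```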

## Key features

### Performance optimizations
- **In-flight batching**: Dynamic batching during generation
- **Paged KV cache**: Efficient memory management (tuning sketch below)
- **Flash Attention**: Optimized attention kernels
- **Quantization**: FP8, INT4, FP4 for 2-4× faster inference
- **CUDA graphs**: Reduced kernel launch overhead
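
As a rough sketch of how the KV-cache and batching knobs above surface in the LLM API (field names assume a recent release; check the docs for your version):

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Give the paged KV cache most of the free GPU memory and reuse
# cache blocks across requests that share a common prefix.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    kv_cache_config=KvCacheConfig(
        free_gpu_memory_fraction=0.9,
        enable_block_reuse=True,
    ),
    max_num_tokens=8192,  # upper bound on tokens scheduled per iteration
)
```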

### Parallelism
- **Tensor parallelism (TP)**: Split model across GPUs
- **Pipeline parallelism (PP)**: Layer-wise distribution (combined TP/PP sketch below)
- **Expert parallelism**: For Mixture-of-Experts models
- **Multi-node**: Scale beyond single machine
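
For a concrete picture of combining TP and PP (referenced above), here is a sketch that splits a 70B model across 8 GPUs as 4-way tensor × 2-way pipeline parallelism; `pipeline_parallel_size` is assumed to be available alongside `tensor_parallel_size` in your release:

```python
from tensorrt_llm import LLM

# 8 GPUs total: each layer is sharded across 4 GPUs (TP),
# and the layer stack is split into 2 sequential stages (PP).
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,
    pipeline_parallel_size=2,
)
```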

### Advanced features
- **Speculative decoding**: Faster generation with draft models
- **LoRA serving**: Efficient multi-adapter deployment
- **Disaggregated serving**: Separate prefill and generation

## Common patterns

### Quantized model (FP8)

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

# Quantize to FP8 at load time (~2× faster, ~50% of the FP16 memory)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),
    max_num_tokens=8192
)

# Inference same as before
outputs = llm.generate(["Summarize this article..."])
```

### Multi-GPU deployment

```python
# Tensor parallelism across 8 GPUs, with FP8 quantization to fit the weights
# (QuantConfig / QuantAlgo imported as in the FP8 example above)
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8)
)
```

### Batch inference

```python
# Process 100 prompts efficiently
prompts = [f"Question {i}: ..." for i in range(100)]

outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200)
)

# Automatic in-flight batching for maximum throughput
```

## Performance benchmarks

**Meta Llama 3-8B** (H100 GPU):
- Throughput: 24,000 tokens/sec
- Latency: ~10ms per token
- vs PyTorch: **100× faster**

**Llama 3-70B** (8× A100 80GB):
- FP8 quantization: 2× faster than FP16
- Memory: 50% reduction with FP8

## Supported models

- **LLaMA family**: Llama 2, Llama 3, CodeLlama
- **GPT family**: GPT-2, GPT-J, GPT-NeoX
- **Qwen**: Qwen, Qwen2, QwQ
- **DeepSeek**: DeepSeek-V2, DeepSeek-V3
- **Mixtral**: Mixtral-8x7B, Mixtral-8x22B
- **Vision**: LLaVA, Phi-3-vision
- **100+ models** on HuggingFace

## References

- **[Optimization Guide](references/optimization.md)** - Quantization, batching, KV cache tuning
- **[Multi-GPU Setup](references/multi-gpu.md)** - Tensor/pipeline parallelism, multi-node
- **[Serving Guide](references/serving.md)** - Production deployment, monitoring, autoscaling

## Resources

- **Docs**: https://nvidia.github.io/TensorRT-LLM/
- **GitHub**: https://github.com/NVIDIA/TensorRT-LLM
- **Models**: https://huggingface.co/models?library=tensorrt_llm


Overview

This skill optimizes large language model inference using NVIDIA TensorRT to deliver maximum throughput and minimal latency on NVIDIA GPUs. It targets production deployments on A100/H100/GB200-class hardware and supports quantization, multi-GPU scaling, and in-flight batching for real-time and high-throughput use cases.

How this skill works

The skill compiles and runs LLMs with TensorRT kernels, leveraging optimized attention, paged KV cache, CUDA graphs, and hardware-aware quantization (FP8/INT4/FP4) to reduce compute and memory overhead. It exposes APIs for model loading, sampling, and serving, and supports tensor/pipeline parallelism, speculative decoding, and automatic in-flight batching to maximize utilization across GPUs.

When to use it

  • Deploying LLMs on NVIDIA GPUs (A100, H100, GB200) for production inference
  • Needing 10–100× faster inference than standard PyTorch runtimes
  • Serving quantized models (FP8, INT4, FP4) to reduce memory and cost
  • Scaling inference across multiple GPUs or nodes with tensor/pipeline parallelism
  • Low-latency or real-time generation where per-token latency matters

Best practices

  • Use Docker images or pinned pip releases matched to your CUDA and TensorRT versions to avoid compatibility issues
  • Precompile and warm caches for production models to eliminate first-request stalls
  • Choose quantization (FP8/INT4) after validating accuracy trade-offs on representative inputs
  • Tune tensor/pipeline parallelism to match GPU memory and interconnect topology
  • Enable in-flight batching and set max_batch_size based on typical traffic to maximize throughput

Example use cases

  • Real-time chat assistants requiring sub-20ms token latency on H100
  • High-volume API serving with 10k+ tokens/sec throughput using in-flight batching
  • Cost-efficient hosting of large models using FP8 quantization to halve memory footprint
  • Multi-GPU deployment of Llama-3-70B across 8 A100s for high-concurrency inference
  • Edge server farms using trtllm-serve with automatic model download and compilation

FAQ

Which GPUs and software versions are required?

TensorRT-LLM targets NVIDIA GPUs (A100/H100/GB200) and requires compatible CUDA and TensorRT versions; use the official Docker image or check release notes for exact CUDA/TensorRT/Python compatibility.

When should I prefer vLLM or llama.cpp instead?

Use vLLM for simpler Python-first setups or PagedAttention without compilation; use llama.cpp for CPU or Apple Silicon edge deployments where NVIDIA GPUs are not available.