---
name: nemo-evaluator-sdk
description: Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Evaluation, NeMo, NVIDIA, Benchmarking, MMLU, HumanEval, Multi-Backend, Slurm, Docker, Reproducible, Enterprise]
dependencies: [nemo-evaluator-launcher>=0.1.25, docker]
---
# NeMo Evaluator SDK - Enterprise LLM Benchmarking
## Quick Start
NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).
**Installation**:
```bash
pip install nemo-evaluator-launcher
```
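As a quick sanity check, you can confirm the install before writing any config; this uses only standard `pip`, nothing launcher-specific:
```bash
# Verify the launcher package is installed and show its version
pip show nemo-evaluator-launcher
```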
**Set API key and run evaluation**:
```bash
export NGC_API_KEY=nvapi-your-key-here
# Create minimal config
cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_
execution:
  output_dir: ./results
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
evaluation:
  tasks:
    - name: ifeval
EOF
# Run evaluation
nemo-evaluator-launcher run --config-dir . --config-name config
```
**View available tasks**:
```bash
nemo-evaluator-launcher ls tasks
```
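The task list is long, so assuming the command prints plain text to stdout, standard shell filtering helps locate a specific benchmark:
```bash
# Filter the task list for MMLU-related benchmarks (assumes plain-text output)
nemo-evaluator-launcher ls tasks | grep -i mmlu
```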
## Common Workflows
### Workflow 1: Evaluate Model on Standard Benchmarks
Run core academic benchmarks (MMLU, GSM8K, IFEval) on any OpenAI-compatible endpoint.
**Checklist**:
```
Standard Evaluation:
- [ ] Step 1: Configure API endpoint
- [ ] Step 2: Select benchmarks
- [ ] Step 3: Run evaluation
- [ ] Step 4: Check results
```
**Step 1: Configure API endpoint**
```yaml
# config.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_
execution:
  output_dir: ./results
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
```
For self-hosted endpoints (vLLM, TRT-LLM):
```yaml
target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""  # No key needed for local
```
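If no local endpoint is running yet, one common way to stand one up is vLLM's OpenAI-compatible server; this is a sketch assuming vLLM is installed, and the model name shown is illustrative:
```bash
# Start an OpenAI-compatible server on port 8000 (model name is an example)
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```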
**Step 2: Select benchmarks**
Add tasks to your config:
```yaml
evaluation:
  tasks:
    - name: ifeval               # Instruction following
    - name: gpqa_diamond         # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN       # Some tasks need HF token
    - name: gsm8k_cot_instruct   # Math reasoning
    - name: humaneval            # Code generation
```
**Step 3: Run evaluation**
```bash
# Run with config file
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config

# Override output directory
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o execution.output_dir=./my_results

# Limit samples for quick testing
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10
```
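For a fast smoke test before a full run, the documented overrides can be combined in one command that evaluates a single task on a handful of samples (the task choice here is just an example):
```bash
# Quick smoke test: one task, 10 samples
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o 'evaluation.tasks=[{name: ifeval}]' \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10
```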
**Step 4: Check results**
```bash
# Check job status
nemo-evaluator-launcher status <invocation_id>
# List all runs
nemo-evaluator-launcher ls runs
# View results
cat results/<invocation_id>/<task>/artifacts/results.yml
```
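To glance at every task's results from a single invocation, a simple shell loop over the artifacts directory is enough; this sketch assumes the directory layout shown above:
```bash
# Print results.yml for each task in one invocation (layout as shown above)
for f in results/<invocation_id>/*/artifacts/results.yml; do
  echo "=== $f"
  cat "$f"
done
```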
### Workflow 2: Run Evaluation on Slurm HPC Cluster
Execute large-scale evaluation on HPC infrastructure.
**Checklist**:
```
Slurm Evaluation:
- [ ] Step 1: Configure Slurm settings
- [ ] Step 2: Set up model deployment
- [ ] Step 3: Launch evaluation
- [ ] Step 4: Monitor job status
```
**Step 1: Configure Slurm settings**
```yaml
# slurm_config.yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_
execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8
```
**Step 2: Set up model deployment**
```yaml
deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
  max_model_len: 4096
target:
  api_endpoint:
    model_id: llama-3.1-8b
    # URL auto-generated by deployment
```
**Step 3: Launch evaluation**
```bash
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name slurm_config
```
**Step 4: Monitor job status**
```bash
# Check status (queries sacct)
nemo-evaluator-launcher status <invocation_id>
# View detailed info
nemo-evaluator-launcher info <invocation_id>
# Kill if needed
nemo-evaluator-launcher kill <invocation_id>
```
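If you have shell access to the cluster, you can also inspect the underlying Slurm jobs directly with standard Slurm tooling, independent of the launcher:
```bash
# Standard Slurm commands, run on the cluster itself
squeue -u $USER   # jobs still queued or running
sacct -u $USER    # accounting records (defaults to today's jobs)
```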
### Workflow 3: Compare Multiple Models
Benchmark multiple models on the same tasks for comparison.
**Checklist**:
```
Model Comparison:
- [ ] Step 1: Create base config
- [ ] Step 2: Run evaluations with overrides
- [ ] Step 3: Export and compare results
```
**Step 1: Create base config**
```yaml
# base_eval.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_
execution:
  output_dir: ./comparison_results
evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.01
        parallelism: 4
  tasks:
    - name: mmlu_pro
    - name: gsm8k_cot_instruct
    - name: ifeval
```
**Step 2: Run evaluations with model overrides**
```bash
# Evaluate Llama 3.1 8B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions

# Evaluate Mistral 7B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3 \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
```
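Since the two invocations differ only in the model ID, a small shell loop keeps the comparison consistent; this sketch reuses only the overrides shown above, and the model list is illustrative:
```bash
# Run the same benchmark suite against several hosted models
for model in meta/llama-3.1-8b-instruct mistralai/mistral-7b-instruct-v0.3; do
  nemo-evaluator-launcher run \
    --config-dir . \
    --config-name base_eval \
    -o target.api_endpoint.model_id="$model" \
    -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
done
```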
**Step 3: Export and compare**
```bash
# Export to MLflow
nemo-evaluator-launcher export <invocation_id_1> --dest mlflow
nemo-evaluator-launcher export <invocation_id_2> --dest mlflow
# Export to local JSON
nemo-evaluator-launcher export <invocation_id> --dest local --format json
# Export to Weights & Biases
nemo-evaluator-launcher export <invocation_id> --dest wandb
```
### Workflow 4: Safety and Vision-Language Evaluation
Evaluate models on safety benchmarks and VLM tasks.
**Checklist**:
```
Safety/VLM Evaluation:
- [ ] Step 1: Configure safety tasks
- [ ] Step 2: Set up VLM tasks (if applicable)
- [ ] Step 3: Run evaluation
```
**Step 1: Configure safety tasks**
```yaml
evaluation:
  tasks:
    - name: aegis       # Safety harness
    - name: wildguard   # Safety classification
    - name: garak       # Security probing
```
**Step 2: Configure VLM tasks**
```yaml
# For vision-language models
target:
  api_endpoint:
    type: vlm  # Vision-language endpoint
    model_id: nvidia/llama-3.2-90b-vision-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
evaluation:
  tasks:
    - name: ocrbench   # OCR evaluation
    - name: chartqa    # Chart understanding
    - name: mmmu       # Multimodal understanding
```
## When to Use vs Alternatives
**Use NeMo Evaluator when:**
- You need **100+ benchmarks** from 18+ harnesses in one platform
- You run evaluations on **Slurm HPC clusters** or in the cloud
- You require **reproducible**, containerized evaluation
- You evaluate against **OpenAI-compatible APIs** (vLLM, TRT-LLM, NIMs)
- You need **enterprise-grade** result export (MLflow, W&B)
**Use alternatives instead:**
- **lm-evaluation-harness**: Simpler setup for quick local evaluation
- **bigcode-evaluation-harness**: Focused only on code benchmarks
- **HELM**: Stanford's broader evaluation (fairness, efficiency)
- **Custom scripts**: Highly specialized domain evaluation
## Supported Harnesses and Tasks
| Harness | Task Count | Categories |
|---------|-----------|------------|
| `lm-evaluation-harness` | 60+ | MMLU, GSM8K, HellaSwag, ARC |
| `simple-evals` | 20+ | GPQA, MATH, AIME |
| `bigcode-evaluation-harness` | 25+ | HumanEval, MBPP, MultiPL-E |
| `safety-harness` | 3 | Aegis, WildGuard |
| `garak` | 1 | Security probing |
| `vlmevalkit` | 6+ | OCRBench, ChartQA, MMMU |
| `bfcl` | 6 | Function calling v2/v3 |
| `mtbench` | 2 | Multi-turn conversation |
| `livecodebench` | 10+ | Live coding evaluation |
| `helm` | 15 | Medical domain |
| `nemo-skills` | 8 | Math, science, agentic |
## Common Issues
**Issue: Container pull fails**
Ensure NGC credentials are configured:
```bash
docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY
```
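Passing the key on the command line leaves it in shell history; Docker's standard `--password-stdin` flag avoids that:
```bash
# Safer: read the NGC key from stdin instead of the command line
echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin
```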
**Issue: Task requires environment variable**
Some tasks need HF_TOKEN or JUDGE_API_KEY:
```yaml
evaluation:
  tasks:
    - name: gpqa_diamond
      env_vars:
        HF_TOKEN: HF_TOKEN  # Maps the task's variable name to one set in your shell
```
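The mapping refers to a variable in your shell environment, so export it before launching; the token value below is a placeholder:
```bash
# Export the token the task config refers to, then launch as usual
export HF_TOKEN=hf_your-token-here
nemo-evaluator-launcher run --config-dir . --config-name config
```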
**Issue: Evaluation timeout**
Increase parallelism or reduce samples:
```bash
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=100
```
**Issue: Slurm job not starting**
Check Slurm account and partition:
```yaml
execution:
  account: correct_account
  partition: gpu
  qos: normal  # May need specific QOS
```
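To confirm which accounts, partitions, and QOS values you are actually allowed to submit to, standard Slurm queries on the cluster are usually sufficient (independent of the launcher):
```bash
# Inspect your Slurm associations and the available partitions
sacctmgr show assoc user=$USER format=Account,Partition,QOS
sinfo -s   # summary of partitions and their states
```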
**Issue: Different results than expected**
Verify your generation and few-shot settings match those reported for the benchmark:
```yaml
evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0  # Deterministic
        num_fewshot: 5    # Check paper's fewshot count
```
## CLI Reference
| Command | Description |
|---------|-------------|
| `run` | Execute evaluation with config |
| `status <id>` | Check job status |
| `info <id>` | View detailed job info |
| `ls tasks` | List available benchmarks |
| `ls runs` | List all invocations |
| `export <id>` | Export results (mlflow/wandb/local) |
| `kill <id>` | Terminate running job |
## Configuration Override Examples
```bash
# Override model endpoint
-o target.api_endpoint.model_id=my-model
-o target.api_endpoint.url=http://localhost:8000/v1/chat/completions
# Add evaluation parameters
-o +evaluation.nemo_evaluator_config.config.params.temperature=0.5
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=50
# Change execution settings
-o execution.output_dir=/custom/path
-o execution.mode=parallel
# Dynamically set tasks
-o 'evaluation.tasks=[{name: ifeval}, {name: gsm8k}]'
```
## Python API Usage
For programmatic evaluation without the CLI:
```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ApiEndpoint,
    EndpointType,
    ConfigParams,
)

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=4,
    ),
)

# Configure target endpoint
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        model_id="meta/llama-3.1-8b-instruct",
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        api_key="nvapi-your-key-here",
    ),
)

# Run evaluation
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
```
## Advanced Topics
**Multi-backend execution**: See [references/execution-backends.md](references/execution-backends.md)
**Configuration deep-dive**: See [references/configuration.md](references/configuration.md)
**Adapter and interceptor system**: See [references/adapter-system.md](references/adapter-system.md)
**Custom benchmark integration**: See [references/custom-benchmarks.md](references/custom-benchmarks.md)
## Requirements
- **Python**: 3.10-3.13
- **Docker**: Required for local execution
- **NGC API Key**: For pulling containers and using NVIDIA Build
- **HF_TOKEN**: Required for some benchmarks (GPQA, MMLU)
## Resources
- **GitHub**: https://github.com/NVIDIA-NeMo/Evaluator
- **NGC Containers**: nvcr.io/nvidia/eval-factory/
- **NVIDIA Build**: https://build.nvidia.com (free hosted models)
- **Documentation**: https://github.com/NVIDIA-NeMo/Evaluator/tree/main/docs
## How It Works
NeMo Evaluator packages each benchmark harness and its execution logic into containers and runs them against any OpenAI-compatible endpoint, including self-hosted servers (vLLM, TRT-LLM, NIMs). It orchestrates tasks, manages environment variables, and collects artifacts and metrics per invocation. The same configuration runs on local Docker, Slurm clusters, or cloud services, preserving reproducibility and exportable results (MLflow, W&B, local formats).
## FAQ
**What runtimes/backends are supported?**
Local Docker, Slurm HPC, and cloud deployments (Lepton/NGC). The same configs work across backends for reproducibility.
**What do I need to pull containers?**
An NGC API key and a `docker login` to nvcr.io are required to pull the evaluation containers; self-hosted endpoints themselves can run without NGC keys.
**How do I limit run time for quick tests?**
Override `evaluation.nemo_evaluator_config.config.params.limit_samples` and `parallelism` via the CLI `-o` flags to reduce the sample count and speed up runs.