nemo-evaluator skill

This skill helps you benchmark LLMs across 100+ benchmarks with containerized, scalable evaluation on local Docker, Slurm HPC, or cloud platforms.

npx playbooks add skill orchestra-research/ai-research-skills --skill nemo-evaluator

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: nemo-evaluator-sdk
description: Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Evaluation, NeMo, NVIDIA, Benchmarking, MMLU, HumanEval, Multi-Backend, Slurm, Docker, Reproducible, Enterprise]
dependencies: [nemo-evaluator-launcher>=0.1.25, docker]
---

# NeMo Evaluator SDK - Enterprise LLM Benchmarking

## Quick Start

NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).

**Installation**:
```bash
pip install nemo-evaluator-launcher
```

**Set API key and run evaluation**:
```bash
export NGC_API_KEY=nvapi-your-key-here

# Create minimal config
cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
EOF

# Run evaluation
nemo-evaluator-launcher run --config-dir . --config-name config
```

**View available tasks**:
```bash
nemo-evaluator-launcher ls tasks
```

## Common Workflows

### Workflow 1: Evaluate Model on Standard Benchmarks

Run core academic benchmarks (MMLU, GSM8K, IFEval) on any OpenAI-compatible endpoint.

**Checklist**:
```
Standard Evaluation:
- [ ] Step 1: Configure API endpoint
- [ ] Step 2: Select benchmarks
- [ ] Step 3: Run evaluation
- [ ] Step 4: Check results
```

**Step 1: Configure API endpoint**

```yaml
# config.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
```

For self-hosted endpoints (vLLM, TRT-LLM):
```yaml
target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""  # No key needed for local
```

**Step 2: Select benchmarks**

Add tasks to your config:
```yaml
evaluation:
  tasks:
    - name: ifeval           # Instruction following
    - name: gpqa_diamond     # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN   # Some tasks need HF token
    - name: gsm8k_cot_instruct  # Math reasoning
    - name: humaneval        # Code generation
```

**Step 3: Run evaluation**

```bash
# Run with config file
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config

# Override output directory
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o execution.output_dir=./my_results

# Limit samples for quick testing
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10
```

**Step 4: Check results**

```bash
# Check job status
nemo-evaluator-launcher status <invocation_id>

# List all runs
nemo-evaluator-launcher ls runs

# View results
cat results/<invocation_id>/<task>/artifacts/results.yml
```

### Workflow 2: Run Evaluation on Slurm HPC Cluster

Execute large-scale evaluation on HPC infrastructure.

**Checklist**:
```
Slurm Evaluation:
- [ ] Step 1: Configure Slurm settings
- [ ] Step 2: Set up model deployment
- [ ] Step 3: Launch evaluation
- [ ] Step 4: Monitor job status
```

**Step 1: Configure Slurm settings**

```yaml
# slurm_config.yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_

execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8
```

**Step 2: Set up model deployment**

```yaml
deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
  max_model_len: 4096

target:
  api_endpoint:
    model_id: llama-3.1-8b
    # URL auto-generated by deployment
```

**Step 3: Launch evaluation**

```bash
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name slurm_config
```

**Step 4: Monitor job status**

```bash
# Check status (queries sacct)
nemo-evaluator-launcher status <invocation_id>

# View detailed info
nemo-evaluator-launcher info <invocation_id>

# Kill if needed
nemo-evaluator-launcher kill <invocation_id>
```

### Workflow 3: Compare Multiple Models

Benchmark multiple models on the same tasks for comparison.

**Checklist**:
```
Model Comparison:
- [ ] Step 1: Create base config
- [ ] Step 2: Run evaluations with overrides
- [ ] Step 3: Export and compare results
```

**Step 1: Create base config**

```yaml
# base_eval.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./comparison_results

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.01
        parallelism: 4
  tasks:
    - name: mmlu_pro
    - name: gsm8k_cot_instruct
    - name: ifeval
```

**Step 2: Run evaluations with model overrides**

```bash
# Evaluate Llama 3.1 8B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions

# Evaluate Mistral 7B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3 \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
```

**Step 3: Export and compare**

```bash
# Export to MLflow
nemo-evaluator-launcher export <invocation_id_1> --dest mlflow
nemo-evaluator-launcher export <invocation_id_2> --dest mlflow

# Export to local JSON
nemo-evaluator-launcher export <invocation_id> --dest local --format json

# Export to Weights & Biases
nemo-evaluator-launcher export <invocation_id> --dest wandb
```

### Workflow 4: Safety and Vision-Language Evaluation

Evaluate models on safety benchmarks and VLM tasks.

**Checklist**:
```
Safety/VLM Evaluation:
- [ ] Step 1: Configure safety tasks
- [ ] Step 2: Set up VLM tasks (if applicable)
- [ ] Step 3: Run evaluation
```

**Step 1: Configure safety tasks**

```yaml
evaluation:
  tasks:
    - name: aegis              # Safety harness
    - name: wildguard          # Safety classification
    - name: garak              # Security probing
```

**Step 2: Configure VLM tasks**

```yaml
# For vision-language models
target:
  api_endpoint:
    type: vlm  # Vision-language endpoint
    model_id: nvidia/llama-3.2-90b-vision-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions

evaluation:
  tasks:
    - name: ocrbench           # OCR evaluation
    - name: chartqa            # Chart understanding
    - name: mmmu               # Multimodal understanding
```

## When to Use vs Alternatives

**Use NeMo Evaluator when:**
- You need **100+ benchmarks** from 18+ harnesses in one platform
- You run evaluations on **Slurm HPC clusters** or in the cloud
- You require **reproducible**, containerized evaluation
- You evaluate against **OpenAI-compatible APIs** (vLLM, TRT-LLM, NIMs)
- You need **enterprise-grade** result export (MLflow, W&B)

**Use alternatives instead:**
- **lm-evaluation-harness**: Simpler setup for quick local evaluation
- **bigcode-evaluation-harness**: Focused only on code benchmarks
- **HELM**: Stanford's broader evaluation (fairness, efficiency)
- **Custom scripts**: Highly specialized domain evaluation

## Supported Harnesses and Tasks

| Harness | Task Count | Categories |
|---------|-----------|------------|
| `lm-evaluation-harness` | 60+ | MMLU, GSM8K, HellaSwag, ARC |
| `simple-evals` | 20+ | GPQA, MATH, AIME |
| `bigcode-evaluation-harness` | 25+ | HumanEval, MBPP, MultiPL-E |
| `safety-harness` | 3 | Aegis, WildGuard |
| `garak` | 1 | Security probing |
| `vlmevalkit` | 6+ | OCRBench, ChartQA, MMMU |
| `bfcl` | 6 | Function calling v2/v3 |
| `mtbench` | 2 | Multi-turn conversation |
| `livecodebench` | 10+ | Live coding evaluation |
| `helm` | 15 | Medical domain |
| `nemo-skills` | 8 | Math, science, agentic |

## Common Issues

**Issue: Container pull fails**

Ensure NGC credentials are configured:
```bash
docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY
```

**Issue: Task requires environment variable**

Some tasks need HF_TOKEN or JUDGE_API_KEY:
```yaml
evaluation:
  tasks:
    - name: gpqa_diamond
      env_vars:
        HF_TOKEN: HF_TOKEN  # Expose the host's HF_TOKEN to the task container under the same name
```

**Issue: Evaluation timeout**

Increase parallelism or reduce samples:
```bash
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=100
```

**Issue: Slurm job not starting**

Check Slurm account and partition:
```yaml
execution:
  account: correct_account
  partition: gpu
  qos: normal  # May need specific QOS
```

**Issue: Different results than expected**

Verify configuration matches reported settings:
```yaml
evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0  # Deterministic
        num_fewshot: 5    # Check paper's fewshot count
```

## CLI Reference

| Command | Description |
|---------|-------------|
| `run` | Execute evaluation with config |
| `status <id>` | Check job status |
| `info <id>` | View detailed job info |
| `ls tasks` | List available benchmarks |
| `ls runs` | List all invocations |
| `export <id>` | Export results (mlflow/wandb/local) |
| `kill <id>` | Terminate running job |
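
A typical session strings these commands together. The sketch below assumes a `config.yaml` in the current directory; substitute the invocation id reported by `run` (or found via `ls runs`):

```bash
# Launch an evaluation from the local config
nemo-evaluator-launcher run --config-dir . --config-name config

# Track progress and inspect details
nemo-evaluator-launcher status <invocation_id>
nemo-evaluator-launcher info <invocation_id>

# Export the finished run (mlflow, wandb, or local)
nemo-evaluator-launcher export <invocation_id> --dest local --format json
```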

## Configuration Override Examples

```bash
# Override model endpoint
-o target.api_endpoint.model_id=my-model
-o target.api_endpoint.url=http://localhost:8000/v1/chat/completions

# Add evaluation parameters
-o +evaluation.nemo_evaluator_config.config.params.temperature=0.5
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=50

# Change execution settings
-o execution.output_dir=/custom/path
-o execution.mode=parallel

# Dynamically set tasks
-o 'evaluation.tasks=[{name: ifeval}, {name: gsm8k}]'
```

## Python API Usage

For programmatic evaluation without the CLI:

```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ApiEndpoint,
    EndpointType,
    ConfigParams
)

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=4
    )
)

# Configure target endpoint
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        model_id="meta/llama-3.1-8b-instruct",
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        api_key="nvapi-your-key-here"
    )
)

# Run evaluation
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
```

## Advanced Topics

**Multi-backend execution**: See [references/execution-backends.md](references/execution-backends.md)
**Configuration deep-dive**: See [references/configuration.md](references/configuration.md)
**Adapter and interceptor system**: See [references/adapter-system.md](references/adapter-system.md)
**Custom benchmark integration**: See [references/custom-benchmarks.md](references/custom-benchmarks.md)

## Requirements

- **Python**: 3.10-3.13
- **Docker**: Required for local execution
- **NGC API Key**: For pulling containers and using NVIDIA Build
- **HF_TOKEN**: Required for some benchmarks (GPQA, MMLU)
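
A minimal shell setup that satisfies these requirements might look like the following; the key values are placeholders:

```bash
# API keys referenced by the configs in this skill (placeholder values)
export NGC_API_KEY=nvapi-your-key-here   # NVIDIA Build endpoints and NGC containers
export HF_TOKEN=hf_your_token_here       # gated benchmarks such as GPQA
```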

## Resources

- **GitHub**: https://github.com/NVIDIA-NeMo/Evaluator
- **NGC Containers**: nvcr.io/nvidia/eval-factory/
- **NVIDIA Build**: https://build.nvidia.com (free hosted models)
- **Documentation**: https://github.com/NVIDIA-NeMo/Evaluator/tree/main/docs

Overview

This skill evaluates large language models across 100+ benchmarks from 18+ harnesses using a container-first, reproducible platform. It supports multi-backend execution (local Docker, Slurm HPC, cloud) and exports results to MLflow, W&B or local formats for enterprise-grade benchmarking. Use it to run standard academic tasks, safety checks, vision-language tests, and large-scale comparisons across models.

How this skill works

The evaluator packages benchmarks and execution logic into containers and runs them against any OpenAI-compatible endpoint or self-hosted API (vLLM, TRT-LLM, NIMs). It orchestrates tasks, manages environment variables, and collects artifacts and metrics per invocation. Multi-backend support lets you run on local Docker, Slurm clusters, or cloud services while preserving reproducibility and exportable results.
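
As a concrete sketch of that flow, mirroring the Quick Start above but pointed at a self-hosted OpenAI-compatible endpoint (the model name and URL are placeholders):

```bash
# Minimal local-Docker evaluation against a self-hosted endpoint
cat > local_eval.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""   # no key needed for a local endpoint

evaluation:
  tasks:
    - name: ifeval
EOF

nemo-evaluator-launcher run --config-dir . --config-name local_eval

# Per-task artifacts and metrics land under ./results/<invocation_id>/<task>/artifacts/
```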

When to use it

  • You need a single platform covering 100+ benchmarks (MMLU, GSM8K, HumanEval, safety, VLM).
  • You want to benchmark models on Slurm or other HPC clusters with reproducible containerized jobs.
  • You want to compare multiple models on identical tasks and export results to MLflow or W&B.
  • You need to run safety assessments and vision-language evaluations alongside standard NLP benchmarks.
  • You want to integrate reproducible evaluations into CI or research pipelines.

Best practices

  • Containerize deployments and configure NGC Docker credentials before pulling enterprise images.
  • Provide required env vars per task (HF_TOKEN, JUDGE_API_KEY) in the evaluation config to avoid runtime failures.
  • Start with limit_samples and modest parallelism when testing a config, then raise parallelism for full runs (see the sketch after this list).
  • Use deterministic settings (temperature=0.0, fixed seeds, num_fewshot matching the reference papers) when comparing models.
  • Export runs to a common store (MLflow, W&B, or local JSON) to simplify side-by-side comparison.
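
The override flags below roll several of these practices into a single smoke-test invocation; `base_eval` stands in for whichever config you already use:

```bash
# Smoke test: few samples, modest parallelism, deterministic decoding
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10 \
  -o +evaluation.nemo_evaluator_config.config.params.parallelism=2 \
  -o +evaluation.nemo_evaluator_config.config.params.temperature=0.0
```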

Example use cases

  • Run MMLU, GSM8K, and HumanEval on a hosted OpenAI-compatible endpoint to establish baseline scores.
  • Launch a Slurm job to evaluate Llama 3.1 across 1k+ samples on 8 GPUs and export metrics to MLflow.
  • Compare two models by running a base config twice with different target.api_endpoint overrides and exporting both results to W&B.
  • Execute safety and VLM harnesses to validate content moderation and multimodal capabilities before deployment.
  • Run a quick smoke test with limit_samples=10 on local Docker to validate the config and env vars.

FAQ

What runtimes/backends are supported?

Local Docker, Slurm HPC, and cloud platforms such as Lepton. The same configs work across backends, which keeps runs reproducible.

What do I need to pull containers?

A docker login to nvcr.io with an NGC API key is required to pull the evaluation containers; self-hosted model endpoints themselves do not need an NGC key.
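
The login itself is the same command shown under Common Issues in SKILL.md, assuming NGC_API_KEY is already exported:

```bash
docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY
```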

How do I limit run time for quick tests?

Override evaluation.nemo_evaluator_config.config.params.limit_samples and parallelism with -o flags on the CLI to reduce the sample count and speed up runs.
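
For example, appended to any `run` invocation (the values here are illustrative):

```bash
nemo-evaluator-launcher run --config-dir . --config-name config \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10 \
  -o +evaluation.nemo_evaluator_config.config.params.parallelism=8
```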