
This skill enables rapid LLM benchmarking with 100+ tasks, reproducible containerized evaluation, and multi-backend execution for scalable model assessment.

```bash
npx playbooks add skill eyadsibai/ltk --skill nemo-evaluator
```

---
name: nemo-evaluator
description: Use when evaluating LLMs, running benchmarks like MMLU/HumanEval/GSM8K, setting up evaluation pipelines, or asking about "NeMo Evaluator", "LLM benchmarking", "model evaluation", "MMLU", "HumanEval", "GSM8K", "benchmark harnesses"
version: 1.0.0
---

# NeMo Evaluator SDK - Enterprise LLM Benchmarking

## Quick Start

NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).

**Installation**:

```bash
pip install nemo-evaluator-launcher
```

**Basic evaluation**:

```bash
export NGC_API_KEY=nvapi-your-key-here

cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
EOF

nemo-evaluator-launcher run --config-dir . --config-name config
```

## Common Workflows

### Workflow 1: Standard Model Evaluation

**Checklist**:

```
- [ ] Configure API endpoint (NVIDIA Build or self-hosted)
- [ ] Select benchmarks (MMLU, GSM8K, IFEval, HumanEval)
- [ ] Run evaluation
- [ ] Check results
```

**Step 1: Configure endpoint**

For NVIDIA Build:

```yaml
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
```

For self-hosted (vLLM, TRT-LLM):

```yaml
target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""
```

**Step 2: Select benchmarks**

```yaml
evaluation:
  tasks:
    - name: ifeval           # Instruction following
    - name: gpqa_diamond     # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN
    - name: gsm8k_cot_instruct  # Math reasoning
    - name: humaneval        # Code generation
```

**Step 3: Run and check results**

```bash
nemo-evaluator-launcher run --config-dir . --config-name config
nemo-evaluator-launcher status <invocation_id>
cat results/<invocation_id>/<task>/artifacts/results.yml
```
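The per-task artifacts can be scraped into a quick summary. A minimal sketch, assuming the artifact layout shown above (`results/<invocation_id>/<task>/artifacts/results.yml`); the metric key used here (`accuracy`) is an assumption, so adjust the `grep` pattern to match your harness's actual output:

```bash
# Summarize per-task scores from run artifacts.
# The mock layout below mirrors results/<invocation_id>/<task>/artifacts/results.yml;
# the 'accuracy' key is an assumed metric name -- adjust for your harness.
RESULTS_DIR=$(mktemp -d)
mkdir -p "$RESULTS_DIR/abc123/ifeval/artifacts"
cat > "$RESULTS_DIR/abc123/ifeval/artifacts/results.yml" << 'EOF'
results:
  accuracy: 0.81
EOF

for f in "$RESULTS_DIR"/abc123/*/artifacts/results.yml; do
  task=$(basename "$(dirname "$(dirname "$f")")")        # task directory name
  score=$(grep -m1 'accuracy:' "$f" | awk '{print $2}')  # first accuracy value
  echo "$task: $score"
done
```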

### Workflow 2: Slurm HPC Evaluation

```yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_

execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8

deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
```

### Workflow 3: Model Comparison

```bash
# Same config, different models
nemo-evaluator-launcher run --config-dir . --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct

nemo-evaluator-launcher run --config-dir . --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3

# Export results
nemo-evaluator-launcher export <id> --dest mlflow
nemo-evaluator-launcher export <id> --dest wandb
```
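For a quick side-by-side view without a tracking backend, extracted scores can be joined into a table. A sketch with mocked score files; in practice the values would come from each invocation's `results.yml` (see Workflow 1):

```bash
# Join per-model score files (task score) into one comparison table.
# Scores below are mocked; extract real ones from each run's results.yml.
WORK=$(mktemp -d)
printf 'gsm8k 0.74\nifeval 0.81\n' > "$WORK/llama.txt"    # model A, sorted by task
printf 'gsm8k 0.70\nifeval 0.78\n' > "$WORK/mistral.txt"  # model B, sorted by task

# join requires sorted input; output columns: task, model A, model B
join "$WORK/llama.txt" "$WORK/mistral.txt" |
  awk 'BEGIN { printf "%-10s %-8s %-8s\n", "task", "llama", "mistral" }
             { printf "%-10s %-8s %-8s\n", $1, $2, $3 }'
```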

## Supported Harnesses

| Harness | Tasks | Categories |
|---------|-------|------------|
| lm-evaluation-harness | 60+ | MMLU, GSM8K, HellaSwag, ARC |
| simple-evals | 20+ | GPQA, MATH, AIME |
| bigcode-evaluation-harness | 25+ | HumanEval, MBPP, MultiPL-E |
| safety-harness | 3 | Aegis, WildGuard |
| vlmevalkit | 6+ | OCRBench, ChartQA, MMMU |
| bfcl | 6 | Function calling v2/v3 |

## CLI Reference

| Command | Description |
|---------|-------------|
| `run` | Execute evaluation with config |
| `status <id>` | Check job status |
| `ls tasks` | List available benchmarks |
| `ls runs` | List all invocations |
| `export <id>` | Export results (mlflow/wandb/local) |
| `kill <id>` | Terminate running job |

## When to Use vs Alternatives

**Use NeMo Evaluator when:**

- Need 100+ benchmarks from 18+ harnesses
- Running on Slurm HPC clusters
- Requiring reproducible containerized evaluation
- Evaluating against OpenAI-compatible APIs

**Use alternatives instead:**

- **lm-evaluation-harness**: Simpler local evaluation
- **bigcode-evaluation-harness**: Code-only benchmarks
- **HELM**: Broader evaluation (fairness, efficiency)

## Common Issues

**Container pull fails**: Configure NGC credentials

```bash
echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin
```

**Task requires env var**: Add to task config

```yaml
tasks:
  - name: gpqa_diamond
    env_vars:
      HF_TOKEN: HF_TOKEN
```

**Tune parallelism or cap samples**: Append `-o` overrides to the run command

```bash
nemo-evaluator-launcher run --config-dir . --config-name config \
  -o +evaluation.nemo_evaluator_config.config.params.parallelism=8 \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=100
```

## Requirements

- Python 3.10-3.13
- Docker (for local execution)
- NGC API Key (for NVIDIA Build)
- HF_TOKEN (for some benchmarks)
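The requirements above can be checked with a small preflight script; the env var names are taken from this document, and the script is a sketch, not part of the launcher:

```bash
# Preflight: report whether required tools and env vars are present.
for tool in docker python3; do
  if command -v "$tool" >/dev/null 2>&1; then echo "$tool: ok"; else echo "$tool: missing"; fi
done
for var in NGC_API_KEY HF_TOKEN; do
  if [ -n "$(printenv "$var")" ]; then echo "$var: set"; else echo "$var: not set"; fi
done
```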

## Overview

This skill helps evaluate large language models across 100+ benchmarks from 18+ harnesses using a reproducible, containerized pipeline. It supports multi-backend execution (local Docker, Slurm HPC, cloud) and exports results to tracking systems like MLflow or Weights & Biases. Use it to run standard benchmarks (MMLU, GSM8K, HumanEval) or build repeatable evaluation workflows for model comparison.

## How this skill works

You provide a YAML config that defines execution backend, target API or self-hosted endpoint, and a list of evaluation tasks. The launcher runs containerized harnesses, collects artifacts and results per invocation, and exposes CLI commands to run, monitor, list tasks/runs, export results, or kill jobs. Slurm and deployment options let you scale to HPC clusters and multi-GPU setups.

## When to use it

- Benchmark new or fine-tuned LLMs against a broad suite (MMLU, GSM8K, HumanEval).
- Compare multiple models with the same evaluation config for apples-to-apples results.
- Run reproducible evaluations on Slurm HPC or local Docker with containerized harnesses.
- Automate evaluation pipelines and export metrics to MLflow or W&B for tracking.
- Validate model behavior against instruction-following, reasoning, and code tasks.

## Best practices

- Keep a single source config and vary only `target.api_endpoint.model_id` to compare models consistently.
- Provide required env vars (`HF_TOKEN`, `NGC_API_KEY`) per task to avoid missing-data failures.
- Pin `limit_samples` and `parallelism` in the config for predictable resource usage and repeatable runtimes.
- Configure container registry credentials (nvcr.io) and check network permissions before large runs.
- Export results to a tracking backend immediately after runs for long-term storage and analysis.

## Example use cases

- Run a baseline evaluation: MMLU + GSM8K + HumanEval on a hosted API endpoint to measure general capabilities.
- Scale a math and code benchmark sweep on Slurm with vLLM deployment across multiple GPUs.
- Compare two model checkpoints with identical evaluation config and export results to MLflow.
- Integrate into CI: run a lightweight selection of benchmarks on every model push to catch regressions.
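The CI idea can be sketched as a hypothetical GitHub Actions job; the workflow name, runner, and secret wiring are assumptions, and the `limit_samples` override (from Common Issues above) keeps the run cheap:

```yaml
# Hypothetical CI job -- runner, trigger, and secret name are assumptions.
name: model-eval
on: [push]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install nemo-evaluator-launcher
      - run: |
          nemo-evaluator-launcher run --config-dir . --config-name config \
            -o +evaluation.nemo_evaluator_config.config.params.limit_samples=100
        env:
          NGC_API_KEY: ${{ secrets.NGC_API_KEY }}
```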

## FAQ

**What backends are supported?**

Local Docker, Slurm HPC, and Lepton cloud; both self-hosted endpoints (vLLM, TRT-LLM) and NVIDIA Build APIs work as evaluation targets.

**How do I export results to tracking tools?**

Use the export CLI: `nemo-evaluator-launcher export <invocation_id> --dest mlflow` (or `--dest wandb`), or read artifacts locally from `results/<invocation_id>`.