NeMo Evaluator SDK - Enterprise LLM Benchmarking

Best use case

NeMo Evaluator SDK - Enterprise LLM Benchmarking is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Benefits

Teams using NeMo Evaluator SDK - Enterprise LLM Benchmarking should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/nemo-evaluator/SKILL.md --create-dirs "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/Orchestra-Research/AI-Research-SKILLs/nemo-evaluator/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/nemo-evaluator/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill
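
For the manual route, a minimal shell sketch covering steps 1-2 (same raw GitHub URL as the one-liner above):

```bash
# Create the skills directory inside your project and fetch SKILL.md
mkdir -p .claude/skills/nemo-evaluator
curl -fsSL \
  "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/Orchestra-Research/AI-Research-SKILLs/nemo-evaluator/SKILL.md" \
  -o .claude/skills/nemo-evaluator/SKILL.md
```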

How NeMo Evaluator SDK - Enterprise LLM Benchmarking Compares

| Feature / Agent | NeMo Evaluator SDK - Enterprise LLM Benchmarking | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# NeMo Evaluator SDK - Enterprise LLM Benchmarking

## Quick Start

NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).

**Installation**:
```bash
pip install nemo-evaluator-launcher
```

**Set API key and run evaluation**:
```bash
export NGC_API_KEY=nvapi-your-key-here

# Create minimal config
cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
EOF

# Run evaluation
nemo-evaluator-launcher run --config-dir . --config-name config
```

**View available tasks**:
```bash
nemo-evaluator-launcher ls tasks
```
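
The full task list is long. Assuming plain-text output (an assumption, not verified against the launcher), standard shell filters help narrow it down:

```bash
# Filter the task listing for a benchmark family
nemo-evaluator-launcher ls tasks | grep -i mmlu

# Count how many tasks are listed
nemo-evaluator-launcher ls tasks | wc -l
```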

## Common Workflows

### Workflow 1: Evaluate Model on Standard Benchmarks

Run core academic benchmarks (MMLU, GSM8K, IFEval) on any OpenAI-compatible endpoint.

**Checklist**:
```
Standard Evaluation:
- [ ] Step 1: Configure API endpoint
- [ ] Step 2: Select benchmarks
- [ ] Step 3: Run evaluation
- [ ] Step 4: Check results
```

**Step 1: Configure API endpoint**

```yaml
# config.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
```

For self-hosted endpoints (vLLM, TRT-LLM):
```yaml
target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""  # No key needed for local
```
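
If you still need to stand up that local endpoint, one common option is vLLM's OpenAI-compatible server. This is a sketch only: vLLM is installed separately, the model name is an example, and the exact flags depend on your vLLM version.

```bash
# Serve a model behind an OpenAI-compatible endpoint on port 8000 (example model)
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```

The `model_id` in your config should then match the name the server registers for the model (typically the model path passed to `vllm serve`, unless overridden).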

**Step 2: Select benchmarks**

Add tasks to your config:
```yaml
evaluation:
  tasks:
    - name: ifeval           # Instruction following
    - name: gpqa_diamond     # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN   # Some tasks need HF token
    - name: gsm8k_cot_instruct  # Math reasoning
    - name: humaneval        # Code generation
```

**Step 3: Run evaluation**

```bash
# Run with config file
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config

# Override output directory
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o execution.output_dir=./my_results

# Limit samples for quick testing
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10
```

**Step 4: Check results**

```bash
# Check job status
nemo-evaluator-launcher status <invocation_id>

# List all runs
nemo-evaluator-launcher ls runs

# View results
cat results/<invocation_id>/<task>/artifacts/results.yml
```
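
To skim every task's result file from one run in a single pass, a small loop over the artifacts layout shown above works (`<invocation_id>` is a placeholder, as elsewhere in this guide):

```bash
# Print each task's results.yml under one invocation
for f in results/<invocation_id>/*/artifacts/results.yml; do
  echo "=== $f ==="
  cat "$f"
done
```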

### Workflow 2: Run Evaluation on Slurm HPC Cluster

Execute large-scale evaluation on HPC infrastructure.

**Checklist**:
```
Slurm Evaluation:
- [ ] Step 1: Configure Slurm settings
- [ ] Step 2: Set up model deployment
- [ ] Step 3: Launch evaluation
- [ ] Step 4: Monitor job status
```

**Step 1: Configure Slurm settings**

```yaml
# slurm_config.yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_

execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8
```

**Step 2: Set up model deployment**

```yaml
deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
  max_model_len: 4096

target:
  api_endpoint:
    model_id: llama-3.1-8b
    # URL auto-generated by deployment
```

**Step 3: Launch evaluation**

```bash
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name slurm_config
```

**Step 4: Monitor job status**

```bash
# Check status (queries sacct)
nemo-evaluator-launcher status <invocation_id>

# View detailed info
nemo-evaluator-launcher info <invocation_id>

# Kill if needed
nemo-evaluator-launcher kill <invocation_id>
```

### Workflow 3: Compare Multiple Models

Benchmark multiple models on the same tasks for comparison.

**Checklist**:
```
Model Comparison:
- [ ] Step 1: Create base config
- [ ] Step 2: Run evaluations with overrides
- [ ] Step 3: Export and compare results
```

**Step 1: Create base config**

```yaml
# base_eval.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./comparison_results

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.01
        parallelism: 4
  tasks:
    - name: mmlu_pro
    - name: gsm8k_cot_instruct
    - name: ifeval
```

**Step 2: Run evaluations with model overrides**

```bash
# Evaluate Llama 3.1 8B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions

# Evaluate Mistral 7B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3 \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
```

**Step 3: Export and compare**

```bash
# Export to MLflow
nemo-evaluator-launcher export <invocation_id_1> --dest mlflow
nemo-evaluator-launcher export <invocation_id_2> --dest mlflow

# Export to local JSON
nemo-evaluator-launcher export <invocation_id> --dest local --format json

# Export to Weights & Biases
nemo-evaluator-launcher export <invocation_id> --dest wandb
```

### Workflow 4: Safety and Vision-Language Evaluation

Evaluate models on safety benchmarks and VLM tasks.

**Checklist**:
```
Safety/VLM Evaluation:
- [ ] Step 1: Configure safety tasks
- [ ] Step 2: Set up VLM tasks (if applicable)
- [ ] Step 3: Run evaluation
```

**Step 1: Configure safety tasks**

```yaml
evaluation:
  tasks:
    - name: aegis              # Safety harness
    - name: wildguard          # Safety classification
    - name: garak              # Security probing
```

**Step 2: Configure VLM tasks**

```yaml
# For vision-language models
target:
  api_endpoint:
    type: vlm  # Vision-language endpoint
    model_id: nvidia/llama-3.2-90b-vision-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions

evaluation:
  tasks:
    - name: ocrbench           # OCR evaluation
    - name: chartqa            # Chart understanding
    - name: mmmu               # Multimodal understanding
```

## When to Use vs Alternatives

**Use NeMo Evaluator when:**
- You need **100+ benchmarks** from 18+ harnesses in one platform
- You run evaluations on **Slurm HPC clusters** or in the cloud
- You require **reproducible**, containerized evaluation
- You evaluate against **OpenAI-compatible APIs** (vLLM, TRT-LLM, NIMs)
- You need **enterprise-grade** evaluation with result export (MLflow, W&B)

**Use alternatives instead:**
- **lm-evaluation-harness**: Simpler setup for quick local evaluation
- **bigcode-evaluation-harness**: Focused only on code benchmarks
- **HELM**: Stanford's broader evaluation (fairness, efficiency)
- **Custom scripts**: Highly specialized domain evaluation

## Supported Harnesses and Tasks

| Harness | Task Count | Categories |
|---------|-----------|------------|
| `lm-evaluation-harness` | 60+ | MMLU, GSM8K, HellaSwag, ARC |
| `simple-evals` | 20+ | GPQA, MATH, AIME |
| `bigcode-evaluation-harness` | 25+ | HumanEval, MBPP, MultiPL-E |
| `safety-harness` | 3 | Aegis, WildGuard |
| `garak` | 1 | Security probing |
| `vlmevalkit` | 6+ | OCRBench, ChartQA, MMMU |
| `bfcl` | 6 | Function calling v2/v3 |
| `mtbench` | 2 | Multi-turn conversation |
| `livecodebench` | 10+ | Live coding evaluation |
| `helm` | 15 | Medical domain |
| `nemo-skills` | 8 | Math, science, agentic |

## Common Issues

**Issue: Container pull fails**

Ensure NGC credentials are configured:
```bash
docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY
```

**Issue: Task requires environment variable**

Some tasks need HF_TOKEN or JUDGE_API_KEY:
```yaml
evaluation:
  tasks:
    - name: gpqa_diamond
      env_vars:
        HF_TOKEN: HF_TOKEN  # Maps the task's required variable to your local env var name
```

**Issue: Evaluation timeout**

Increase parallelism or reduce samples:
```bash
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=100
```

**Issue: Slurm job not starting**

Check Slurm account and partition:
```yaml
execution:
  account: correct_account
  partition: gpu
  qos: normal  # May need specific QOS
```

**Issue: Different results than expected**

Verify configuration matches reported settings:
```yaml
evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0  # Deterministic
        num_fewshot: 5    # Check paper's fewshot count
```

## CLI Reference

| Command | Description |
|---------|-------------|
| `run` | Execute evaluation with config |
| `status <id>` | Check job status |
| `info <id>` | View detailed job info |
| `ls tasks` | List available benchmarks |
| `ls runs` | List all invocations |
| `export <id>` | Export results (mlflow/wandb/local) |
| `kill <id>` | Terminate running job |
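
A typical lifecycle strings these commands together, using the invocation ID of the run you launched:

```bash
# Launch, monitor, inspect, and export a run
nemo-evaluator-launcher run --config-dir . --config-name config
nemo-evaluator-launcher status <invocation_id>
nemo-evaluator-launcher info <invocation_id>
nemo-evaluator-launcher export <invocation_id> --dest local --format json

# Terminate the run if something goes wrong
nemo-evaluator-launcher kill <invocation_id>
```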

## Configuration Override Examples

```bash
# Override model endpoint
-o target.api_endpoint.model_id=my-model
-o target.api_endpoint.url=http://localhost:8000/v1/chat/completions

# Add evaluation parameters
-o +evaluation.nemo_evaluator_config.config.params.temperature=0.5
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=50

# Change execution settings
-o execution.output_dir=/custom/path
-o execution.mode=parallel

# Dynamically set tasks
-o 'evaluation.tasks=[{name: ifeval}, {name: gsm8k}]'
```
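
These overrides compose on a single `run` invocation. For example, a quick smoke test against a local endpoint (the model name and paths are illustrative):

```bash
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o target.api_endpoint.model_id=my-model \
  -o target.api_endpoint.url=http://localhost:8000/v1/chat/completions \
  -o execution.output_dir=./smoke_results \
  -o +evaluation.nemo_evaluator_config.config.params.temperature=0.0 \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10
```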

## Python API Usage

For programmatic evaluation without the CLI:

```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ApiEndpoint,
    EndpointType,
    ConfigParams
)

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=4
    )
)

# Configure target endpoint
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        model_id="meta/llama-3.1-8b-instruct",
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        api_key="nvapi-your-key-here"
    )
)

# Run evaluation
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
```

## Advanced Topics

- **Multi-backend execution**: See [references/execution-backends.md](references/execution-backends.md)
- **Configuration deep-dive**: See [references/configuration.md](references/configuration.md)
- **Adapter and interceptor system**: See [references/adapter-system.md](references/adapter-system.md)
- **Custom benchmark integration**: See [references/custom-benchmarks.md](references/custom-benchmarks.md)

## Requirements

- **Python**: 3.10-3.13
- **Docker**: Required for local execution
- **NGC API Key**: For pulling containers and using NVIDIA Build
- **HF_TOKEN**: Required for some benchmarks (GPQA, MMLU)
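
A minimal environment check before the first run (keys shown are placeholders; substitute your own):

```bash
# Credentials used by the launcher and by gated benchmarks
export NGC_API_KEY=nvapi-your-key-here
export HF_TOKEN=hf-your-token-here

# Authenticate Docker against NGC so evaluation containers can be pulled
docker login nvcr.io -u '$oauthtoken' -p "$NGC_API_KEY"

# Sanity checks
python --version   # expect 3.10-3.13
docker --version
```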

## Resources

- **GitHub**: https://github.com/NVIDIA-NeMo/Evaluator
- **NGC Containers**: nvcr.io/nvidia/eval-factory/
- **NVIDIA Build**: https://build.nvidia.com (free hosted models)
- **Documentation**: https://github.com/NVIDIA-NeMo/Evaluator/tree/main/docs
