evaluation

Evaluates accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Triggers on "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "accuracy drop", "run nel". Handles deployment, config generation, and evaluation execution. Not for quantizing models (use ptq) or deploying/serving models (use deployment).

---
name: evaluation
description: Evaluates accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Triggers on "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "accuracy drop", "run nel". Handles deployment, config generation, and evaluation execution. Not for quantizing models (use ptq) or deploying/serving models (use deployment).
license: Apache-2.0
# Based on nel-assistant skill from NeMo Evaluator Launcher (commit f1fa073)
# https://github.com/NVIDIA-NeMo/Evaluator/tree/f1fa073/packages/nemo-evaluator-launcher/.claude/skills/nel-assistant
# Modifications: renamed to evaluation, added workspace management (Step 0),
# auto-detect ModelOpt quantization format, quantization-aware benchmark defaults.
---

## NeMo Evaluator Launcher Assistant

You're an expert in NeMo Evaluator Launcher! Guide the user through creating production-ready YAML configurations, running evaluations, and monitoring progress, following the interactive workflow specified below.

### Workspace and Pipeline Integration

If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Check for existing workspaces — especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications.
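
As a rough illustration of the workspace check, the sketch below lists candidate workspaces under `MODELOPT_WORKSPACE_ROOT`. The actual layout is defined in `skills/common/workspace-management.md`; treating each immediate subdirectory as one workspace is an assumption made here, and `list_workspaces` is a hypothetical helper name.

```python
import os
from pathlib import Path


def list_workspaces(env=None):
    """Return existing workspace directories under MODELOPT_WORKSPACE_ROOT.

    Returns an empty list when the variable is unset or the root is missing.
    Assumption: each immediate subdirectory of the root is one workspace.
    """
    env = os.environ if env is None else env
    root = env.get("MODELOPT_WORKSPACE_ROOT")
    if not root:
        return []
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    # Plain files are ignored; only directories count as workspaces.
    return sorted(p for p in root_path.iterdir() if p.is_dir())


if __name__ == "__main__":
    for workspace in list_workspaces():
        print(workspace)
```

An empty result would suggest there is no prior PTQ or deployment workspace to reuse.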

This skill is often the final stage of the PTQ → Deploy → Eval pipeline. If the model required runtime patches during deployment (transformers upgrade, framework source fixes), carry those patches into the NEL config via `deployment.command`.

### Workflow

```text
Config Generation Progress:
- [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT is set)
- [ ] Step 1: Check if nel is installed and if user has existing config
- [ ] Step 2: Build the base config file
- [ ] Step 3: Configure model path and parameters
- [ ] Step 4: Fill in remaining missing values
- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Advanced - Multi-node (Data Parallel)
- [ ] Step 7: Advanced - Interceptors
- [ ] Step 7.5: Check container registry auth (SLURM only)
- [ ] Step 8: Run the evaluation
```

**Step 1: Check prerequisites**

Verify that `nel` is installed by running `nel --version`. If it is not, instruct the user to run `pip install nemo-evaluator-launcher`.
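
The same check can be scripted. The sketch below only assumes that `nel --version` exits with status 0 when the launcher is installed, as the step above implies; `nel_version` is a hypothetical helper name.

```python
import shutil
import subprocess


def nel_version(executable="nel"):
    """Return the output of `<executable> --version`, or None if unavailable."""
    if shutil.which(executable) is None:
        return None
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return None
    # Some CLIs print the version to stderr, so fall back to it.
    return result.stdout.strip() or result.stderr.strip()


if __name__ == "__main__":
    version = nel_version()
    if version is None:
        print("nel not found; install it with: pip install nemo-evaluator-launcher")
    else:
        print(f"nel is installed: {version}")
```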

If the user already has a config file (e.g., "run this config", "evaluate with my-config.yaml"), skip to Step 8. Optionally review the config for common issues (unfilled `???` values, quantization flags) before running.
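
One way to spot unfilled `???` values is a plain text scan of the config, sketched below. It uses only the standard library and does not parse YAML, so it catches simple `key: ???` lines only. The key names in the sample are hypothetical.

```python
import re

# Matches a YAML line whose value is the bare `???` placeholder,
# optionally followed by a comment, e.g. `output_dir: ???  # fill in`.
PLACEHOLDER = re.compile(r"^\s*(?:-\s*)?([\w.\-]+)\s*:\s*\?\?\?\s*(?:#.*)?$")


def find_unfilled(config_text):
    """Return (line_number, key) pairs for lines still set to `???`."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        match = PLACEHOLDER.match(line)
        if match:
            hits.append((number, match.group(1)))
    return hits


# Hypothetical config fragment used only to exercise the scan.
sample = (
    "execution:\n"
    "  output_dir: ???\n"
    "deployment:\n"
    "  checkpoint_path: /models/example\n"
    "  served_model_name: ???  # fill in\n"
)
for line_number, key in find_unfilled(sample):
    print(f"line {line_number}: {key} is still ???")
```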

**Shortcut: use pre-built task snippets.** If the user asks for a specific benchmark (e.g., "run MMLU-Pro", "evaluate with AIME"), check `recipes/tasks/` (relative to this skill's directory) for a matching task snippet. Available: mmlu_pro, gpqa, aime2025, livecodebench, ifbench, scicode. Task snippets contain only the task-specific config (name, params, repeats) — not the full NEL config. To use them:

1. Read the task snippet(s) the user wants
2. Use `recipes/examples/example_eval.yaml` as the base config template
3. Replace the `tasks:` section with the selected snippet(s)
4. Do Step 3 (auto-detect model settings from checkpoint) and Step 4 (fill in `???` values)
5. Proceed to Step 7.5/8
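
The snippet substitution in steps 1 to 3 can be sketched on already-parsed YAML, shown here as plain dictionaries. That the task list lives under `evaluation.tasks` is an assumption; check `recipes/examples/example_eval.yaml` for the real location. The dictionary contents are hypothetical stand-ins for the recipe files.

```python
import copy


def replace_tasks(base_config, task_snippets):
    """Return a copy of base_config whose task list is exactly the snippets.

    Assumption: tasks are stored as a list under evaluation.tasks.
    """
    merged = copy.deepcopy(base_config)
    merged.setdefault("evaluation", {})["tasks"] = [
        copy.deepcopy(snippet) for snippet in task_snippets
    ]
    return merged


# Hypothetical stand-ins for the parsed base template and two task snippets.
base = {
    "deployment": {"checkpoint_path": "???"},
    "evaluation": {"tasks": [{"name": "placeholder_task"}]},
}
mmlu_pro = {"name": "mmlu_pro"}
gpqa = {"name": "gpqa"}

config = replace_tasks(base, [mmlu_pro, gpqa])
print([task["name"] for task in config["evaluation"]["tasks"]])
```

Copying rather than mutating keeps the base template reusable, and the remaining `???` values are left untouched for Steps 3 and 4.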

**Step 2: Build the base config file**

Prompt the user with "I'll ask you 5 questions to build the base config we'll adjust in the next steps". Guide the user through the 5 questions using AskUserQuestion:

1. Execution:

- Local
- SLURM

2. Deployment:

- None (External)
- vLLM
- SGLang
- NIM
- TRT-LLM

3. Auto-export:

- None (auto-export disabled)
- MLflow
- wandb

4. Model type:

- Base
- Chat
- Reasoning

5. Benchmarks:
   Allow for multiple choices in this question.
   1. Standard LLM Benchmarks (like MMLU, IFEval, GSM8K, ...)
   2. Code Evaluation (like HumanEval, MBPP, and LiveCodeBench)
   3. Math & Reasoning (like AIME, GPQA, MATH-500, ...)
   4. Safety & Security (like Garak and Safety Harness)
   5. Multilingual (like MMATH, Global MMLU, MMLU-ProX)
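
To make the five questions concrete, the sketch below validates a set of answers against the options listed above. The dictionary keys (`execution`, `auto_export`, and so on) are hypothetical names chosen for illustration, and requiring at least one benchmark category is an assumption.

```python
# Allowed answers per single-choice question, copied from the lists above.
SINGLE_CHOICE = {
    "execution": ["Local", "SLURM"],
    "deployment": ["None (External)", "vLLM", "SGLang", "NIM", "TRT-LLM"],
    "auto_export": ["None (auto-export disabled)", "MLflow", "wandb"],
    "model_type": ["Base", "Chat", "Reasoning"],
}
# The benchmarks question allows several of these categories at once.
BENCHMARK_CATEGORIES = [
    "Standard LLM Benchmarks",
    "Code Evaluation",
    "Math & Reasoning",
    "Safety & Security",
    "Multilingual",
]


def validate_answers(answers):
    """Return a list of problems; an empty list means the answers are usable."""
    problems = []
    for question, allowed in SINGLE_CHOICE.items():
        value = answers.get(question)
        if value not in allowed:
            problems.append(f"{question}: {value!r} is not one of {allowed}")
    benchmarks = answers.get("benchmarks") or []
    if not benchmarks:
        # Assumption: at least one category must be selected.
        problems.append("benchmarks: pick at least one category")
    for category in benchmarks:
        if category not in BENCHMARK_CATEGORIES:
            problems.append(f"benchmarks: unknown category {category!r}")
    return problems


example = {
    "execution": "Local",
    "deployment": "vLLM",
    "auto_export": "None (auto-export disabled)",
    "model_type": "Chat",
    "benchmarks": ["Standard LLM Benchmarks", "Math & Reasoning"],
}
print(validate_answers(example))
```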

Only accept options from the categori

…