ptq

General↓ 0 installsUpdated 64d ago
VerifiedCuratedNVIDIA
This skill should be used when the user asks to "quantize a model", "run PTQ", "post-training quantization", "NVFP4 quantization", "FP8 quantization", "INT8 quantization", "INT4 AWQ", "quantize LLM", "quantize MoE", "quantize VLM", or needs to produce a quantized HuggingFace or TensorRT-LLM checkpoint from a pretrained model using ModelOpt.
SKILL.md preview

---
name: ptq
description: This skill should be used when the user asks to "quantize a model", "run PTQ", "post-training quantization", "NVFP4 quantization", "FP8 quantization", "INT8 quantization", "INT4 AWQ", "quantize LLM", "quantize MoE", "quantize VLM", or needs to produce a quantized HuggingFace or TensorRT-LLM checkpoint from a pretrained model using ModelOpt.
---

# ModelOpt Post-Training Quantization

Produce a quantized checkpoint from a pretrained model. **Read `examples/llm_ptq/README.md` first** — it has the support matrix, CLI flags, and accuracy guidance.

## Step 1 — Environment

Read `skills/common/environment-setup.md` and `skills/common/workspace-management.md`. After completing them you should know:

- ModelOpt source is available
- Local or remote (+ cluster config if remote)
- SLURM / Docker+GPU / bare GPU
- Launcher available?
- Which workspace to use

## Step 2 — Is the model supported?

Check the support table in `examples/llm_ptq/README.md` for verified HF models.

- **Listed** → supported, use `hf_ptq.py` (step 4A/4B)
- **Not listed** → read `references/unsupported-models.md` to determine if `hf_ptq.py` can still work or if a custom script is needed (step 4C)

## Step 2.5 — Check for model-specific dependencies

If the model uses `trust_remote_code` (check `config.json` for `auto_map`), inspect its custom Python files for imports not present in the container:

```bash
grep -h "^from \|^import " <model_path>/modeling_*.py | sort -u
```

**Known dependency patterns:**

| Import found | Packages to install |
| --- | --- |
| `from mamba_ssm` / `from causal_conv1d` | `mamba-ssm causal-conv1d` (Mamba/hybrid models: NemotronH, Jamba) |

If extra deps are needed:
- **Launcher (4B)**: set `EXTRA_PIP_DEPS` in the task's `environment` section — `ptq.sh` installs them automatically
- **Manual (4A)**: `unset PIP_CONSTRAINT && pip install <deps>` before running `hf_ptq.py`

## Step 3 — Choose quantization format

**First**, check for a model-specific recipe:

```bash
ls modelopt_recipes/models/ 2>/dev/null
```

If a model-specific recipe exists, use `--recipe <path>` — it may contain tuned settings.

**If no model-specific recipe**, choose a format based on GPU (details in `examples/llm_ptq/README.md`):

- **Blackwell** (B100/B200/GB200): `nvfp4` variants
- **Hopper** (H100/H200) or older: `fp8` or `int4_awq`

Use `--qformat <name>` (e.g., `--qformat nvfp4`). Format definitions: `modelopt/torch/quantization/config.py`. General PTQ recipes in `modelopt_recipes/general/ptq/` correspond to the same formats — `--qformat` is the simpler way to use them.

> NVFP4 can be calibrated on Hopper but requires Blackwell for inference.

## Step 4 — Run PTQ

**Goal: checkpoint on disk** (`.safetensors` + `config.json`).

For **listed models** (4A/4B): run full calibration directly (`--calib_size 512`).
For **unlisted models** (4C): run a smoke test first (`--calib_size 4`), wait for success, then full calibration.

### Which path?

```text
In README table? ─→ YES ──→ SLURM (local or remote)? ──→ LAUNCHER (4B)
                  │          Local Docker + GPU? ────────→ LAUNCHER (4B)
                  │          Remote Docker (no SLURM)? ──→ MANUAL (4A)
                  │          Bare GPU (local or remote)? → MANUAL (4A)
                  │
                  └→ NOT LISTED ──→ UNLISTED MODEL (4C)
```

### 4A — Direct: supported model, manual execution

```bash
pip install --no-build-isolation "nvidia-modelopt[hf]"
pip install -r examples/llm_ptq/requirements.txt

python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path <model> \
    --qformat <format> \
    --calib_size 512 \
    --export_path <output>
```

Run `--help` for all options.

For remote: use `remote_run` from `remote_exec.sh` (see `skills/common/remote-execution.md`).

### 4B — Launcher: supported model on SLURM or local Docker

Write a YAML config using `common/hf_ptq/hf_ptq.sh`. See `references/launcher-guide.md` for the full template.

```bash
cd tools/launcher
# SLU

…