multi-node-slurm

General↓ 0 installsUpdated 65d ago
VerifiedCuratedNVIDIA
Convert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures. Covers srun-native vs uv run torch.distributed approaches, container setup, NCCL timeouts, OOM sizing for MoE models, and interactive allocation.
SKILL.md preview

---
name: multi-node-slurm
description: Convert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures. Covers srun-native vs uv run torch.distributed approaches, container setup, NCCL timeouts, OOM sizing for MoE models, and interactive allocation.
when_to_use: Writing or converting Slurm sbatch scripts, scaling to multiple nodes, debugging NCCL/launch failures, or investigating a commit that caused multi-node training failures; 'run on multiple nodes', 'sbatch script', 'NCCL timeout', 'multi-node OOM'.
---

# Multi-Node Slurm

Convert single-node `uv run python -m torch.distributed.run` commands into multi-node Slurm sbatch scripts with Enroot container support, and debug common multi-node failures.

## Two Approaches: srun-native vs uv run torch.distributed

| Approach | `ntasks-per-node` | Process spawning | Best for |
|---|---|---|---|
| **srun-native** (preferred) | 8 | Slurm spawns 8 tasks/node | Conversion, inference, Bridge scripts |
| **uv run torch.distributed** (legacy) | 1 | `uv run python -m torch.distributed.run` spawns 8 procs/node | MLM pretrain_gpt.py |

**Prefer srun-native** — simpler, avoids shell escaping issues with TRAIN_CMD. Megatron Bridge auto-derives `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, `MASTER_PORT` from SLURM env vars (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`, `SLURM_NODELIST`) via `common_utils.py` helpers called during `initialize.py` distributed init, so you never need to set them manually.

## Cluster Environment

### Container

```bash
CONTAINER_IMAGE="<PATH_TO_YOUR_CONTAINER>.sqsh"
CONTAINER_MOUNTS="<SHARED_FS>:<SHARED_FS>,<PATH_TO_MEGATRON_BRIDGE>:/opt/Megatron-Bridge,<PATH_TO_DATA>:/opt/data"
```

### Standard Paths

```bash
WORKDIR="/opt/Megatron-Bridge"
DATA_PATH="<PATH_TO_PREPROCESSED_DATA>/dclm_01_01_text_document"
```

### Tokens / Caches

```bash
export GH_TOKEN=<YOUR_GITHUB_TOKEN>
export HF_TOKEN=<YOUR_HF_TOKEN>
export HF_HOME=<SHARED_FS>/HF_HOME
export UV_CACHE_DIR="<SHARED_FS>/uv_cache"
export NEMO_HOME="<SHARED_FS>/cache/nemo"
```

**Important**: `NEMO_HOME` must point to a shared filesystem (e.g. Lustre) for multi-node SFT/PEFT jobs.
The default (`/root/.cache/nemo`) is container-local and not shared across nodes.
Without this, packed-sequence data files prepared on node 0 are invisible to other
nodes, causing `TypeError: 'NoneType' object is not an iterator`.

### Log Directory

```text
<SHARED_FS>/logs/<job_name>_<suffix>
```

## srun-native Approach (Preferred)

Slurm spawns all processes directly. No `torch.distributed.run`, no TRAIN_CMD escaping.

### SBATCH Headers

```bash
#SBATCH --job-name=<model>-<task>
#SBATCH --nodes=<NNODES>
#SBATCH --ntasks-per-node=8          # Slurm spawns 8 tasks per node
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00
#SBATCH --account=<YOUR_ACCOUNT>
#SBATCH --partition=batch
#SBATCH --output=<SHARED_FS>/logs/<job_name>_%j.log
#SBATCH --exclusive
```

### Build and Launch

Two-phase srun: first a single-process srun to populate the uv cache, then the full multi-node srun.

```bash
# Env exports at sbatch level (before srun)
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
export NCCL_NVLS_ENABLE=0

# Phase 1: Single-process uv sync to build/populate the shared cache
srun --mpi=pmix -N 1 --ntasks=1 \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd $WORKDIR && uv sync"

# Phase 2: Full multi-node run (uv sync is a fast no-op since cache is warm)
srun --mpi=pmix \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd $WORKDIR && uv sync && uv run --no-sync python <script.py> <args>"
```

### srun-native Key Points

- Phase 1 runs `uv sync` once on a single node/process, building all wheels into the shared cache on Lustre
- Phase 2's `uv sync` is a fast no-op (everything is cached) — safe to run on all ranks without sleep guards
- `initialize.py` + 

…