multi-node-slurm
Convert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures. Covers srun-native vs uv run torch.distributed approaches, container setup, NCCL timeouts, OOM sizing for MoE models, and interactive allocation.
SKILL.md preview
---
name: multi-node-slurm
description: Convert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures. Covers srun-native vs uv run torch.distributed approaches, container setup, NCCL timeouts, OOM sizing for MoE models, and interactive allocation.
when_to_use: Writing or converting Slurm sbatch scripts, scaling to multiple nodes, debugging NCCL/launch failures, or investigating a commit that caused multi-node training failures; 'run on multiple nodes', 'sbatch script', 'NCCL timeout', 'multi-node OOM'.
---

# Multi-Node Slurm

Convert single-node `uv run python -m torch.distributed.run` commands into multi-node Slurm sbatch scripts with Enroot container support, and debug common multi-node failures.

## Two Approaches: srun-native vs uv run torch.distributed

| Approach | `ntasks-per-node` | Process spawning | Best for |
|---|---|---|---|
| **srun-native** (preferred) | 8 | Slurm spawns 8 tasks/node | Conversion, inference, Bridge scripts |
| **uv run torch.distributed** (legacy) | 1 | `uv run python -m torch.distributed.run` spawns 8 procs/node | MLM pretrain_gpt.py |

**Prefer srun-native**: it is simpler and avoids shell escaping issues with TRAIN_CMD. Megatron Bridge auto-derives `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, `MASTER_PORT` from SLURM env vars (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`, `SLURM_NODELIST`) via `common_utils.py` helpers called during `initialize.py` distributed init, so you never need to set them manually.
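The variable mapping described above can be sketched in shell for debugging or for scripts that do not go through Megatron Bridge. This is a minimal illustration, not the actual `common_utils.py` logic: `derive_dist_env` is a hypothetical helper name, the `MASTER_PORT` default of 29500 is an assumption, and the fallback used when `scontrol` is unavailable only handles plain comma-separated node lists.

```bash
#!/usr/bin/env bash
# Illustrative sketch: map Slurm task variables to torch.distributed variables.
# derive_dist_env is a hypothetical helper, not part of Megatron Bridge.
derive_dist_env() {
    # Global rank, world size, and per-node rank come straight from Slurm.
    export RANK="${SLURM_PROCID:?SLURM_PROCID is not set}"
    export WORLD_SIZE="${SLURM_NTASKS:?SLURM_NTASKS is not set}"
    export LOCAL_RANK="${SLURM_LOCALID:?SLURM_LOCALID is not set}"

    # The rendezvous host is the first node in the allocation.
    local nodelist="${SLURM_NODELIST:?SLURM_NODELIST is not set}"
    if command -v scontrol >/dev/null 2>&1; then
        # scontrol expands compressed lists such as node[01-04].
        MASTER_ADDR="$(scontrol show hostnames "$nodelist" | head -n 1)"
    else
        # Assumption: fallback for plain comma-separated lists only.
        MASTER_ADDR="${nodelist%%,*}"
    fi
    export MASTER_ADDR

    # Assumption: keep a caller-provided port, otherwise default to 29500.
    export MASTER_PORT="${MASTER_PORT:-29500}"
}
```

Calling `derive_dist_env` inside an srun task and then echoing `$RANK`, `$LOCAL_RANK`, and `$MASTER_ADDR` is a quick way to confirm that each of the 8 tasks per node sees a distinct rank and the same rendezvous host.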
## Cluster Environment

### Container

```bash
CONTAINER_IMAGE="<PATH_TO_YOUR_CONTAINER>.sqsh"
CONTAINER_MOUNTS="<SHARED_FS>:<SHARED_FS>,<PATH_TO_MEGATRON_BRIDGE>:/opt/Megatron-Bridge,<PATH_TO_DATA>:/opt/data"
```

### Standard Paths

```bash
WORKDIR="/opt/Megatron-Bridge"
DATA_PATH="<PATH_TO_PREPROCESSED_DATA>/dclm_01_01_text_document"
```

### Tokens / Caches

```bash
# Placeholders are quoted so the shell does not parse <...> as redirection
export GH_TOKEN="<YOUR_GITHUB_TOKEN>"
export HF_TOKEN="<YOUR_HF_TOKEN>"
export HF_HOME="<SHARED_FS>/HF_HOME"
export UV_CACHE_DIR="<SHARED_FS>/uv_cache"
export NEMO_HOME="<SHARED_FS>/cache/nemo"
```

**Important**: `NEMO_HOME` must point to a shared filesystem (e.g. Lustre) for multi-node SFT/PEFT jobs. The default (`/root/.cache/nemo`) is container-local and not shared across nodes. Without a shared `NEMO_HOME`, packed-sequence data files prepared on node 0 are invisible to other nodes, causing `TypeError: 'NoneType' object is not an iterator`.

### Log Directory

```text
<SHARED_FS>/logs/<job_name>_<suffix>
```

## srun-native Approach (Preferred)

Slurm spawns all processes directly. No `torch.distributed.run`, no TRAIN_CMD escaping.

### SBATCH Headers

```bash
#SBATCH --job-name=<model>-<task>
#SBATCH --nodes=<NNODES>
#SBATCH --ntasks-per-node=8 # Slurm spawns 8 tasks per node
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00
#SBATCH --account=<YOUR_ACCOUNT>
#SBATCH --partition=batch
#SBATCH --output=<SHARED_FS>/logs/<job_name>_%j.log
#SBATCH --exclusive
```

### Build and Launch

Two-phase srun: first a single-process srun to populate the uv cache, then the full multi-node srun.
```bash
# Env exports at sbatch level (before srun)
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
export NCCL_NVLS_ENABLE=0

# Phase 1: Single-process uv sync to build/populate the shared cache
srun --mpi=pmix -N 1 --ntasks=1 \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd $WORKDIR && uv sync"

# Phase 2: Full multi-node run (uv sync is a fast no-op since cache is warm)
srun --mpi=pmix \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd $WORKDIR && uv sync && uv run --no-sync python <script.py> <args>"
```

### srun-native Key Points

- Phase 1 runs `uv sync` once on a single node/process, building all wheels into the shared cache on Lustre
- Phase 2's `uv sync` is a fast no-op (everything is cached), so it is safe to run on all ranks without sleep guards
- `initialize.py` + …