perf-memory-tuning
Techniques for reducing peak GPU memory in Megatron Bridge — expandable segments, parallelism resizing, activation recompute, CPU offloading constraints, and common OOM fixes.
---
name: perf-memory-tuning
description: Techniques for reducing peak GPU memory in Megatron Bridge - expandable segments, parallelism resizing, activation recompute, CPU offloading constraints, and common OOM fixes.
when_to_use: GPU OOM errors, reducing peak memory, or tracing an OOM regression to a specific commit or config change; 'out of memory', 'OOM', 'memory fragmentation', 'expandable_segments', 'reduce GPU memory', 'PYTORCH_CUDA_ALLOC_CONF'.
---

# Memory Tuning

Stable docs: @docs/parallelisms.md
Card: @skills/perf-memory-tuning/card.yaml

## What It Is

GPU OOM failures during training often stem from memory **fragmentation** rather than raw capacity. PyTorch's default CUDA allocator can leave unusable gaps between allocations. The single most effective fix is:

```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

This tells PyTorch to use expandable (non-fixed-size) memory segments, which dramatically reduces fragmentation and often eliminates borderline OOM without any model or parallelism changes.

Beyond fragmentation, actual peak memory is determined by:

- **Parameter + optimizer state memory**: controlled by TP, PP, DP sharding (distributed optimizer, FSDP)
- **Activation memory**: controlled by activation recompute, sequence length, micro-batch size
- **Temporary / workspace memory**: CUDA kernels, NCCL buffers, CUDA graphs

## Quick Decision

When a training run OOMs or is close to the memory limit:

1. **Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` first.** This fixes fragmentation-induced OOM with zero performance cost. Most Slurm launch templates already include it.
2. **Add selective activation recompute** (`recompute_modules=[core_attn]`) if not already enabled. See @skills/perf-activation-recompute/SKILL.md.
3. **Avoid increasing TP** as a memory fix. Doubling TP dramatically increases NVLink all-reduce volume and often kills throughput (-28% on Llama3 70B).
4. **Avoid increasing PP at the cost of DP.** Halving DP doubles gradient accumulation steps and hurts throughput (~6%).
5. Consider `mlp` recompute if still OOM. Saves ~3 GB but costs ~16% GPU utilization on large dense models (Llama3 70B).
6. CPU offloading is **blocked when PP > 1**.

## Enablement

### Expandable segments (recommended first step)

Set in the job's environment before launching:

```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

In Slurm scripts this is typically placed alongside other env vars:

```bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

No model config changes needed. Zero throughput cost.

### Parallelism resizing

If the model genuinely does not fit (not fragmentation), adjust parallelism:

| Strategy | Memory effect | Throughput cost | Notes |
|---|---|---|---|
| Increase PP (keeping DP) | Fewer layers per stage | Moderate (~6% if DP halved) | Only if GPU count allows |
| Increase TP | Fewer params per GPU | Severe (-28% on 70B) | Last resort |
| Distributed optimizer | Shards optimizer state across DP ranks | ~1-2% | Recommended for large models |
| FSDP | Shards params + grads + optimizer | Varies | See @skills/perf-megatron-fsdp/SKILL.md |

### Activation recompute

See @skills/perf-activation-recompute/SKILL.md for full details.

### CPU offloading

```python
cfg.model.cpu_offloading = True
```

**Incompatible with PP > 1.** Only usable when `pipeline_model_parallel_size = 1`.

## A Note on VPP

Virtual pipeline parallelism (VPP) is primarily a **throughput** optimization that reduces pipeline bubble overhead by interleaving smaller model chunks. Its effect on peak memory is minimal: changing VPP does not meaningfully change the total activation, parameter, or optimizer memory on a GPU.

In earlier experiments we incorrectly attributed an OOM fix to VPP tuning (VPP 5→10).
The actual fix was `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` w …
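
Because `PYTORCH_CUDA_ALLOC_CONF` is a comma-separated list of `key:value` options, setting it blindly in a launcher can overwrite options a job already relies on. The following is a minimal Python sketch, not part of Megatron Bridge, of merging `expandable_segments:True` into an existing value. The helper name `with_expandable_segments` is hypothetical, and the sketch assumes the variable is set before PyTorch initializes CUDA.

```python
import re

# Matches an existing expandable_segments entry such as "expandable_segments:False".
_EXPANDABLE = re.compile(r"expandable_segments\s*:\s*\w+")


def with_expandable_segments(existing: str = "") -> str:
    """Return an allocator config string with expandable_segments:True set.

    Hypothetical helper for illustration. Other options in `existing`
    (for example max_split_size_mb) are left untouched.
    """
    existing = existing.strip().strip(",")
    if _EXPANDABLE.search(existing):
        # Overwrite whatever value was set before.
        return _EXPANDABLE.sub("expandable_segments:True", existing)
    if existing:
        # Append to the options that are already present.
        return existing + ",expandable_segments:True"
    return "expandable_segments:True"
```

Assigning the result to `os.environ["PYTORCH_CUDA_ALLOC_CONF"]` at the very top of a launch script, before any CUDA work, should have the same effect as the `export` line shown earlier while preserving any other allocator options.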