perf-memory-tuning

Techniques for reducing peak GPU memory in Megatron Bridge — expandable segments, parallelism resizing, activation recompute, CPU offloading constraints, and common OOM fixes.

---
name: perf-memory-tuning
description: Techniques for reducing peak GPU memory in Megatron Bridge — expandable segments, parallelism resizing, activation recompute, CPU offloading constraints, and common OOM fixes.
when_to_use: GPU OOM errors, reducing peak memory, or tracing an OOM regression to a specific commit or config change; 'out of memory', 'OOM', 'memory fragmentation', 'expandable_segments', 'reduce GPU memory', 'PYTORCH_CUDA_ALLOC_CONF'.
---

# Memory Tuning

Stable docs: @docs/parallelisms.md
Card: @skills/perf-memory-tuning/card.yaml

## What It Is

GPU OOM failures during training often stem from memory **fragmentation** rather
than raw capacity.  PyTorch's default CUDA allocator can leave unusable gaps
between allocations.  The single most effective fix is:

```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

This tells PyTorch to use expandable memory segments, which can grow in place
instead of being carved into fixed-size blocks. That substantially reduces
fragmentation and often eliminates borderline OOMs without any model or
parallelism changes.
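
One way to check whether fragmentation is a plausible culprit is to compare reserved and allocated bytes shortly before the failure. The sketch below is a minimal helper, not part of Megatron Bridge: `fragmentation_summary` is a hypothetical name, and it only reads two counters from the dictionary returned by `torch.cuda.memory_stats()`. A large reserved-but-unallocated share is a hint, not proof, of fragmentation.

```python
GIB = 2**30


def fragmentation_summary(stats):
    """Summarize reserved vs. allocated memory from a memory_stats()-style dict.

    `stats` is expected to contain the keys used below, as returned by
    torch.cuda.memory_stats(). The function name is hypothetical.
    """
    reserved = stats["reserved_bytes.all.current"]
    allocated = stats["allocated_bytes.all.current"]
    unused = reserved - allocated  # held by the caching allocator, not by tensors
    return {
        "reserved_gib": reserved / GIB,
        "allocated_gib": allocated / GIB,
        "unused_fraction": unused / reserved if reserved else 0.0,
    }


# Typical use inside a training process (requires a CUDA build of PyTorch):
#   import torch
#   print(fragmentation_summary(torch.cuda.memory_stats()))
```

If `unused_fraction` is large when the OOM occurs, trying expandable segments before touching the model or parallelism config is a reasonable first step.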

Beyond fragmentation, actual peak memory is determined by:

- **Parameter + optimizer state memory** — controlled by TP, PP, DP sharding
  (distributed optimizer, FSDP)
- **Activation memory** — controlled by activation recompute, sequence length,
  micro-batch size
- **Temporary / workspace memory** — CUDA kernels, NCCL buffers, CUDA graphs
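
The first bullet can be estimated with back-of-the-envelope arithmetic. The sketch below is a rough model, not a measurement: it assumes bf16 parameters, fp32 gradients, and Adam with fp32 master weights (18 bytes per parameter without the distributed optimizer, 6 + 12/DP with it), and it assumes parameters split evenly across TP and PP ranks. The function name is hypothetical, and activations, workspace memory, embeddings, and uneven pipeline stages are ignored.

```python
GIB = 2**30


def param_and_optimizer_gib(num_params, tp, pp, dp, distributed_optimizer=True):
    """Rough per-GPU GiB for weights, gradients, and Adam state.

    Assumes bf16 params (2 B), fp32 grads (4 B), and fp32 master params plus
    two Adam moments (12 B), with the 12 B sharded over DP ranks when the
    distributed optimizer is enabled.
    """
    params_per_gpu = num_params / (tp * pp)  # even split is an assumption
    if distributed_optimizer:
        bytes_per_param = 6 + 12 / dp
    else:
        bytes_per_param = 18
    return params_per_gpu * bytes_per_param / GIB


if __name__ == "__main__":
    # Illustrative layout only: a 70B-parameter model on TP=4, PP=4, DP=8.
    for dist_opt in (False, True):
        gib = param_and_optimizer_gib(70e9, tp=4, pp=4, dp=8,
                                      distributed_optimizer=dist_opt)
        print(f"distributed_optimizer={dist_opt}: {gib:.1f} GiB per GPU")
```

Comparing the two printed values shows why sharding optimizer state across DP ranks is usually a cheaper memory lever than resizing TP or PP.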

## Quick Decision

When a training run OOMs or is close to the memory limit:

1. **Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` first.** This fixes
   fragmentation-induced OOM with zero performance cost. Most Slurm launch
   templates already include it.
2. **Add selective activation recompute** (`recompute_modules=["core_attn"]`) if
   not already enabled. See @skills/perf-activation-recompute/SKILL.md.
3. **Avoid increasing TP** as a memory fix — doubling TP dramatically increases
   NVLink all-reduce volume and often kills throughput (-28% on Llama3 70B).
4. **Avoid increasing PP at the cost of DP.** At a fixed GPU count and global
   batch size, halving DP doubles the gradient accumulation steps and hurts
   throughput (~6%).
5. Consider `mlp` recompute if the run still OOMs. It saves ~3 GB but costs
   ~16% GPU utilization on large dense models (Llama3 70B).
6. CPU offloading is **blocked when PP > 1**.
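
The DP trade-off in step 4 follows from simple arithmetic. The sketch below assumes a fixed GPU count and global batch size and ignores context and expert parallelism. The function names and the 64-GPU layout are illustrative only, not taken from a real run.

```python
def data_parallel_size(world_size, tp, pp):
    """DP size left over after TP and PP claim their share of the GPUs."""
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    return world_size // model_parallel


def num_microbatches(global_batch_size, micro_batch_size, dp):
    """Gradient accumulation steps per optimizer step on each DP rank."""
    per_pass = micro_batch_size * dp
    if global_batch_size % per_pass != 0:
        raise ValueError("global_batch_size must be divisible by micro_batch_size * dp")
    return global_batch_size // per_pass


if __name__ == "__main__":
    world_size, gbs, mbs = 64, 128, 1  # hypothetical job
    for pp in (4, 8):
        dp = data_parallel_size(world_size, tp=4, pp=pp)
        steps = num_microbatches(gbs, mbs, dp)
        print(f"PP={pp}: DP={dp}, microbatches per step={steps}")
```

Doubling PP on the same GPUs halves DP, so each rank has to process twice as many microbatches per optimizer step.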

## Enablement

### Expandable segments (recommended first step)

Set in the job's environment before launching:

```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

In Slurm scripts this is typically placed alongside other env vars:

```bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

No model config changes needed. Zero throughput cost.
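
When the job is started from a Python launcher rather than a shell script, the same setting can be merged into any existing allocator options. This is a minimal sketch with a hypothetical helper name. It assumes the variable is set before PyTorch initializes CUDA in the process, so it should run before `torch` is imported to be safe.

```python
import os


def with_expandable_segments(current):
    """Return an allocator config string that includes expandable_segments:True.

    `current` is the existing PYTORCH_CUDA_ALLOC_CONF value (or None). Other
    comma-separated options are preserved in their original order.
    """
    options = [opt.strip() for opt in (current or "").split(",") if opt.strip()]
    # Drop any earlier expandable_segments entry so the new value wins.
    options = [opt for opt in options if not opt.startswith("expandable_segments:")]
    options.append("expandable_segments:True")
    return ",".join(options)


# Apply to this process (and any child processes it spawns).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = with_expandable_segments(
    os.environ.get("PYTORCH_CUDA_ALLOC_CONF")
)
```

Merging instead of overwriting keeps options such as `max_split_size_mb` that a cluster template may already set.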

### Parallelism resizing

If the model genuinely does not fit (not fragmentation), adjust parallelism:

| Strategy | Memory effect | Throughput cost | Notes |
|---|---|---|---|
| Increase PP (keeping DP) | Fewer layers per stage | Moderate (~6% if DP halved) | Only if GPU count allows |
| Increase TP | Fewer params per GPU | Severe (-28% on 70B) | Last resort |
| Distributed optimizer | Shards optimizer state across DP ranks | ~1-2% | Recommended for large models |
| FSDP | Shards params + grads + optimizer | Varies | See @skills/perf-megatron-fsdp/SKILL.md |

### Activation recompute

See @skills/perf-activation-recompute/SKILL.md for full details.

### CPU offloading

```python
cfg.model.cpu_offloading = True
```

**Incompatible with PP > 1.** Only usable when `pipeline_model_parallel_size = 1`.
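
A pre-launch guard can surface this constraint before any GPUs are allocated. The sketch below is hypothetical and not part of Megatron Bridge: `check_cpu_offloading` is an illustrative name, and a `SimpleNamespace` stands in for the real model config, using only the two field names mentioned above.

```python
from types import SimpleNamespace


def check_cpu_offloading(model_cfg):
    """Raise early if CPU offloading is requested together with PP > 1."""
    pp = getattr(model_cfg, "pipeline_model_parallel_size", 1)
    offload = getattr(model_cfg, "cpu_offloading", False)
    if offload and pp > 1:
        raise ValueError(
            f"cpu_offloading=True requires pipeline_model_parallel_size=1, got {pp}"
        )


if __name__ == "__main__":
    # Stand-in config object for illustration only.
    ok_cfg = SimpleNamespace(cpu_offloading=True, pipeline_model_parallel_size=1)
    check_cpu_offloading(ok_cfg)  # passes silently

    bad_cfg = SimpleNamespace(cpu_offloading=True, pipeline_model_parallel_size=4)
    try:
        check_cpu_offloading(bad_cfg)
    except ValueError as err:
        print(f"rejected: {err}")
```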

## A Note on VPP

Virtual pipeline parallelism (VPP) is primarily a **throughput** optimization
that reduces pipeline bubble overhead by interleaving smaller model chunks. Its
effect on peak memory is minimal — changing VPP does not meaningfully change
the total activation, parameter, or optimizer memory on a GPU.

In earlier experiments we incorrectly attributed an OOM fix to VPP tuning
(VPP 5→10). The actual fix was `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
w

…