perf-moe-long-context

Long-context MoE training guidance for Megatron Bridge. Covers CP sizing, selective recompute, dispatcher choices, and practical patterns from DSV3, Qwen3, and Qwen3-Next long-context experiments.

---
name: perf-moe-long-context
description: Long-context MoE training guidance for Megatron Bridge. Covers CP sizing, selective recompute, dispatcher choices, and practical patterns from DSV3, Qwen3, and Qwen3-Next long-context experiments.
when_to_use: Training MoE at long sequence lengths, or investigating a commit that caused long-context MoE OOM or degraded throughput; 'long context MoE', '128k tokens', 'CP sizing for long sequences', 'selective recompute long context', 'MoE long-context OOM'.
---

# MoE Long-Context Training

Stable docs: @docs/training/moe-optimization.md
Card: @skills/perf-moe-long-context/card.yaml

## What Changes At Long Context

Once sequence length moves well past the 4K-class regime, attention memory and
activation residency become the dominant constraints. For MoE models, that
usually means you need some combination of:

- context parallelism
- selective recompute
- lower precision
- CPU offload for optimizer state
- a dispatcher and PP layout that do not waste the smaller remaining DP budget

## Rounded Scaling Patterns

### DSV3 on H100

The DSV3 long-context runs show a stable pattern:

- selective recompute works better than full recompute once you move past the
  shortest contexts
- throughput stays in a fairly narrow band from mid-length through very long
  contexts if CP is increased appropriately
- the binding constraint shifts from "memory fit" to "GPU-count feasibility" as
  CP grows

In other words, long context does not immediately collapse utilization if the
layout is chosen well, but it does consume the DP budget very quickly.
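
The DP squeeze can be made concrete with a few lines of arithmetic. This is a minimal sketch, assuming the usual decomposition `world_size = TP * CP * PP * DP` for the non-expert layers; the helper name `data_parallel_size` and the 1024-GPU job size are hypothetical and not taken from the experiments above.

```python
def data_parallel_size(world_size: int, tp: int, cp: int, pp: int) -> int:
    """Return the DP degree left over once TP, CP, and PP are fixed.

    Assumption: world_size = TP * CP * PP * DP for the non-expert layers.
    """
    model_parallel = tp * cp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*CP*PP={model_parallel}"
        )
    return world_size // model_parallel


if __name__ == "__main__":
    WORLD_SIZE = 1024  # hypothetical job size, for illustration only
    for cp in (1, 4, 16, 32):
        dp = data_parallel_size(WORLD_SIZE, tp=1, cp=cp, pp=8)
        print(f"CP={cp:>2} -> DP={dp}")
```

Each doubling of CP halves the remaining DP at a fixed GPU count, which is why the limit tends to move from memory to GPU-count feasibility.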

### Qwen3-Next on GB200

Qwen3-Next behaves more like a memory-sensitive medium-scale model:

- 8K and 32K remain practical with moderate CP
- 64K is possible, but the throughput drop is noticeable and memory becomes
  much tighter
- pipeline layout and grouped-GEMM improvements matter almost as much as CP

### Qwen3 235B on GB200

Qwen3 235B shows that long context can still be efficient on NVL72 systems when
TP, CP, and HybridEP are coordinated. The best 128K-class configurations are
not just "fit-only" recipes; they can remain highly efficient if routing,
parallelism, and recompute are balanced.

## CP Sizing Rules Of Thumb

1. **Start from a 4K shard target**: a good first guess is
   `CP ~= seq_len / 4096`, then round to a practical power-of-two layout.

2. **Keep DP alive if possible**: long-context scaling becomes brittle once CP,
   EP, TP, and PP together squeeze DP down to the floor.

3. **Prefer selective recompute**: recompute modules such as `up_proj`, `norm`,
   `moe`, `moe_act`, or `mlp` before reaching for full recompute.

4. **Avoid SDPA-heavy recompute at very long context**: recomputing attention
   internals can add a lot of work for less memory benefit than recomputing
   smaller MoE and MLP-side modules.

5. **Use TP as another lever on NVL72 systems**: GB200 and GB300 runs can
   sometimes trade some CP for TP while still staying efficient.

6. **Assume GBS will need to shrink**: as CP rises and DP falls, you may need
   to reduce global batch size (GBS) or accept more gradient accumulation (GA)
   steps.
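
Rules 1 and 6 can be sketched as two small helpers. This is an illustration under stated assumptions, not a Megatron Bridge API: the function names are hypothetical, the rounding policy (round up to the next power of two so each CP rank holds at most the shard target) is one reasonable reading of "round to a practical power-of-two layout", and the batch relation assumed is `GBS = MBS * DP * GA`.

```python
def first_guess_cp(seq_len: int, shard_target: int = 4096) -> int:
    """Smallest power-of-two CP such that seq_len / CP <= shard_target.

    Assumption: rounding up, so the per-rank shard never exceeds the target.
    """
    if seq_len <= 0 or shard_target <= 0:
        raise ValueError("seq_len and shard_target must be positive")
    cp = 1
    while seq_len > cp * shard_target:
        cp *= 2
    return cp


def grad_accum_steps(gbs: int, mbs: int, dp: int) -> int:
    """Gradient accumulation steps implied by GBS = MBS * DP * GA (assumed)."""
    per_step = mbs * dp
    if per_step <= 0 or gbs % per_step != 0:
        raise ValueError(f"GBS={gbs} must be a positive multiple of MBS*DP={per_step}")
    return gbs // per_step


if __name__ == "__main__":
    for seq_len in (8 * 1024, 32 * 1024, 128 * 1024, 256 * 1024):
        print(f"seq_len={seq_len:>6} -> first-guess CP={first_guess_cp(seq_len)}")
    # With a fixed GBS, GA rises as DP falls.
    for dp in (8, 4, 1):
        print(f"DP={dp} -> GA={grad_accum_steps(gbs=64, mbs=1, dp=dp)}")
```

Treat the result as a starting point only: memory headroom, TP, and the dispatcher can all justify a smaller or larger CP than the first guess.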

## Representative Config Families

### DSV3 at 128K on H100

```text
TP=1  CP=32  EP=32  PP=8  VPP=4
Precision: FP8-class
Dispatcher: DeepEP
Recompute: up_proj, norm, moe, mlp
Extra memory help: optimizer CPU offload
```

### DSV3 at 256K on H100

```text
TP=1  CP=64  EP=32  PP=8  EDP=2  VPP=4
Precision: FP8-class
Dispatcher: DeepEP
Recompute: up_proj, norm, moe, mlp
Extra memory help: optimizer CPU offload
```

### Qwen3 235B at 128K on GB200

```text
TP=4  CP=4  EP=32  PP=4  VPP=12
Precision: BF16 or MXFP8
Dispatcher: HybridEP
Recompute: moe_act, norm
CUDA Graph: attn + moe_router + moe_preprocess
```

## Recompute And CUDA Graph Guidance

For long-context MoE training:

- start with selective recompute
- add CUDA graphs only after the shapes and routing path are stable
- keep sequence length and MBS fixed when using CUDA graphs
- if the run depends on highly dynamic batches, prefer

…