perf-expert-parallel-overlap

---
name: perf-expert-parallel-overlap
description: Validate and use MoE expert-parallel communication overlap in Megatron-Bridge, including overlap_moe_expert_parallel_comm, delay_wgrad_compute, and flex dispatcher backends such as DeepEP and HybridEP.
when_to_use: Enabling EP overlap to hide dispatch/combine latency, or tracing a throughput regression to an EP overlap config change; 'overlap_moe_expert_parallel_comm', 'delay_wgrad_compute', 'flex dispatcher', 'DeepEP overlap', 'HybridEP overlap'.
---

# MoE Expert-Parallel Overlap Skill

## References

- Stable docs: @docs/training/communication-overlap.md
- Structured metadata: @skills/perf-expert-parallel-overlap/card.yaml

## What It Is

Expert-parallel (EP) overlap hides the cost of token dispatch/combine all-to-all
communication by running it concurrently with expert FFN compute. Optionally,
delayed expert weight-gradient computation (`delay_wgrad_compute`) adds further
overlap by deferring wgrad so that it runs alongside the next layer's forward.

Bridge supports two dispatcher paths:

| Dispatcher | Backend | When to use |
|---|---|---|
| `alltoall` | Standard MoE all-to-all | Default, broadest compatibility |
| `flex` | DeepEP or HybridEP | Higher overlap on Ampere/Hopper/Blackwell |

## Quick Decision

Use EP overlap when:

- the model is MoE with `EP > 1`
- expert dispatch/combine communication is a meaningful part of step time
- you have memory headroom and are tuning for throughput

Prefer:

- `alltoall` dispatcher for the first rollout (broader compatibility)
- `flex` + DeepEP/HybridEP when running on supported GPUs and seeking
  additional gains
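
The dispatcher preference above can be sketched as a small helper. This is an illustration only: `flex_backends_for` and `pick_dispatcher` are hypothetical names, the GPU labels are plain strings chosen for this example, and the support sets are transcribed from the Compatibility And Constraints list in this skill (GB200/GB300 entries assume NVL72 systems). Returning DeepEP first when both backends are available is an arbitrary choice, not a recommendation from the docs.

```python
from typing import List, Optional, Tuple

# Support sets transcribed from this skill's compatibility list.
DEEPEP_GPUS = {"ampere", "hopper", "b200", "b300"}
# Assumption: "gb200"/"gb300" here always mean NVL72 systems.
HYBRIDEP_GPUS = DEEPEP_GPUS | {"gb200", "gb300"}


def flex_backends_for(gpu: str) -> List[str]:
    """Return the flex backends this skill lists as supported for a GPU label."""
    key = gpu.strip().lower()
    backends = []
    if key in DEEPEP_GPUS:
        backends.append("deepep")
    if key in HYBRIDEP_GPUS:
        backends.append("hybridep")
    return backends


def pick_dispatcher(gpu: str, first_rollout: bool = True) -> Tuple[str, Optional[str]]:
    """Pick (dispatcher_type, flex_backend) following the 'Prefer' guidance."""
    backends = flex_backends_for(gpu)
    if first_rollout or not backends:
        # Broadest compatibility: start with (or fall back to) alltoall.
        return ("alltoall", None)
    return ("flex", backends[0])


if __name__ == "__main__":
    print(pick_dispatcher("Hopper"))                       # first rollout
    print(pick_dispatcher("Hopper", first_rollout=False))  # tuning for more gain
    print(pick_dispatcher("GB200", first_rollout=False))
```

The returned backend string is what would be passed as `moe_flex_dispatcher_backend` when the flex path is chosen.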

Avoid EP overlap when:

- full activation recompute is enabled
- `moe_shared_expert_overlap` is enabled
- the run is still being brought up for correctness
- PyTorch < 2.6.0

Expected outcome:

- if all-to-all dispatch/combine shows up as a clear bottleneck in the profile,
  overlap can yield a modest to meaningful speedup
- if the run is tiny, communication-light, or dominated by another bottleneck,
  the gain may be negligible
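
To put rough numbers on the expected outcome, the sketch below computes an idealized upper bound on step-time speedup. It assumes a simple model that is not taken from the Bridge docs: the exposed dispatch/combine time is a fraction `comm_fraction` of the step, overlap hides a fraction `hidden_fraction` of it, and nothing else in the step changes. The function name and both parameters are hypothetical, and real gains are usually lower than this bound.

```python
def overlap_speedup_bound(comm_fraction: float, hidden_fraction: float = 1.0) -> float:
    """Idealized speedup if `hidden_fraction` of exposed EP comm time is hidden.

    New step time = 1 - comm_fraction * hidden_fraction (in units of the old step),
    so the speedup is the reciprocal of that.
    """
    if not 0.0 <= comm_fraction < 1.0:
        raise ValueError("comm_fraction must be in [0, 1)")
    if not 0.0 <= hidden_fraction <= 1.0:
        raise ValueError("hidden_fraction must be in [0, 1]")
    return 1.0 / (1.0 - comm_fraction * hidden_fraction)


if __name__ == "__main__":
    # Communication-heavy run: 20% of step time is exposed all-to-all.
    print(round(overlap_speedup_bound(0.20), 3))
    # Communication-light run: 2% exposed, so little to gain.
    print(round(overlap_speedup_bound(0.02), 3))
    # Only half of the exposed time ends up hidden.
    print(round(overlap_speedup_bound(0.20, hidden_fraction=0.5), 3))
```

Measure `comm_fraction` from a profile of the baseline run; if it is only a few percent, the bound itself is small and overlap is unlikely to be worth the added constraints.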

## Enablement

### alltoall dispatcher

```python
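# Overlap dispatch/combine all-to-all with expert compute.
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
# Optional: defer expert wgrad for additional overlap.
cfg.comm_overlap.delay_wgrad_compute = True
# Must stay disabled when EP overlap is on.
cfg.model.moe_shared_expert_overlap = False

cfg.model.expert_model_parallel_size = 8  # EP > 1 is required
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.bf16 = True  # base precision must be BF16 or FP16
cfg.model.fp16 = False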
```

### flex dispatcher (DeepEP or HybridEP)

```python
from megatron.bridge.training.flex_dispatcher_backend import apply_flex_dispatcher_backend

# Same overlap flags as the alltoall path.
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False

# Select the flex dispatcher backend on the model config.
apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="deepep")
# or: apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="hybridep")
```

## Compatibility And Constraints

- `expert_model_parallel_size > 1`
- `num_moe_experts > 1`
- `moe_token_dispatcher_type` must be `"alltoall"` or `"flex"`
- `moe_shared_expert_overlap = False`
- Base precision is BF16 or FP16
- PyTorch `>= 2.6.0`
- If `PP > 1`, `virtual_pipeline_model_parallel_size` must be set
- `recompute_granularity != "full"`, `recompute_method = None`,
  `recompute_num_layers = None`
- `mtp_num_layers` must be `None` or `1`
- `delay_wgrad_compute` requires `overlap_moe_expert_parallel_comm = True`
- `delay_wgrad_compute` with `overlap_grad_reduce` requires TE >= 2.7.0
- `delay_wgrad_compute` with `gradient_accumulation_fusion` requires TE >= 2.7.0
- CUDA graph `attn` scope + `delay_wgrad_compute` requires TE >= 2.12.0,
  `gradient_accumulation_fusion = True`, and no attention bias
- DeepEP: Ampere, Hopper, B200, B300 GPUs only
- HybridEP: Ampere, Hopper, B200, B300, GB200/GB300 with NVL72

## Minimal Working Config

```python
cfg.comm_overlap.overlap_moe_expert_parallel_c

…