
nightly-sync

General · 0 installs · Updated 19d ago
Verified · Curated · NVIDIA

Domain knowledge for the nightly main-to-dev sync workflow. Covers merge strategy, CI architecture, failure investigation, and known issues.

SKILL.md preview

---
name: nightly-sync
description: Domain knowledge for the nightly main-to-dev sync workflow. Covers merge strategy, CI architecture, failure investigation, and known issues.
when_to_use: Working on the nightly sync PR; investigating a nightly sync failure; resolving merge conflicts between main and dev; 'nightly sync failed', 'main-to-dev merge', 'sync bot'.
---

# Nightly Sync: Main to Dev

This skill is read by the automated sync bot during the nightly-sync-main-to-dev
workflow. It contains all domain knowledge for merging main into dev, resolving
conflicts, iterating on CI, and shipping the PR.

---

## Phase 1: Create the Sync Branch and Merge

### Branch Setup

1. Create branch `$BRANCH` from `origin/dev`
2. Merge: `git merge origin/main -X theirs --no-edit`
3. If conflicts remain (e.g. add/add), resolve by favoring main
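
The three steps above can be sketched as a shell function. This is a minimal sketch, not the workflow's actual script: `sync_merge` is a hypothetical name, `--no-track` is an added precaution, and the fallback loop is only one way to "favor main" for paths that `-X theirs` leaves unmerged.

```sh
# Usage: sync_merge <branch> [dev-ref] [main-ref]
# Hypothetical helper; refs default to the ones named in the steps above.
sync_merge() (
    branch=$1
    dev_ref=${2:-origin/dev}
    main_ref=${3:-origin/main}

    # Step 1: branch from dev. --no-track (an assumption, not from the steps)
    # keeps the new branch from adopting dev as its upstream.
    git checkout --no-track -b "$branch" "$dev_ref" || exit 1

    # Step 2: merge main; -X theirs resolves conflicting hunks in main's favor.
    git merge "$main_ref" -X theirs --no-edit && exit 0

    # Step 3: for paths still unmerged, take main's side.
    git diff --name-only --diff-filter=U | while IFS= read -r f; do
        if git checkout --theirs -- "$f" 2>/dev/null; then
            git add -- "$f"
        else
            # main has no version of this path (e.g. modify/delete): drop it
            git rm -q -- "$f"
        fi
    done

    # Conclude the merge with the default merge message.
    git commit -q --no-edit
)
```

Called as `sync_merge "$BRANCH"`, it exits non-zero if the merge cannot be concluded, so a caller can stop and inspect the tree by hand.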

### Preserving Dev-Only Additions

Do NOT blanket-override all shared files with main's version. Dev has features
not yet in main (new classes, new modules, new tests). The merge preserves both
sides' non-conflicting additions — only intervene where there is an actual
conflict.

### Squash-Merge Chain Detection

Dev often develops features as a chain of PRs (PR1 → PR2 → PR3) where each
builds on the last. When PR1 is squash-merged to main, the squashed commit has
no ancestry link to dev's original commits, so git treats the two as unrelated
changes to the same lines. Wherever they conflict, `-X theirs` picks main's PR1
code and silently discards PR2/PR3's improvements on dev.

After the merge, check for this pattern:

1. For each file where `-X theirs` resolved a conflict, run
   `git log --oneline origin/dev -- <file>` to see if dev has commits that
   came AFTER the code main is bringing in.
2. If dev has follow-up commits (bug fixes, refactors, extensions), **favor
   dev's version** for those sections.
3. If the conflict is just main bringing in a clean copy of what dev already
   has (no follow-ups), main's version is fine.

Practical check: after the merge, run `git diff origin/dev -- <file>` on each
conflicted file. Lines marked `-` are dev code that the merge dropped. If dev's
code was removed or reverted, investigate whether dev's version is the more
evolved one.
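
One way to find candidate files for this check is sketched below. It assumes that "files where `-X theirs` resolved a conflict" can be approximated by files changed on both sides since the merge base. That is a superset of the real conflicts, and `dev_followups` is a hypothetical helper name.

```sh
# Usage: dev_followups [dev-ref] [main-ref]
# For each file changed on BOTH sides since the merge base, print the dev
# commits that touched it, so follow-up work on dev is easy to spot.
dev_followups() (
    dev_ref=${1:-origin/dev}
    main_ref=${2:-origin/main}
    base=$(git merge-base "$dev_ref" "$main_ref") || exit 1

    # Files main changed since the merge base...
    git diff --name-only "$base" "$main_ref" | while IFS= read -r f; do
        # ...that dev also touched since the merge base.
        commits=$(git log --oneline "$base..$dev_ref" -- "$f")
        [ -n "$commits" ] || continue
        printf '== %s\n%s\n' "$f" "$commits"
    done
)
```

A file listed with several dev commits is a candidate for the dev-is-ahead case and deserves a manual look before main's version is accepted.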

Real examples from PR #4291:
- `emerging_optimizers.py`: Main's version was MORE complete — it squash-merged
  dev's PRs plus added more. `-X theirs` was correct.
- `distrib_optimizer.py`: Main overwrote dev's `GroupedQuantizedTensor` support.
  Had to restore `_is_distopt_quantized_param` and the expanded
  `_expand_quantized_param_shard_for_cast` loop while keeping main's NVFP4
  additions. This required a surgical merge combining sections from both.

Key insight: squash-merge chains can go in EITHER direction. Sometimes main
is ahead (it squash-merged dev's work + more), sometimes dev is ahead (it has
follow-up PRs). Always diff both ways before deciding which version to favor.
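
A minimal sketch of one reading of "diff both ways": compare the merged result against each side in turn. It assumes the merge is already committed on the current branch. `diff_both_ways` is a hypothetical helper, not part of the workflow.

```sh
# Usage: diff_both_ways <file> [dev-ref] [main-ref]
# Reports how the merged file (HEAD) relates to dev's and main's versions.
diff_both_ways() (
    f=$1
    dev_ref=${2:-origin/dev}
    main_ref=${3:-origin/main}

    for ref in "$dev_ref" "$main_ref"; do
        # git diff --quiet exits 0 when there are no differences.
        if git diff --quiet "$ref" HEAD -- "$f"; then
            echo "$f: merged result is identical to $ref"
        else
            echo "$f: merged result differs from $ref:"
            git diff --stat "$ref" HEAD -- "$f"
        fi
    done
)
```

If the merged file is identical to main, dev's side of that file was replaced entirely, which is the case worth checking against dev's commit history.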

### Files to Override from Main

These files have known semantic conflicts where dev's versions reference args
or APIs that main removed or renamed. Take main's version with
`git checkout origin/main -- <file>`:

- `megatron/training/training.py` — references dev-only args
- `megatron/training/initialize.py` — references dev-only args
- `megatron/training/utils.py` — references dev-only args
- `megatron/training/datasets/data_samplers.py` — references dev-only args
- `megatron/core/optimizer/layer_wise_optimizer.py` — constructor signature

**Caveat for ALL overrides:** After taking main's version of any file, you
MUST run the API Mismatch Detection procedure (see below) on that file.
Taking main's caller code while keeping dev's callee implementations is the
number-one source of sync bugs.
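
The override step can be sketched as a small loop. `take_from_main` is a hypothetical helper. The guard for the three dependency files reflects this skill's rule that they stay on dev's version, and the existence check is an added assumption so that a file removed from main is reported rather than aborting the loop.

```sh
# Usage: take_from_main <main-ref> <file>...
take_from_main() (
    main_ref=$1
    shift
    for f in "$@"; do
        case $f in
            pyproject.toml|uv.lock|docker/Dockerfile.ci.dev)
                # Coupled dependency files: keep dev's versions.
                echo "refusing: $f must stay on dev's version" >&2
                continue ;;
        esac
        # Only check out paths that actually exist on main.
        if git cat-file -e "$main_ref:$f" 2>/dev/null; then
            git checkout "$main_ref" -- "$f"
            echo "took main: $f"
        else
            echo "skip (not on $main_ref): $f" >&2
        fi
    done
)

# Example call with the files listed above:
# take_from_main origin/main \
#     megatron/training/training.py \
#     megatron/training/initialize.py \
#     megatron/training/utils.py \
#     megatron/training/datasets/data_samplers.py \
#     megatron/core/optimizer/layer_wise_optimizer.py
```

Each file reported as `took main` still needs the API Mismatch Detection procedure afterwards.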

**IMPORTANT: Do NOT take main's `pyproject.toml`, `uv.lock`, or
`docker/Dockerfile.ci.dev`.** These three files are a tightly coupled
triple — the Dockerfile's `uv sync` command must match the dependency
groups in `pyproject.toml`, and `uv.lock` must be consistent with both.
Main's versions are missing dev-only dependencies (e.g.
`fast-hadamard-transform`, correct TransformerEngine revision) and the
`--group no_pypi_