bump-base-image


---
name: bump-base-image
description: Bump the NVIDIA PyTorch base image (`nvcr.io/nvidia/pytorch:<YY.MM>-py3`) used by Megatron-LM CI. Covers the two pin sites (GitHub CI in `docker/.ngc_version.dev` and GitLab CI in `.gitlab/stages/01.build.yml`), the post-bump CI loop (re-run functional tests, refresh golden values, mark broken tests), and the gotchas that bit PRs #4611 and #4688.
when_to_use: User wants to upgrade the PyTorch container (e.g. "bump base image to 26.04"); CI is failing after a previous bump because the GitLab pin was missed; functional tests are failing with `lm loss` / `num-zeros` / `iteration-time` drift right after a container bump; a functional test hangs, times out, or OOMs after a bump; the user mentions `.ngc_version.dev`, `nvcr.io/nvidia/pytorch`, "container base image", or "Update Docker image version".
---

# Bump the PyTorch base image

End-to-end workflow for moving Megatron-LM's CI to a newer `nvcr.io/nvidia/pytorch:<YY.MM>-py3` container. The most common failure mode is forgetting that **GitHub CI and GitLab CI have separate pins** — a bump that only touches the former lands green, then breaks GitLab CI on `main` and forces an immediate follow-up PR. Always update both in the same PR.

## Inputs to gather from the user

1. **Target tag**, e.g. `26.04-py3`. NVIDIA NGC PyTorch containers are released as `nvcr.io/nvidia/pytorch:YY.MM-py3`.
2. **Scope** — usually `dev` only. The `lts` pin (`docker/.ngc_version.lts`, plus the `IMAGE_TYPE: lts` rows in GitLab) is bumped on a different cadence; only touch it if the user explicitly asks.
3. **Workflow run ID** (optional but typical) — after the first CI run, the user will provide a GitHub Actions run ID for golden-value refresh.
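
Before editing any pin, it can help to reject malformed tags early. The sketch below only checks the `YY.MM-py3` shape described in item 1; `is_valid_ngc_tag` is a hypothetical helper, not something shipped with Megatron-LM, and it does not confirm that the tag actually exists on NGC.

```bash
# Hypothetical helper (not part of Megatron-LM): succeed only if the
# argument looks like YY.MM-py3, e.g. 26.04-py3.
is_valid_ngc_tag() {
  # Two-digit year, a dot, a month from 01 to 12, then the literal "-py3".
  printf '%s\n' "$1" | grep -Eq '^[0-9]{2}\.(0[1-9]|1[0-2])-py3$'
}

# Usage:
#   is_valid_ngc_tag "26.04-py3" && echo "tag format ok"
#   is_valid_ngc_tag "26.4"      || echo "tag format rejected"
```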

## Workflow

```
- [ ] Step 1: Update the GitHub CI pin (docker/.ngc_version.dev)
- [ ] Step 2: Update the GitLab CI pin (.gitlab/stages/01.build.yml)
- [ ] Step 3: Open the PR with the `Run functional tests` label
- [ ] Step 4: Re-run failing tests via `/ok to test <commit-sha>`
- [ ] Step 5: For golden-value drift → refresh with the `update-golden-values` skill
- [ ] Step 6: For hangs / real regressions → mark tests `mr-broken` and file tracking issues
- [ ] Step 7: Verify both pins are in sync before merging
```

### Step 1 — GitHub CI pin

`docker/.ngc_version.dev` is a single-line file consumed by `docker/Dockerfile.ci.dev` (via `FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev)`). Overwrite it:

```bash
echo 'nvcr.io/nvidia/pytorch:<YY.MM>-py3' > docker/.ngc_version.dev
```

Historically the file has no trailing newline. Keeping it that way or adding one (as `echo` does) is fine, because the build reads the value with `$(cat ...)` and command substitution strips trailing newlines. Do **not** touch `docker/.ngc_version.lts` unless you are bumping LTS too.
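
To see why the trailing newline does not matter, here is a minimal sketch. Both `write_dev_pin` and `read_pin` are hypothetical helpers introduced for illustration; `read_pin` mimics the `$(cat ...)` read described above rather than reproducing the actual build script.

```bash
# Hypothetical helper: write the full image reference for a YY.MM tag.
# Arguments: <YY.MM> [pin-file]; the file defaults to docker/.ngc_version.dev.
write_dev_pin() (
  pin_file="${2:-docker/.ngc_version.dev}"
  # printf '%s' writes the value without a trailing newline.
  printf '%s' "nvcr.io/nvidia/pytorch:${1}-py3" > "$pin_file"
)

# Hypothetical helper: read a pin file through command substitution.
# Arguments: [pin-file]; the file defaults to docker/.ngc_version.dev.
read_pin() {
  # $(cat ...) drops trailing newlines, so a file written with echo and a
  # file written with printf '%s' yield the same value here.
  printf '%s\n' "$(cat "${1:-docker/.ngc_version.dev}")"
}

# Usage (from the repository root):
#   write_dev_pin 26.04
#   read_pin
```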

### Step 2 — GitLab CI pin

GitLab CI does **not** read `docker/.ngc_version.dev`. It hardcodes `BASE_IMAGE` in a `parallel: matrix:` block. Update the two `IMAGE_TYPE: dev` rows (one per platform):

```yaml
# .gitlab/stages/01.build.yml — under test:pre_build_image -> parallel.matrix
- IMAGE: CI_MCORE_DEV_IMAGE
  FILE: Dockerfile.ci.dev
  IMAGE_TYPE: dev
  BASE_IMAGE: nvcr.io/nvidia/pytorch:<YY.MM>-py3   # amd64 row
  PLATFORM: amd64
- IMAGE: CI_MCORE_DEV_IMAGE
  FILE: Dockerfile.ci.dev
  IMAGE_TYPE: dev
  BASE_IMAGE: nvcr.io/nvidia/pytorch:<YY.MM>-py3   # arm64 row
  PLATFORM: arm64
```
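
If you prefer not to edit the rows by hand, the rewrite can be sketched with `awk`. `bump_gitlab_dev_rows` is a hypothetical helper, and it assumes that each matrix entry starts with a `- ` list marker, that `IMAGE_TYPE` appears before `BASE_IMAGE` within an entry (as in the snippet above), and that `IMAGE_TYPE` lines carry no trailing comment. Any trailing comment on a rewritten `BASE_IMAGE` line is dropped. Review the resulting diff before committing.

```bash
# Hypothetical helper (not part of the repository): point every
# IMAGE_TYPE: dev row at a new tag and leave all other rows untouched.
# Arguments: <YY.MM> [gitlab-yaml]; defaults to .gitlab/stages/01.build.yml.
bump_gitlab_dev_rows() (
  new_image="nvcr.io/nvidia/pytorch:${1}-py3"
  gitlab_yaml="${2:-.gitlab/stages/01.build.yml}"
  tmp="$(mktemp)"
  awk -v img="$new_image" '
    /^[ \t]*-[ \t]/ { type = "" }      # a new matrix entry starts: forget the old type
    /IMAGE_TYPE:/   { type = $NF }     # remember the type of the current entry
    /^[ \t]*BASE_IMAGE:/ && type == "dev" {
      sub(/BASE_IMAGE:.*/, "BASE_IMAGE: " img)   # swap the value, keep the indent
    }
    { print }
  ' "$gitlab_yaml" > "$tmp" &&
    cat "$tmp" > "$gitlab_yaml"        # overwrite in place so the file mode is kept
  status=$?
  rm -f "$tmp"
  exit "$status"
)

# Usage (from the repository root):
#   bump_gitlab_dev_rows 26.04
#   git diff .gitlab/stages/01.build.yml
```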

Leave the `IMAGE_TYPE: lts` rows alone. Quick sanity check before commit:

```bash
rg -n '^\s*BASE_IMAGE: nvcr\.io/nvidia/pytorch:' .gitlab/stages/01.build.yml
# expect:  lts pin × 2 unchanged, dev pin × 2 == new tag
```
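
The same check can be made mechanical, which is also useful for Step 7. The sketch below compares every `IMAGE_TYPE: dev` row against the GitHub pin file. `pins_in_sync` is a hypothetical helper and relies on the same layout assumptions as the YAML snippet above (entries start with `- `, and `IMAGE_TYPE` precedes `BASE_IMAGE`).

```bash
# Hypothetical helper: succeed only if every IMAGE_TYPE: dev row in the
# GitLab matrix uses the same image as the GitHub pin file.
# Arguments: [pin-file] [gitlab-yaml]; defaults are the two repository paths.
pins_in_sync() (
  pin_file="${1:-docker/.ngc_version.dev}"
  gitlab_yaml="${2:-.gitlab/stages/01.build.yml}"
  expected="$(cat "$pin_file")"
  # Collect the BASE_IMAGE value of each dev entry, one per line.
  dev_images="$(awk '
    /^[ \t]*-[ \t]/ { type = "" }      # a new matrix entry starts
    /IMAGE_TYPE:/   { type = $NF }     # remember the type of the current entry
    /^[ \t]*BASE_IMAGE:/ && type == "dev" { print $2 }
  ' "$gitlab_yaml")"
  # Fail when no dev rows were found or when any row differs from the pin.
  [ -n "$dev_images" ] &&
    ! printf '%s\n' "$dev_images" | grep -Fxvq -- "$expected"
)

# Usage (from the repository root):
#   pins_in_sync && echo "pins match" || echo "pins differ"
```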

### Step 3 — Open the PR

- Title convention: `chore: Update Docker image version to <YY.MM>-py3` (see #4611).
- **Apply the `Run functional tests` label** before the first push. This unlocks the full functional matrix on the PR; without it the bump only runs the standard GitHub PR checks and you will miss any golden-value drift.
- Push as draft first if you're still iterating; the bot will auto-draft otherwise.
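
One way to create the PR with the title convention and the label already attached is the GitHub CLI. This is a sketch that assumes `gh` is installed and authenticated; `bump_pr_title` and `open_bump_pr` are hypothetical helpers, and the PR body text is only a placeholder.

```bash
# Hypothetical helper: build the PR title for a YY.MM tag, following the
# title convention listed above.
bump_pr_title() {
  printf 'chore: Update Docker image version to %s-py3\n' "$1"
}

# Hypothetical helper: open the bump PR as a draft with the functional-test
# label applied at creation time. Assumes the GitHub CLI (gh) is available.
open_bump_pr() {
  gh pr create --draft \
    --title "$(bump_pr_title "$1")" \
    --label "Run functional tests" \
    --body "Bump the dev base image to nvcr.io/nvidia/pytorch:${1}-py3."
}

# Usage (from the branch that carries both pin changes):
#   open_bump_pr 26.04
```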

### Step 4 — Re-running CI on a new commit

For PRs from fork