ci: use single-CUDA NVHPC Docker images to reduce runner disk usage by sbryngelson · Pull Request #1350 · MFlowCode/MFC

sbryngelson · 2026-04-01T22:29:42Z

Summary

Switch NVHPC CI containers from cuda_multi to single-CUDA tags (e.g. cuda12.6)
The cuda_multi images bundle every CUDA toolkit version, bloating the image and often exceeding GitHub runner disk limits
Each NVHPC version now pins to its highest bundled CUDA 12.x release
No change to build/test logic — only the container image tag changes

Test plan

All NVHPC matrix jobs pull the correct single-CUDA image
CPU build+test and GPU build-only pass across all NVHPC versions

Updated NVHPC configuration to simplify CUDA handling and improve disk management during CI runs.

github-actions · 2026-04-02T01:24:49Z

Claude Code Review

Head SHA: 4aaa559

Files changed:

2
.github/workflows/test.yml
CMakeLists.txt

Findings:

Lint failure: === echo separators in .github/ shell file — Three new echo statements added to test.yml use === separators and exceed 20 characters, violating the shell lint rules enforced by ./mfc.sh precheck for all files under .github/:
```
echo "=== Disk before cleanup ==="   # 25 chars, uses ===
echo "=== Disk after cleanup ==="    # 23 chars, uses ===
echo "=== NVHPC Environment ==="     # 22 chars, uses ===
```
The lint rule is: === separator patterns must be removed; echo separators must be ≤ 20 characters. These are in the "Free disk space" and "Setup NVHPC" steps. Replace with short labels such as echo "Disk before:" / echo "NVHPC env:" to stay under the limit and avoid ===.

The repo is bind-mounted into the container from the host runner, so git sees a different owner and emits 'fatal: detected dubious ownership' on every git command. This causes the test runner to dump the git diff help text 546 times (once per test), bloating the CI log to 80,000+ lines. Fix: git config --global --add safe.directory /workspace in Setup NVHPC.

qodo-code-review · 2026-04-02T17:59:02Z

Review Summary by Qodo

Optimize NVHPC CI with single-CUDA images and explicit disk management

✨ Enhancement 🐞 Bug fix

Walkthroughs

Description

• Replace NVHPC cuda_multi Docker images with single-CUDA tags to reduce runner disk usage
• Implement explicit disk cleanup before pulling large container images
• Refactor NVHPC container setup to use long-lived docker run with docker exec for better
  control
• Fix git "dubious ownership" errors by configuring safe directory in container
• Minor comment formatting fix in CMakeLists.txt

Diagram

flowchart LR
  A["NVHPC CI Setup"] -->|"Replace cuda_multi"| B["Single-CUDA Images"]
  A -->|"Add disk cleanup"| C["Free Disk Space"]
  C -->|"Then pull"| B
  B -->|"Long-lived container"| D["docker run sleep"]
  D -->|"Execute steps via"| E["docker exec"]
  E -->|"Source env vars"| F["Build & Test"]
  A -->|"Fix git config"| G["safe.directory"]
  G -->|"Suppress errors"| F

File Changes

1. .github/workflows/test.yml ✨ Enhancement +100/-47

Refactor NVHPC container management and disk handling

• Replaced inline container: directive with explicit docker pull and docker run -d commands to
 enable disk cleanup before pulling large images
• Added "Free disk space" step that removes unnecessary packages and prunes Docker images to prevent
 runner disk exhaustion
• Moved NVHPC environment variables from GitHub Actions env block to docker run -e flags for
 better container isolation
• Refactored NVHPC setup into a dedicated step using docker exec that installs dependencies,
 configures git safe directory, and sets up MPI runtime paths
• Wrapped all NVHPC build and test commands with docker exec nvhpc bash -c to run inside the
 long-lived container
• Added cleanup step to stop and remove the NVHPC container after job completion
• Updated comments to use consistent formatting with ── separators for better readability

.github/workflows/test.yml

2. CMakeLists.txt Formatting +1/-1

Fix comment formatting

• Fixed comment formatting by removing trailing dash and extra space for consistency

CMakeLists.txt

qodo-code-review · 2026-04-02T17:59:04Z

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0)

1. Broken heredoc terminator 🐞 Bug ≡ Correctness

Description

The NVHPC setup step writes /etc/nvhpc-env.sh with a heredoc whose closing EOF is indented, so
bash never recognizes the terminator and consumes the remainder of the script as heredoc content.
This prevents the environment setup from completing correctly and breaks later NVHPC steps that
source /etc/nvhpc-env.sh.

Code

.github/workflows/test.yml[R323-326]

+            cat > /etc/nvhpc-env.sh <<EOF
+          export LD_LIBRARY_PATH=${MPI_LIB}:${HPCX_DIR}/ucx/lib:${HPCX_DIR}/ucc/lib:\$LD_LIBRARY_PATH
+          export OMPI_MCA_rmaps_base_oversubscribe=1
+          EOF

Evidence

In the NVHPC setup script, cat > /etc/nvhpc-env.sh <<EOF starts a heredoc, but the delimiter EOF
is not at the start of the line (it’s indented), so bash won’t terminate the heredoc normally;
subsequent lines are treated as file content instead of commands. Later steps explicitly `source
/etc/nvhpc-env.sh`, so a malformed/missing file will cause NVHPC build/test to fail.

.github/workflows/test.yml[304-336]
.github/workflows/test.yml[395-420]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The `Setup NVHPC` step uses a heredoc to generate `/etc/nvhpc-env.sh`, but the closing `EOF` delimiter is indented. In bash, the heredoc terminator must match exactly at the start of the line (unless using `<<-` with tabs), so the heredoc won’t close and the remainder of the script is treated as heredoc content.

### Issue Context
This breaks later steps that run `docker exec ... source /etc/nvhpc-env.sh`.

### Fix Focus Areas
- .github/workflows/test.yml[307-336]

### Suggested fix
Inside the `bash -c ' ... '` script, left-align the heredoc delimiter (and ideally the heredoc body) so the terminator is at column 1, e.g.:

```sh
cat > /etc/nvhpc-env.sh <<'EOF'
export LD_LIBRARY_PATH=...
export OMPI_MCA_rmaps_base_oversubscribe=1
EOF
```

Alternatively, avoid heredocs entirely and use `printf` to write the file.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

2. NVHPC tag still cuda_multi 🐞 Bug ≡ Correctness

Description

NVHPC_IMAGE is still constructed with the cuda_multi tag, so NVHPC jobs will continue pulling
the multi-CUDA images rather than the single-CUDA tags described in the PR. This undermines the PR’s
disk-usage reduction intent and keeps the workflow dependent on large-image disk cleanup behavior.

Code

.github/workflows/test.yml[R217-218]

+      # Image tag for NVHPC jobs; empty for non-NVHPC jobs.
+      NVHPC_IMAGE: ${{ matrix.nvhpc && format('nvcr.io/nvidia/nvhpc:{0}-devel-cuda_multi-ubuntu22.04', matrix.nvhpc) || '' }}

Evidence

The workflow explicitly builds NVHPC_IMAGE using ...-devel-cuda_multi-ubuntu22.04, and the
workflow comment/steps are still designed around pulling a ~25–30GB cuda_multi image. This
conflicts with the PR description stating NVHPC CI containers are being switched from cuda_multi
to single-CUDA tags (e.g. cuda12.6).

.github/workflows/test.yml[216-234]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The workflow still pulls `nvcr.io/nvidia/nvhpc:* -devel-cuda_multi-ubuntu22.04`, so it does not implement the PR’s stated switch to single-CUDA tags.

### Issue Context
PR description explicitly states moving off `cuda_multi` to single-CUDA tags to reduce runner disk usage.

### Fix Focus Areas
- .github/workflows/test.yml[216-218]
- .github/workflows/test.yml[277-303]

### Suggested fix
Add an explicit CUDA tag to the NVHPC matrix (e.g. `cuda: '12.6'`) and construct the image from that field:

- Extend each NVHPC `matrix.include` entry with a `cuda` value.
- Change `NVHPC_IMAGE` to:
 `nvcr.io/nvidia/nvhpc:${{ matrix.nvhpc }}-devel-cuda${{ matrix.cuda }}-ubuntu22.04`

This makes the intended tag change explicit and prevents accidentally continuing to pull `cuda_multi`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

ⓘ The new review experience is currently in Beta. Learn more

coderabbitai · 2026-04-02T18:01:12Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0bb23b70-f853-4264-bd35-339b98637c66

📥 Commits

Reviewing files that changed from the base of the PR and between 0dea6cc and 4aaa559.

📒 Files selected for processing (2)

.github/workflows/test.yml
CMakeLists.txt

📝 Walkthrough

Walkthrough

This pull request refactors the NVHPC test workflow setup in .github/workflows/test.yml. Instead of using the job-level container: directive, the workflow now explicitly manages an NVHPC Docker container by computing an image variable, performing disk cleanup, pulling the image, and running a long-lived container with docker run -d. Build dependencies are installed in a dedicated setup step, and NVHPC commands are executed via docker exec. A cleanup step removes the container after tests complete. Additionally, a minor comment edit is made in CMakeLists.txt.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

qodo-code-review · 2026-04-02T18:02:40Z

.github/workflows/test.yml

+            cat > /etc/nvhpc-env.sh <<EOF
+          export LD_LIBRARY_PATH=${MPI_LIB}:${HPCX_DIR}/ucx/lib:${HPCX_DIR}/ucc/lib:\$LD_LIBRARY_PATH
+          export OMPI_MCA_rmaps_base_oversubscribe=1
+          EOF


1. Broken heredoc terminator 🐞 Bug ≡ Correctness

The NVHPC setup step writes /etc/nvhpc-env.sh with a heredoc whose closing EOF is indented, so bash never recognizes the terminator and consumes the remainder of the script as heredoc content. This prevents the environment setup from completing correctly and breaks later NVHPC steps that source /etc/nvhpc-env.sh.

Agent Prompt

### Issue description The `Setup NVHPC` step uses a heredoc to generate `/etc/nvhpc-env.sh`, but the closing `EOF` delimiter is indented. In bash, the heredoc terminator must match exactly at the start of the line (unless using `<<-` with tabs), so the heredoc won’t close and the remainder of the script is treated as heredoc content. ### Issue Context This breaks later steps that run `docker exec ... source /etc/nvhpc-env.sh`. ### Fix Focus Areas - .github/workflows/test.yml[307-336] ### Suggested fix Inside the `bash -c ' ... '` script, left-align the heredoc delimiter (and ideally the heredoc body) so the terminator is at column 1, e.g.: ```sh cat > /etc/nvhpc-env.sh <<'EOF' export LD_LIBRARY_PATH=... export OMPI_MCA_rmaps_base_oversubscribe=1 EOF ``` Alternatively, avoid heredocs entirely and use `printf` to write the file.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

qodo-code-review · 2026-04-02T18:02:40Z

.github/workflows/test.yml

+      # Image tag for NVHPC jobs; empty for non-NVHPC jobs.
+      NVHPC_IMAGE: ${{ matrix.nvhpc && format('nvcr.io/nvidia/nvhpc:{0}-devel-cuda_multi-ubuntu22.04', matrix.nvhpc) || '' }}


2. Nvhpc tag still cuda_multi 🐞 Bug ≡ Correctness

NVHPC_IMAGE is still constructed with the cuda_multi tag, so NVHPC jobs will continue pulling the multi-CUDA images rather than the single-CUDA tags described in the PR. This undermines the PR’s disk-usage reduction intent and keeps the workflow dependent on large-image disk cleanup behavior.

Agent Prompt

### Issue description The workflow still pulls `nvcr.io/nvidia/nvhpc:* -devel-cuda_multi-ubuntu22.04`, so it does not implement the PR’s stated switch to single-CUDA tags. ### Issue Context PR description explicitly states moving off `cuda_multi` to single-CUDA tags to reduce runner disk usage. ### Fix Focus Areas - .github/workflows/test.yml[216-218] - .github/workflows/test.yml[277-303] ### Suggested fix Add an explicit CUDA tag to the NVHPC matrix (e.g. `cuda: '12.6'`) and construct the image from that field: - Extend each NVHPC `matrix.include` entry with a `cuda` value. - Change `NVHPC_IMAGE` to: `nvcr.io/nvidia/nvhpc:${{ matrix.nvhpc }}-devel-cuda${{ matrix.cuda }}-ubuntu22.04` This makes the intended tag change explicit and prevents accidentally continuing to pull `cuda_multi`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Port fix from MFlowCode/MFC#1350: the ~25-30GB NVHPC cuda_multi Docker images exceed GitHub runner disk space when pulled via the container: directive (which runs before any steps). Fix: remove container: directive, add explicit steps to free disk space (remove dotnet/android/ghc/boost/chromium), then docker pull + docker run -d + docker exec for NVHPC jobs. Non-NVHPC jobs are unchanged. Also fix: fetch main instead of master for coverage diff.

codecov · 2026-04-02T20:06:45Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.67%. Comparing base (0dea6cc) to head (4aaa559).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1350   +/-   ##
=======================================
  Coverage   64.67%   64.67%           
=======================================
  Files          70       70           
  Lines       18251    18251           
  Branches     1504     1504           
=======================================
  Hits        11804    11804           
  Misses       5492     5492           
  Partials      955      955

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

sbryngelson and others added 4 commits April 1, 2026 18:28

ci: use single-CUDA Docker tags for NVHPC to fit runner disk

26572fb

Fix comment formatting in CMakeLists.txt

d6339ae

Refactor NVHPC setup in GitHub Actions workflow

3b651b1

Updated NVHPC configuration to simplify CUDA handling and improve disk management during CI runs.

Merge branch 'master' into small-docker

fb18e49

sbryngelson added 2 commits April 2, 2026 08:12

Merge branch 'master' into small-docker

4aaa559

sbryngelson marked this pull request as ready for review April 2, 2026 17:58

sbryngelson merged commit 2d15ba9 into MFlowCode:master Apr 2, 2026
66 of 67 checks passed

sbryngelson deleted the small-docker branch April 2, 2026 17:58

qodo-code-review bot reviewed Apr 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: use single-CUDA NVHPC Docker images to reduce runner disk usage#1350

ci: use single-CUDA NVHPC Docker images to reduce runner disk usage#1350
sbryngelson merged 6 commits intoMFlowCode:masterfrom
sbryngelson:small-docker

sbryngelson commented Apr 1, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Apr 2, 2026 •

edited

Loading

Uh oh!

Uh oh!

qodo-code-review bot commented Apr 2, 2026

Uh oh!

qodo-code-review bot commented Apr 2, 2026 •

edited

Loading

Uh oh!

coderabbitai bot commented Apr 2, 2026

Review failed

Walkthrough

Uh oh!

qodo-code-review bot Apr 2, 2026

Uh oh!

qodo-code-review bot Apr 2, 2026

Uh oh!

codecov bot commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant

		# Image tag for NVHPC jobs; empty for non-NVHPC jobs.
		NVHPC_IMAGE: ${{ matrix.nvhpc && format('nvcr.io/nvidia/nvhpc:{0}-devel-cuda_multi-ubuntu22.04', matrix.nvhpc) \|\| '' }}

Conversation

sbryngelson commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

github-actions bot commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

qodo-code-review bot commented Apr 2, 2026

Review Summary by Qodo

Walkthroughs

File Changes

Uh oh!

qodo-code-review bot commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review by Qodo

Uh oh!

coderabbitai bot commented Apr 2, 2026

Review failed

Walkthrough

Uh oh!

qodo-code-review bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

qodo-code-review bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Apr 2, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant

sbryngelson commented Apr 1, 2026 •

edited

Loading

github-actions bot commented Apr 2, 2026 •

edited

Loading

qodo-code-review bot commented Apr 2, 2026 •

edited

Loading