Skip to content

ci: use single-CUDA NVHPC Docker images to reduce runner disk usage#1350

Merged
sbryngelson merged 6 commits intoMFlowCode:masterfrom
sbryngelson:small-docker
Apr 2, 2026
Merged

ci: use single-CUDA NVHPC Docker images to reduce runner disk usage#1350
sbryngelson merged 6 commits intoMFlowCode:masterfrom
sbryngelson:small-docker

Conversation

@sbryngelson
Copy link
Copy Markdown
Member

@sbryngelson sbryngelson commented Apr 1, 2026

Summary

  • Switch NVHPC CI containers from cuda_multi to single-CUDA tags (e.g. cuda12.6)
  • The cuda_multi images bundle every CUDA toolkit version, bloating the image and often exceeding GitHub runner disk limits
  • Each NVHPC version now pins to its highest bundled CUDA 12.x release
  • No change to build/test logic — only the container image tag changes

Test plan

  • All NVHPC matrix jobs pull the correct single-CUDA image
  • CPU build+test and GPU build-only pass across all NVHPC versions

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 2, 2026

Claude Code Review

Head SHA: 4aaa559

Files changed:

  • 2
  • .github/workflows/test.yml
  • CMakeLists.txt

Findings:

  • Lint failure: === echo separators in .github/ shell file — Three new echo statements added to test.yml use === separators and exceed 20 characters, violating the shell lint rules enforced by ./mfc.sh precheck for all files under .github/:
    echo "=== Disk before cleanup ==="   # 25 chars, uses ===
    echo "=== Disk after cleanup ==="    # 23 chars, uses ===
    echo "=== NVHPC Environment ==="     # 22 chars, uses ===
    The lint rule is: === separator patterns must be removed; echo separators must be ≤ 20 characters. These are in the "Free disk space" and "Setup NVHPC" steps. Replace with short labels such as echo "Disk before:" / echo "NVHPC env:" to stay under the limit and avoid ===.

The repo is bind-mounted into the container from the host runner, so
git sees a different owner and emits 'fatal: detected dubious ownership'
on every git command. This causes the test runner to dump the git diff
help text 546 times (once per test), bloating the CI log to 80,000+ lines.

Fix: git config --global --add safe.directory /workspace in Setup NVHPC.
@sbryngelson sbryngelson marked this pull request as ready for review April 2, 2026 17:58
@sbryngelson sbryngelson merged commit 2d15ba9 into MFlowCode:master Apr 2, 2026
66 of 67 checks passed
@sbryngelson sbryngelson deleted the small-docker branch April 2, 2026 17:58
@qodo-code-review
Copy link
Copy Markdown
Contributor

Review Summary by Qodo

Optimize NVHPC CI with single-CUDA images and explicit disk management

✨ Enhancement 🐞 Bug fix

Grey Divider

Walkthroughs

Description
• Replace NVHPC cuda_multi Docker images with single-CUDA tags to reduce runner disk usage
• Implement explicit disk cleanup before pulling large container images
• Refactor NVHPC container setup to use long-lived docker run with docker exec for better
  control
• Fix git "dubious ownership" errors by configuring safe directory in container
• Minor comment formatting fix in CMakeLists.txt
Diagram
flowchart LR
  A["NVHPC CI Setup"] -->|"Replace cuda_multi"| B["Single-CUDA Images"]
  A -->|"Add disk cleanup"| C["Free Disk Space"]
  C -->|"Then pull"| B
  B -->|"Long-lived container"| D["docker run sleep"]
  D -->|"Execute steps via"| E["docker exec"]
  E -->|"Source env vars"| F["Build & Test"]
  A -->|"Fix git config"| G["safe.directory"]
  G -->|"Suppress errors"| F
Loading

Grey Divider

File Changes

1. .github/workflows/test.yml ✨ Enhancement +100/-47

Refactor NVHPC container management and disk handling

• Replaced inline container: directive with explicit docker pull and docker run -d commands to
 enable disk cleanup before pulling large images
• Added "Free disk space" step that removes unnecessary packages and prunes Docker images to prevent
 runner disk exhaustion
• Moved NVHPC environment variables from GitHub Actions env block to docker run -e flags for
 better container isolation
• Refactored NVHPC setup into a dedicated step using docker exec that installs dependencies,
 configures git safe directory, and sets up MPI runtime paths
• Wrapped all NVHPC build and test commands with docker exec nvhpc bash -c to run inside the
 long-lived container
• Added cleanup step to stop and remove the NVHPC container after job completion
• Updated comments to use consistent formatting with ── separators for better readability

.github/workflows/test.yml


2. CMakeLists.txt Formatting +1/-1

Fix comment formatting

• Fixed comment formatting by removing trailing dash and extra space for consistency

CMakeLists.txt


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown
Contributor

qodo-code-review bot commented Apr 2, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0)

Grey Divider


Action required

1. Broken heredoc terminator 🐞 Bug ≡ Correctness
Description
The NVHPC setup step writes /etc/nvhpc-env.sh with a heredoc whose closing EOF is indented, so
bash never recognizes the terminator and consumes the remainder of the script as heredoc content.
This prevents the environment setup from completing correctly and breaks later NVHPC steps that
source /etc/nvhpc-env.sh.
Code

.github/workflows/test.yml[R323-326]

+            cat > /etc/nvhpc-env.sh <<EOF
+          export LD_LIBRARY_PATH=${MPI_LIB}:${HPCX_DIR}/ucx/lib:${HPCX_DIR}/ucc/lib:\$LD_LIBRARY_PATH
+          export OMPI_MCA_rmaps_base_oversubscribe=1
+          EOF
Evidence
In the NVHPC setup script, cat > /etc/nvhpc-env.sh <<EOF starts a heredoc, but the delimiter EOF
is not at the start of the line (it’s indented), so bash won’t terminate the heredoc normally;
subsequent lines are treated as file content instead of commands. Later steps explicitly `source
/etc/nvhpc-env.sh`, so a malformed/missing file will cause NVHPC build/test to fail.

.github/workflows/test.yml[304-336]
.github/workflows/test.yml[395-420]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The `Setup NVHPC` step uses a heredoc to generate `/etc/nvhpc-env.sh`, but the closing `EOF` delimiter is indented. In bash, the heredoc terminator must match exactly at the start of the line (unless using `<<-` with tabs), so the heredoc won’t close and the remainder of the script is treated as heredoc content.

### Issue Context
This breaks later steps that run `docker exec ... source /etc/nvhpc-env.sh`.

### Fix Focus Areas
- .github/workflows/test.yml[307-336]

### Suggested fix
Inside the `bash -c ' ... '` script, left-align the heredoc delimiter (and ideally the heredoc body) so the terminator is at column 1, e.g.:

```sh
cat > /etc/nvhpc-env.sh <<'EOF'
export LD_LIBRARY_PATH=...
export OMPI_MCA_rmaps_base_oversubscribe=1
EOF
```

Alternatively, avoid heredocs entirely and use `printf` to write the file.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. NVHPC tag still cuda_multi 🐞 Bug ≡ Correctness
Description
NVHPC_IMAGE is still constructed with the cuda_multi tag, so NVHPC jobs will continue pulling
the multi-CUDA images rather than the single-CUDA tags described in the PR. This undermines the PR’s
disk-usage reduction intent and keeps the workflow dependent on large-image disk cleanup behavior.
Code

.github/workflows/test.yml[R217-218]

+      # Image tag for NVHPC jobs; empty for non-NVHPC jobs.
+      NVHPC_IMAGE: ${{ matrix.nvhpc && format('nvcr.io/nvidia/nvhpc:{0}-devel-cuda_multi-ubuntu22.04', matrix.nvhpc) || '' }}
Evidence
The workflow explicitly builds NVHPC_IMAGE using ...-devel-cuda_multi-ubuntu22.04, and the
workflow comment/steps are still designed around pulling a ~25–30GB cuda_multi image. This
conflicts with the PR description stating NVHPC CI containers are being switched from cuda_multi
to single-CUDA tags (e.g. cuda12.6).

.github/workflows/test.yml[216-234]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The workflow still pulls `nvcr.io/nvidia/nvhpc:* -devel-cuda_multi-ubuntu22.04`, so it does not implement the PR’s stated switch to single-CUDA tags.

### Issue Context
PR description explicitly states moving off `cuda_multi` to single-CUDA tags to reduce runner disk usage.

### Fix Focus Areas
- .github/workflows/test.yml[216-218]
- .github/workflows/test.yml[277-303]

### Suggested fix
Add an explicit CUDA tag to the NVHPC matrix (e.g. `cuda: '12.6'`) and construct the image from that field:

- Extend each NVHPC `matrix.include` entry with a `cuda` value.
- Change `NVHPC_IMAGE` to:
 `nvcr.io/nvidia/nvhpc:${{ matrix.nvhpc }}-devel-cuda${{ matrix.cuda }}-ubuntu22.04`

This makes the intended tag change explicit and prevents accidentally continuing to pull `cuda_multi`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 2, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0bb23b70-f853-4264-bd35-339b98637c66

📥 Commits

Reviewing files that changed from the base of the PR and between 0dea6cc and 4aaa559.

📒 Files selected for processing (2)
  • .github/workflows/test.yml
  • CMakeLists.txt

📝 Walkthrough

Walkthrough

This pull request refactors the NVHPC test workflow setup in .github/workflows/test.yml. Instead of using the job-level container: directive, the workflow now explicitly manages an NVHPC Docker container by computing an image variable, performing disk cleanup, pulling the image, and running a long-lived container with docker run -d. Build dependencies are installed in a dedicated setup step, and NVHPC commands are executed via docker exec. A cleanup step removes the container after tests complete. Additionally, a minor comment edit is made in CMakeLists.txt.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment on lines +323 to +326
cat > /etc/nvhpc-env.sh <<EOF
export LD_LIBRARY_PATH=${MPI_LIB}:${HPCX_DIR}/ucx/lib:${HPCX_DIR}/ucc/lib:\$LD_LIBRARY_PATH
export OMPI_MCA_rmaps_base_oversubscribe=1
EOF
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Broken heredoc terminator 🐞 Bug ≡ Correctness

The NVHPC setup step writes /etc/nvhpc-env.sh with a heredoc whose closing EOF is indented, so
bash never recognizes the terminator and consumes the remainder of the script as heredoc content.
This prevents the environment setup from completing correctly and breaks later NVHPC steps that
source /etc/nvhpc-env.sh.
Agent Prompt
### Issue description
The `Setup NVHPC` step uses a heredoc to generate `/etc/nvhpc-env.sh`, but the closing `EOF` delimiter is indented. In bash, the heredoc terminator must match exactly at the start of the line (unless using `<<-` with tabs), so the heredoc won’t close and the remainder of the script is treated as heredoc content.

### Issue Context
This breaks later steps that run `docker exec ... source /etc/nvhpc-env.sh`.

### Fix Focus Areas
- .github/workflows/test.yml[307-336]

### Suggested fix
Inside the `bash -c ' ... '` script, left-align the heredoc delimiter (and ideally the heredoc body) so the terminator is at column 1, e.g.:

```sh
cat > /etc/nvhpc-env.sh <<'EOF'
export LD_LIBRARY_PATH=...
export OMPI_MCA_rmaps_base_oversubscribe=1
EOF
```

Alternatively, avoid heredocs entirely and use `printf` to write the file.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Comment on lines +217 to +218
# Image tag for NVHPC jobs; empty for non-NVHPC jobs.
NVHPC_IMAGE: ${{ matrix.nvhpc && format('nvcr.io/nvidia/nvhpc:{0}-devel-cuda_multi-ubuntu22.04', matrix.nvhpc) || '' }}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

2. Nvhpc tag still cuda_multi 🐞 Bug ≡ Correctness

NVHPC_IMAGE is still constructed with the cuda_multi tag, so NVHPC jobs will continue pulling
the multi-CUDA images rather than the single-CUDA tags described in the PR. This undermines the PR’s
disk-usage reduction intent and keeps the workflow dependent on large-image disk cleanup behavior.
Agent Prompt
### Issue description
The workflow still pulls `nvcr.io/nvidia/nvhpc:* -devel-cuda_multi-ubuntu22.04`, so it does not implement the PR’s stated switch to single-CUDA tags.

### Issue Context
PR description explicitly states moving off `cuda_multi` to single-CUDA tags to reduce runner disk usage.

### Fix Focus Areas
- .github/workflows/test.yml[216-218]
- .github/workflows/test.yml[277-303]

### Suggested fix
Add an explicit CUDA tag to the NVHPC matrix (e.g. `cuda: '12.6'`) and construct the image from that field:

- Extend each NVHPC `matrix.include` entry with a `cuda` value.
- Change `NVHPC_IMAGE` to:
  `nvcr.io/nvidia/nvhpc:${{ matrix.nvhpc }}-devel-cuda${{ matrix.cuda }}-ubuntu22.04`

This makes the intended tag change explicit and prevents accidentally continuing to pull `cuda_multi`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

sbryngelson added a commit to sbryngelson/figr that referenced this pull request Apr 2, 2026
Port fix from MFlowCode/MFC#1350: the ~25-30GB NVHPC cuda_multi Docker
images exceed GitHub runner disk space when pulled via the container:
directive (which runs before any steps).

Fix: remove container: directive, add explicit steps to free disk space
(remove dotnet/android/ghc/boost/chromium), then docker pull + docker
run -d + docker exec for NVHPC jobs. Non-NVHPC jobs are unchanged.

Also fix: fetch main instead of master for coverage diff.
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.67%. Comparing base (0dea6cc) to head (4aaa559).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1350   +/-   ##
=======================================
  Coverage   64.67%   64.67%           
=======================================
  Files          70       70           
  Lines       18251    18251           
  Branches     1504     1504           
=======================================
  Hits        11804    11804           
  Misses       5492     5492           
  Partials      955      955           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant