ci: use single-CUDA NVHPC Docker images to reduce runner disk usage #1350

sbryngelson merged 6 commits into MFlowCode:master
Conversation
Updated NVHPC configuration to simplify CUDA handling and improve disk management during CI runs.
Claude Code Review (Head SHA: 4aaa559)

Findings:
The repo is bind-mounted into the container from the host runner, so git sees a different owner and emits `fatal: detected dubious ownership` on every git command. This causes the test runner to dump the git diff help text 546 times (once per test), bloating the CI log to 80,000+ lines. Fix: run `git config --global --add safe.directory /workspace` in the Setup NVHPC step.
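A minimal sketch of the suggested fix, runnable inside the container; `/workspace` is the bind-mount path cited in the finding above:

```shell
# `git config --global` needs a writable HOME inside the container.
export HOME=${HOME:-/root}

# Mark the bind-mounted checkout as safe so git stops refusing to run
# when the directory owner differs from the current user.
git config --global --add safe.directory /workspace

# Show the recorded entry to confirm it took effect.
git config --global --get-all safe.directory
```

After this, every subsequent `git` invocation in the container (including the 546 per-test diff calls) runs silently instead of erroring.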
Review Summary by Qodo: Optimize NVHPC CI with single-CUDA images and explicit disk management

Description:

- Replace NVHPC `cuda_multi` Docker images with single-CUDA tags to reduce runner disk usage
- Implement explicit disk cleanup before pulling large container images
- Refactor NVHPC container setup to use a long-lived `docker run` with `docker exec` for better control
- Fix git "dubious ownership" errors by configuring a safe directory in the container
- Minor comment formatting fix in CMakeLists.txt

Diagram:

```mermaid
flowchart LR
  A["NVHPC CI Setup"] -->|"Replace cuda_multi"| B["Single-CUDA Images"]
  A -->|"Add disk cleanup"| C["Free Disk Space"]
  C -->|"Then pull"| B
  B -->|"Long-lived container"| D["docker run sleep"]
  D -->|"Execute steps via"| E["docker exec"]
  E -->|"Source env vars"| F["Build & Test"]
  A -->|"Fix git config"| G["safe.directory"]
  G -->|"Suppress errors"| F
```
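The long-lived-container pattern in the diagram could be sketched as workflow steps roughly like this (step names, image tag, and the `./mfc.sh` invocation are illustrative assumptions, not the PR's actual diff):

```yaml
- name: Setup NVHPC
  run: |
    docker pull "$NVHPC_IMAGE"
    # Idle container that survives across steps; later steps attach to it.
    docker run -d --name nvhpc -v "$PWD:/workspace" -w /workspace \
      "$NVHPC_IMAGE" sleep infinity
- name: Build
  run: docker exec nvhpc bash -c 'source /etc/nvhpc-env.sh && ./mfc.sh build'
- name: Cleanup
  if: always()
  run: docker rm -f nvhpc
```

The advantage over the `container:` directive is timing: the image pull happens as an ordinary step, so the disk-cleanup step can run before it.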
File Changes:

1. .github/workflows/test.yml
Code Review by Qodo
1. Broken heredoc terminator
```sh
cat > /etc/nvhpc-env.sh <<EOF
export LD_LIBRARY_PATH=${MPI_LIB}:${HPCX_DIR}/ucx/lib:${HPCX_DIR}/ucc/lib:\$LD_LIBRARY_PATH
export OMPI_MCA_rmaps_base_oversubscribe=1
    EOF
```
1. Broken heredoc terminator 🐞 Bug ≡ Correctness
The NVHPC setup step writes /etc/nvhpc-env.sh with a heredoc whose closing EOF is indented, so bash never recognizes the terminator and consumes the remainder of the script as heredoc content. This prevents the environment setup from completing correctly and breaks later NVHPC steps that source /etc/nvhpc-env.sh.
Agent Prompt
### Issue description
The `Setup NVHPC` step uses a heredoc to generate `/etc/nvhpc-env.sh`, but the closing `EOF` delimiter is indented. In bash, the heredoc terminator must match exactly at the start of the line (unless using `<<-` with tabs), so the heredoc won’t close and the remainder of the script is treated as heredoc content.
### Issue Context
This breaks later steps that run `docker exec ... source /etc/nvhpc-env.sh`.
### Fix Focus Areas
- .github/workflows/test.yml[307-336]
### Suggested fix
Inside the `bash -c ' ... '` script, left-align the heredoc delimiter (and ideally the heredoc body) so the terminator is at column 1, e.g.:
```sh
cat > /etc/nvhpc-env.sh <<'EOF'
export LD_LIBRARY_PATH=...
export OMPI_MCA_rmaps_base_oversubscribe=1
EOF
```
Alternatively, avoid heredocs entirely and use `printf` to write the file.
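A sketch of that `printf` alternative (the workflow writes `/etc/nvhpc-env.sh`; this dry run defaults to a local file so it can be exercised outside the container):

```shell
# Generate the env file with printf so indentation can never swallow a
# heredoc terminator. ENV_FILE defaults to a local path for a dry run;
# the real workflow would target /etc/nvhpc-env.sh.
ENV_FILE=${ENV_FILE:-./nvhpc-env.sh}
printf '%s\n' \
  'export LD_LIBRARY_PATH=${MPI_LIB}:${HPCX_DIR}/ucx/lib:${HPCX_DIR}/ucc/lib:$LD_LIBRARY_PATH' \
  'export OMPI_MCA_rmaps_base_oversubscribe=1' > "$ENV_FILE"

# Inspect the generated file.
cat "$ENV_FILE"
```

Because each line is a single-quoted `printf` argument, the `${MPI_LIB}` and `${HPCX_DIR}` references are written literally and expand only when the file is later sourced.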
```yaml
# Image tag for NVHPC jobs; empty for non-NVHPC jobs.
NVHPC_IMAGE: ${{ matrix.nvhpc && format('nvcr.io/nvidia/nvhpc:{0}-devel-cuda_multi-ubuntu22.04', matrix.nvhpc) || '' }}
```
2. NVHPC tag still cuda_multi 🐞 Bug ≡ Correctness
NVHPC_IMAGE is still constructed with the cuda_multi tag, so NVHPC jobs will continue pulling the multi-CUDA images rather than the single-CUDA tags described in the PR. This undermines the PR’s disk-usage reduction intent and keeps the workflow dependent on large-image disk cleanup behavior.
Agent Prompt
### Issue description
The workflow still pulls `nvcr.io/nvidia/nvhpc:*-devel-cuda_multi-ubuntu22.04`, so it does not implement the PR’s stated switch to single-CUDA tags.
### Issue Context
PR description explicitly states moving off `cuda_multi` to single-CUDA tags to reduce runner disk usage.
### Fix Focus Areas
- .github/workflows/test.yml[216-218]
- .github/workflows/test.yml[277-303]
### Suggested fix
Add an explicit CUDA tag to the NVHPC matrix (e.g. `cuda: '12.6'`) and construct the image from that field:
- Extend each NVHPC `matrix.include` entry with a `cuda` value.
- Change `NVHPC_IMAGE` to:
`nvcr.io/nvidia/nvhpc:${{ matrix.nvhpc }}-devel-cuda${{ matrix.cuda }}-ubuntu22.04`
This makes the intended tag change explicit and prevents accidentally continuing to pull `cuda_multi`.
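A sketch of that suggestion (the NVHPC and CUDA version numbers are illustrative; the real matrix entries live in `.github/workflows/test.yml`):

```yaml
strategy:
  matrix:
    include:
      - nvhpc: '25.1'   # illustrative NVHPC release
        cuda: '12.6'    # new explicit CUDA tag per entry
env:
  # Single-CUDA image instead of cuda_multi:
  NVHPC_IMAGE: ${{ matrix.nvhpc && format('nvcr.io/nvidia/nvhpc:{0}-devel-cuda{1}-ubuntu22.04', matrix.nvhpc, matrix.cuda) || '' }}
```

Keeping the `matrix.nvhpc && ... || ''` guard preserves the existing behavior of an empty `NVHPC_IMAGE` for non-NVHPC jobs.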
Port fix from MFlowCode/MFC#1350: the ~25-30 GB NVHPC cuda_multi Docker images exceed GitHub runner disk space when pulled via the `container:` directive (which runs before any steps). Fix: remove the `container:` directive, add explicit steps to free disk space (remove dotnet/android/ghc/boost/chromium), then `docker pull` + `docker run -d` + `docker exec` for NVHPC jobs. Non-NVHPC jobs are unchanged. Also fix: fetch main instead of master for the coverage diff.
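The disk-freeing step could look roughly like this; the directory list is an assumption based on what GitHub's ubuntu runners typically preinstall, and the `ROOT` indirection exists only so the sketch can be dry-run safely (on a real runner you would set `ROOT=/` and likely prefix the removals with `sudo`):

```shell
# ROOT defaults to a throwaway directory for a safe dry run;
# set ROOT=/ on an actual GitHub runner.
ROOT=${ROOT:-$(mktemp -d)}

# Stand-in for one of the preinstalled toolchains being removed.
mkdir -p "$ROOT/usr/share/dotnet"

# Remove large preinstalled toolchains before pulling the NVHPC image.
rm -rf "$ROOT/usr/share/dotnet" "$ROOT/usr/local/lib/android" \
       "$ROOT/opt/ghc" "$ROOT/usr/local/share/boost"

# Confirm the space was reclaimed.
[ ! -d "$ROOT/usr/share/dotnet" ] && echo "cleanup ok"
```

Running this as a step (rather than relying on `container:`) is what makes room for the pull that follows.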
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##           master    #1350   +/-   ##
=======================================
  Coverage   64.67%   64.67%
=======================================
  Files          70       70
  Lines       18251    18251
  Branches     1504     1504
=======================================
  Hits        11804    11804
  Misses       5492     5492
  Partials      955      955
```
Summary

- Switch NVHPC Docker images from `cuda_multi` to single-CUDA tags (e.g. `cuda12.6`)
- `cuda_multi` images bundle every CUDA toolkit version, bloating the image and often exceeding GitHub runner disk limits

Test plan