Skip activation kernels when tensor size is zero #2848
timmoon10 wants to merge 7 commits into NVIDIA:main
Conversation
Signed-off-by: Tim Moon <tmoon@nvidia.com>
/te-ci
Greptile Summary

This PR fixes obscure CUDA errors triggered by calling SwiGLU and related quantization kernels on zero-element tensors by inserting early-return guards before kernel launches in 11 files.

Confidence Score: 5/5

Safe to merge — all remaining findings are minor dead-code style issues that do not affect runtime behavior. The zero-size guards are consistently and correctly placed before alignment checks, TMA tensor-map creation, and kernel launches across all affected files. The IS_DBIAS dead-code branch in quantize_mxfp8.cuh and group_quantize_mxfp8.cuh is a cosmetic issue only; the actual error behavior for empty IS_DBIAS tensors is correct. Test coverage for empty tensors has been added to all relevant test suites. transformer_engine/common/cast/mxfp8/quantize_mxfp8.cuh and group_quantize_mxfp8.cuh contain a dead NVTE_ERROR branch (P2 style).

Important Files Changed
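The guard pattern the summary describes can be illustrated with a minimal host-side sketch. This is a hypothetical simplification, not the actual TransformerEngine code; `quantize` and `launch_count` are illustrative stand-ins for a dispatch function and the real CUDA kernel launch.

```cpp
#include <cstddef>

// Hypothetical sketch of the early-return guard this PR adds.
// launch_count stands in for whether the real CUDA kernel would launch.
static int launch_count = 0;

void quantize(std::size_t rows, std::size_t cols) {
  // Skip kernel if tensor size is zero: launching a grid for zero
  // elements is what triggered the obscure CUDA errors.
  if (rows == 0 || cols == 0) {
    return;
  }
  ++launch_count;  // placeholder for the kernel launch
}
```

The key design point is that the guard sits before any alignment checks or TMA tensor-map creation, so no device-side work is attempted for an empty tensor.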
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Kernel dispatch entry] --> B{IS_DBIAS?}
    B -- yes --> C[Validate dbias dtype/shape/workspace]
    C --> D{dbias_rows > 0 AND dbias_cols > 0?}
    D -- no --> E[NVTE_CHECK error: Invalid workspace shape]
    D -- yes --> F{workspace dptr == nullptr? workspace-size query}
    F -- yes --> G[Set workspace shape & return]
    F -- no --> H{rows == 0 OR cols == 0?}
    B -- no --> H
    H -- yes --> I[Return early: skip kernel launch]
    H -- no --> J[Proceed with CUDA kernel launch]
    style E fill:#f66,color:#fff
    style I fill:#6a6,color:#fff
    style J fill:#46a,color:#fff
```
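The dispatch flow in the chart above can be sketched in plain C++. All names here (`Workspace`, `dispatch`) are hypothetical simplifications of the real templated dispatch code; the sketch only models the ordering of the workspace-size query relative to the zero-size guard.

```cpp
#include <cstddef>

// Hypothetical workspace handle: a null dptr means the caller is
// querying the required workspace size rather than running the kernel.
struct Workspace {
  void* dptr = nullptr;
  std::size_t rows = 0, cols = 0;
};

// Returns true iff the kernel would actually be launched.
bool dispatch(bool is_dbias, std::size_t rows, std::size_t cols, Workspace& ws) {
  if (is_dbias && ws.dptr == nullptr) {
    // Workspace-size query: report the required shape and return.
    ws.rows = rows;
    ws.cols = cols;
    return false;
  }
  if (rows == 0 || cols == 0) {
    return false;  // zero-size tensor: skip the kernel launch entirely
  }
  return true;  // proceed with the CUDA kernel launch
}
```

Note that the workspace-size query path is handled before the zero-size check, matching the ordering shown in the flowchart.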
Reviews (4): Last reviewed commit: "Merge branch 'main' into tmoon/activatio..."
```diff
  auto err = cudaGetLastError();
  ASSERT_EQ(err, cudaSuccess) << cudaGetErrorString(err);
- if (isFp8Type(otype)) {
+ if (isFp8Type(otype) && full_size > 0) {
```
So this problem shows up not only for activations, but also for a regular cast?
Yep, many of our kernels are not robust to empty tensors. I still expect to see problems in the FP8 block-scale quantization kernels and transpose kernels.
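Empty tensors are easy to hit in practice because a single zero dimension makes the whole tensor empty. A small illustration of the size computation the tests guard on (the helper name `full_size` mirrors the variable in the test diff above, but this function itself is hypothetical):

```cpp
#include <cstddef>
#include <vector>

// Total element count of a shape: any zero dimension yields an
// empty tensor, which must not trigger a kernel launch.
std::size_t full_size(const std::vector<std::size_t>& shape) {
  std::size_t n = 1;
  for (std::size_t d : shape) n *= d;
  return n;
}
```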
```diff
+ // Skip kernel if tensor size is zero
+ if (elts_total == 0) {
+   if constexpr (IS_DBIAS) {
+     NVTE_ERROR("Invalid grouped tensor shape for DBias computation (first_logical_dim=",
```
In this case, couldn't we output dbias as a zero tensor instead of throwing an error?
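The reviewer's alternative is mathematically well-defined: for an empty input, the dbias column sum reduces to all zeros. A host-side sketch of that behavior (the function name is hypothetical; the real fix would zero the device buffer, e.g. with cudaMemsetAsync):

```cpp
#include <algorithm>
#include <vector>

// Sketch of the suggested alternative: for an empty input tensor,
// write an all-zero dbias output instead of raising an error.
void zero_dbias(std::vector<float>& dbias) {
  std::fill(dbias.begin(), dbias.end(), 0.0f);
}
```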
Description
We have encountered some obscure CUDA errors when calling SwiGLU on a tensor with no entries. This PR handles that case by skipping the kernel launch when the tensor is empty. Since the kernel implementation is heavily templated, the change also fixes the bug for some quantization cases, although I haven't handled them exhaustively.
Type of change
Changes
Checklist: