Skip activation kernels when tensor size is zero#2848

Open
timmoon10 wants to merge 7 commits into NVIDIA:main from timmoon10:tmoon/activations-with-zero-size-tensor

Conversation

@timmoon10
Collaborator

Description

We have encountered obscure CUDA errors when calling SwiGLU on a tensor with no entries. This PR handles that case by skipping the kernel launch when the tensor is empty. Since the kernel implementation is heavily templated, this also fixes the bug for some quantization cases, although I haven't handled them exhaustively.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Skip activation kernels and quantization kernels when tensor size is zero.
  • Add empty-tensor test cases in activation unit tests.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10 timmoon10 requested a review from Oleg-Goncharov April 8, 2026 03:47
@timmoon10 timmoon10 added the bug Something isn't working label Apr 8, 2026
@timmoon10
Collaborator Author

/te-ci

@greptile-apps
Contributor

greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR fixes obscure CUDA errors triggered by calling SwiGLU and related quantization kernels on zero-element tensors by inserting early-return guards before kernel launches in 11 .cuh files across FP8, MXFP8, and NVFP4 paths. Empty-tensor test cases ({0, 128} and {128, 0}) are added to the activation and cast unit tests to prevent regressions.

Confidence Score: 5/5

Safe to merge — all remaining findings are minor dead-code style issues that do not affect runtime behavior.

The zero-size guards are consistently and correctly placed before alignment checks, TMA tensor-map creation, and kernel launches across all affected files. The IS_DBIAS dead-code branch in quantize_mxfp8.cuh and group_quantize_mxfp8.cuh is a cosmetic issue only; the actual error behavior for empty IS_DBIAS tensors is correct. Test coverage for empty tensors has been added to all relevant test suites.

transformer_engine/common/cast/mxfp8/quantize_mxfp8.cuh and group_quantize_mxfp8.cuh contain a dead NVTE_ERROR branch (P2 style).

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/cast/mxfp8/quantize_mxfp8.cuh | Adds IS_DBIAS workspace-shape guard and zero-size early return; the IS_DBIAS branch inside the zero-size guard (lines 639-641) is unreachable dead code because the NVTE_CHECK at line 626 always fires first for empty tensors. |
| transformer_engine/common/cast/mxfp8/group_quantize_mxfp8.cuh | Adds IS_DBIAS workspace-shape guard and zero-size early return; same dead IS_DBIAS NVTE_ERROR pattern as quantize_mxfp8.cuh (lines 873-876 unreachable). |
| transformer_engine/common/cast/mxfp8/gated_mxfp8.cuh | Moves rows/cols/output_cols declarations before a new early-return zero-size guard; also adds an else-NVTE_ERROR for the previously unguarded case where neither rowwise nor colwise scaling is set. |
| transformer_engine/common/cast/fp8/gated_fp8.cuh | Adds early return in cast_gated_tma when rows==0 or cols==0, correctly placed before grid dimension computation. |
| transformer_engine/common/cast/nvfp4/dequantize_nvfp4.cuh | Adds early return when N==0 or M==0, correctly placed before the FP4_BLOCK_SIZE divisibility check that would otherwise fail vacuously. |
| transformer_engine/common/cast/fp8/quantize_fp8.cuh | Adds early return in quantize_1D when N==0, correctly placed before the tile-alignment check. |
| transformer_engine/common/cast/mxfp8/dequantize_mxfp8.cuh | Adds early return when rows==0 or cols==0, correctly placed before TMA tensor-map creation. |
| transformer_engine/common/cast/nvfp4/group_quantize_transpose_nvfp4.cuh | Adds early return before the 32-row alignment check, correctly skipping it for empty tensors. |
| transformer_engine/common/cast/nvfp4/quantize_nvfp4.cuh | Adds early return when rows==0 or cols==0, correctly placed before kernel launch configuration. |
| transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh | Adds early return when rows==0 or cols==0, before the 32-row alignment check. |
| transformer_engine/common/cast/nvfp4/specialized/quantize_transpose_nvfp4_tuned_1D.cuh | Adds early return when rows==0 or cols==0, before the 32-row alignment check in the tuned 1D path. |
| tests/cpp/operator/test_act.cu | Adds {0,128} and {128,0} empty-tensor test cases; guards the amax comparison with N*H>0. |
| tests/cpp/operator/test_cast.cu | Adds empty-tensor cases and guards amax/scale_inv comparisons with a full_size>0 check. |
| tests/cpp/operator/test_cast_gated_swiglu.cu | Adds empty-tensor cases; guards the amax/scale_inv comparison with input_size>0. |
| tests/cpp/operator/test_cast_mxfp8_gated_swiglu.cu | Adds {0,128}/{128,0} empty-tensor matrix size cases to the mxfp8 gated SwiGLU test suite. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Kernel dispatch entry] --> B{IS_DBIAS?}
    B -- yes --> C[Validate dbias dtype/shape/workspace]
    C --> D{dbias_rows > 0 AND dbias_cols > 0?}
    D -- no --> E[NVTE_CHECK error: Invalid workspace shape]
    D -- yes --> F{workspace dptr == nullptr? workspace-size query}
    F -- yes --> G[Set workspace shape & return]
    F -- no --> H{rows == 0 OR cols == 0?}
    B -- no --> H
    H -- yes --> I[Return early: skip kernel launch]
    H -- no --> J[Proceed with CUDA kernel launch]
    style E fill:#f66,color:#fff
    style I fill:#6a6,color:#fff
    style J fill:#46a,color:#fff
```

Reviews (4): Last reviewed commit: "Merge branch 'main' into tmoon/activatio..."

```diff
 auto err = cudaGetLastError();
 ASSERT_EQ(err, cudaSuccess) << cudaGetErrorString(err);
-if (isFp8Type(otype)) {
+if (isFp8Type(otype) && full_size > 0) {
```
Collaborator


So this problem shows up not only for activations, but also for a regular cast?

Collaborator Author


Yep, many of our kernels are not robust to empty tensors. I still expect to see problems in the FP8 block-scale quantization kernels and transpose kernels.

Oleg-Goncharov
Oleg-Goncharov previously approved these changes Apr 8, 2026
Collaborator

@Oleg-Goncharov Oleg-Goncharov left a comment


LGTM

```cpp
// Skip kernel if tensor size is zero
if (elts_total == 0) {
  if constexpr (IS_DBIAS) {
    NVTE_ERROR("Invalid grouped tensor shape for DBias computation (first_logical_dim=",
```
Collaborator


In this case, couldn't we also output dbias as a zero tensor instead of throwing an error?

@timmoon10
Collaborator Author

/te-ci

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10
Collaborator Author

/te-ci

@timmoon10
Collaborator Author

/te-ci


Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants