Skip activation kernels when tensor size is zero#2848

Open
timmoon10 wants to merge 7 commits into NVIDIA:main from timmoon10:tmoon/activations-with-zero-size-tensor

Conversation

@timmoon10
Collaborator

Description

We have encountered obscure CUDA errors when calling SwiGLU on a tensor with no entries. This PR handles that case by skipping the kernel launch when the tensor is empty. Since the kernel implementation is heavily templated, this also fixes the bug for some quantization cases, although I haven't handled them exhaustively.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Skip activation kernels and quantization kernels when tensor size is zero.
  • Add empty-tensor test cases in activation unit tests.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10 timmoon10 requested a review from Oleg-Goncharov April 8, 2026 03:47
@timmoon10 timmoon10 added the bug Something isn't working label Apr 8, 2026
@timmoon10
Collaborator Author

/te-ci

@greptile-apps
Contributor

greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR fixes obscure CUDA errors triggered by calling SwiGLU and related quantization kernels on zero-element tensors by inserting early-return guards before kernel launches in 11 .cuh files across FP8, MXFP8, and NVFP4 paths. Empty-tensor test cases ({0, 128} and {128, 0}) are added to the activation and cast unit tests to prevent regressions.

Confidence Score: 5/5

Safe to merge — all remaining findings are minor dead-code style issues that do not affect runtime behavior.

The zero-size guards are consistently and correctly placed before alignment checks, TMA tensor-map creation, and kernel launches across all affected files. The IS_DBIAS dead-code branch in quantize_mxfp8.cuh and group_quantize_mxfp8.cuh is a cosmetic issue only; the actual error behavior for empty IS_DBIAS tensors is correct. Test coverage for empty tensors has been added to all relevant test suites.

transformer_engine/common/cast/mxfp8/quantize_mxfp8.cuh and group_quantize_mxfp8.cuh contain a dead NVTE_ERROR branch (P2 style).

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/cast/mxfp8/quantize_mxfp8.cuh | Adds IS_DBIAS workspace-shape guard and zero-size early return; the IS_DBIAS branch inside the zero-size guard (lines 639-641) is unreachable dead code because the NVTE_CHECK at line 626 always fires first for empty tensors. |
| transformer_engine/common/cast/mxfp8/group_quantize_mxfp8.cuh | Adds IS_DBIAS workspace-shape guard and zero-size early return; same dead IS_DBIAS NVTE_ERROR pattern as quantize_mxfp8.cuh (lines 873-876 unreachable). |
| transformer_engine/common/cast/mxfp8/gated_mxfp8.cuh | Moves rows/cols/output_cols declarations before a new early-return zero-size guard; also adds an else-NVTE_ERROR for the previously unguarded case where neither rowwise nor colwise scaling is set. |
| transformer_engine/common/cast/fp8/gated_fp8.cuh | Adds early return in cast_gated_tma when rows==0 or cols==0, correctly placed before grid dimension computation. |
| transformer_engine/common/cast/nvfp4/dequantize_nvfp4.cuh | Adds early return when N==0 or M==0, correctly placed before the FP4_BLOCK_SIZE divisibility check that would otherwise fail vacuously. |
| transformer_engine/common/cast/fp8/quantize_fp8.cuh | Adds early return in quantize_1D when N==0, correctly placed before the tile-alignment check. |
| transformer_engine/common/cast/mxfp8/dequantize_mxfp8.cuh | Adds early return when rows==0 or cols==0, correctly placed before TMA tensor-map creation. |
| transformer_engine/common/cast/nvfp4/group_quantize_transpose_nvfp4.cuh | Adds early return before the 32-row alignment check, correctly skipping it for empty tensors. |
| transformer_engine/common/cast/nvfp4/quantize_nvfp4.cuh | Adds early return when rows==0 or cols==0, correctly placed before kernel launch configuration. |
| transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh | Adds early return when rows==0 or cols==0, before the 32-row alignment check. |
| transformer_engine/common/cast/nvfp4/specialized/quantize_transpose_nvfp4_tuned_1D.cuh | Adds early return when rows==0 or cols==0, before the 32-row alignment check in the tuned 1D path. |
| tests/cpp/operator/test_act.cu | Adds {0,128} and {128,0} empty-tensor test cases; guards the amax comparison with N*H>0. |
| tests/cpp/operator/test_cast.cu | Adds empty-tensor cases and guards amax/scale_inv comparisons with a full_size>0 check. |
| tests/cpp/operator/test_cast_gated_swiglu.cu | Adds empty-tensor cases; guards the amax/scale_inv comparison with input_size>0. |
| tests/cpp/operator/test_cast_mxfp8_gated_swiglu.cu | Adds {0,128}/{128,0} empty-tensor matrix size cases to the mxfp8 gated SwiGLU test suite. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Kernel dispatch entry] --> B{IS_DBIAS?}
    B -- yes --> C[Validate dbias dtype/shape/workspace]
    C --> D{dbias_rows > 0 AND dbias_cols > 0?}
    D -- no --> E[NVTE_CHECK error: Invalid workspace shape]
    D -- yes --> F{workspace dptr == nullptr? workspace-size query}
    F -- yes --> G[Set workspace shape & return]
    F -- no --> H{rows == 0 OR cols == 0?}
    B -- no --> H
    H -- yes --> I[Return early: skip kernel launch]
    H -- no --> J[Proceed with CUDA kernel launch]
    style E fill:#f66,color:#fff
    style I fill:#6a6,color:#fff
    style J fill:#46a,color:#fff
```

Reviews (4): Last reviewed commit: "Merge branch 'main' into tmoon/activatio..."

```diff
 auto err = cudaGetLastError();
 ASSERT_EQ(err, cudaSuccess) << cudaGetErrorString(err);
-if (isFp8Type(otype)) {
+if (isFp8Type(otype) && full_size > 0) {
```
Collaborator


So this problem shows up not only for activations, but also for a regular cast?

Collaborator Author


Yep, many of our kernels are not robust to empty tensors. I still expect to see problems in the FP8 block-scale quantization kernels and transpose kernels.

Oleg-Goncharov
Oleg-Goncharov previously approved these changes Apr 8, 2026
Collaborator

@Oleg-Goncharov Oleg-Goncharov left a comment


LGTM

```cpp
// Skip kernel if tensor size is zero
if (elts_total == 0) {
  if constexpr (IS_DBIAS) {
    NVTE_ERROR("Invalid grouped tensor shape for DBias computation (first_logical_dim=",
```
Collaborator


In this case, couldn't we also output dbias as a zero tensor instead of throwing an error?

@timmoon10
Collaborator Author

/te-ci

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10
Collaborator Author

/te-ci

@timmoon10
Collaborator Author

/te-ci


Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants