🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18141
Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 8 Unrelated Failures
As of commit b29e89e with merge base cc27e6b:
- NEW FAILURES: the following jobs have failed
- FLAKY: the following jobs failed but were likely due to flakiness present on trunk
- BROKEN TRUNK: the following jobs failed but were present on the merge base
👉 Rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull request overview
Adds a Triton-based topk kernel for the ExecuTorch CUDA backend, replacing aten.topk.default during graph transformation. The kernel uses iterative argmax/argmin with masking and is registered via @triton_op.
Changes:
- New Triton topk kernel implementation with iterative max/min and masking algorithm
- Registration of the kernel in the edge-to-triton replacement pass
- Tests (eager correctness, export validation, E2E C++ runner) and a dedicated C++ test runner
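The iterative argmax-with-masking algorithm is simple to state: find the current maximum, emit its value and position, then overwrite that position with -inf so the next pass finds the runner-up. A minimal pure-Python sketch of the idea (illustrative only, not the Triton kernel itself):

```python
import math

def topk_iterative(row, k):
    """Illustrative reference for iterative argmax with masking:
    repeatedly locate the current maximum, record its value and index,
    then mask it with -inf in a working copy so the next iteration
    finds the next-largest element."""
    work = list(row)
    values, indices = [], []
    for _ in range(k):
        idx = max(range(len(work)), key=lambda j: work[j])  # argmax
        values.append(row[idx])
        indices.append(idx)
        work[idx] = -math.inf  # mask out the found maximum
    return values, indices
```

On ties this picks the first occurrence. Per the PR description, the actual kernel applies the same trick with argmin for the smallest-k direction.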
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| backends/cuda/triton/kernels/topk.py | New Triton topk kernel and its abstract/fake implementation |
| backends/cuda/triton/kernels/__init__.py | Export the new topk symbol |
| backends/cuda/triton/replacement_pass.py | Map aten.topk.default to the Triton kernel |
| backends/cuda/tests/test_topk.py | Eager correctness, export, and E2E tests |
| backends/cuda/tests/topk_runner/main.cpp | C++ runner for E2E testing |
| backends/cuda/tests/topk_runner/CMakeLists.txt | Build config for the C++ runner |
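The "abstract/fake implementation" listed for topk.py is what lets export trace the op without running the kernel: it only has to report output shapes and dtypes. A hedged sketch of the shape contract such a fake implementation must encode (the helper name is hypothetical, not the PR's code):

```python
def topk_output_shape(input_shape, k, dim=-1):
    """Hypothetical helper showing the shape contract a fake/meta
    implementation of topk must satisfy: both outputs (values and
    indices) share the input's shape, with size k along `dim`."""
    out = list(input_shape)
    out[dim] = k
    return tuple(out), tuple(out)  # (values_shape, indices_shape)
```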
Add a Triton-based topk kernel that replaces aten.topk during graph
transformation, compiled directly into the AOTInductor .so via
wrap_triton (no C++ fallback shim needed).
The kernel uses iterative argmax with masking, adapted from
FlagGems/aiter. It is registered via @triton_op("triton::topk") and
auto-substituted for aten.topk.default through ReplaceEdgeOpWithTritonOpPass.
Tests follow the chunk_gated_delta_rule pattern: eager correctness
across 8 configs, export validation, and E2E C++ runner comparison.
This PR was authored with the assistance of Claude Code.
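The replacement pass operates at the graph level: walk the exported graph's call nodes and swap any target found in a lookup table mapping edge ops to their Triton counterparts. A schematic sketch of that pattern (the node representation and names are illustrative, not ExecuTorch's actual ReplaceEdgeOpWithTritonOpPass):

```python
# Illustrative replacement table; the real pass maps edge-dialect ops
# to @triton_op-registered kernels such as triton::topk.
REPLACEMENTS = {"aten.topk.default": "triton::topk"}

def replace_edge_ops(nodes):
    """Rewrite each node's target in place when it appears in the
    replacement table; unmatched nodes pass through untouched."""
    for node in nodes:
        if node.get("target") in REPLACEMENTS:
            node["target"] = REPLACEMENTS[node["target"]]
    return nodes
```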
Force-pushed from fb5d204 to 00165ab.