Add chunk_gated_delta_rule triton kernel for CUDA backend #18138

mergennachin wants to merge 2 commits into main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18138

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 8 Unrelated Failures

As of commit 530ddb2 with merge base cc27e6b:

- NEW FAILURES - The following jobs have failed.
- FLAKY - The following jobs failed but were likely due to flakiness present on trunk.
- BROKEN TRUNK - The following jobs failed but were present on the merge base.

👉 Rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
```python
@chunk_gated_delta_rule.register_fake
def _chunk_gated_delta_rule_fake(
```
why fake instead of meta? what's the difference?
The SDPA kernel in this repo uses register_fake so I followed the same convention.
```python
CHUNK_SIZE = 64
```

```python
def _unwrap(kernel):
```
what's going on with the unwrap stuff?
Pull request overview
Adds an ExecuTorch CUDA Triton custom op wrapper for Flash-Linear-Attention (FLA)’s chunk_gated_delta_rule, plus end-to-end validation via Python export tests and a small C++ runner that executes the exported .pte with the CUDA delegate in CI.
Changes:
- Introduces `triton::chunk_gated_delta_rule` as a `@triton_op`, wrapping multiple FLA Triton kernels via `wrap_triton()` for AOTInductor compilation.
- Adds Python tests to validate eager correctness vs FLA and to export/lower the op to an ExecuTorch program.
- Adds a C++ e2e runner and wires CI to install FLA, run tests, export a model, build the runner, and execute it.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| backends/cuda/triton/kernels/chunk_gated_delta_rule.py | Registers the new Triton op and launches the underlying FLA kernels via wrap_triton(). |
| backends/cuda/triton/kernels/__init__.py | Conditionally imports/registers the new op when FLA is available. |
| backends/cuda/tests/test_chunk_gated_delta_rule.py | Adds eager + export/lower tests for the new op (skipped when FLA isn’t installed). |
| backends/cuda/tests/chunk_gated_delta_runner/main.cpp | Adds a minimal C++ program to load and run the exported model with CUDA delegate. |
| backends/cuda/tests/chunk_gated_delta_runner/CMakeLists.txt | Adds build configuration for the new C++ runner. |
| .github/workflows/cuda.yml | Installs FLA and runs the new Python + C++ e2e coverage in CUDA CI. |
```python
    V=V,
    BT=BT,
    BK=64,
    BV=64,
    USE_G=True,
```
BK and BV are hard-coded to 64 when launching recompute_w_u_fwd_kernel, which implicitly constrains supported head/value dims. Either derive these launch-time constexprs from K/V (if the FLA kernels support it) or validate/document the required K/V constraints in the @triton_op docstring and input checks to prevent confusing shape-dependent failures.
Force-push: 0550e5a → 43ee833
Force-push: 43ee833 → 8cd3f9c
Force-push: 8cd3f9c → 7d132d1
Force-push: 7d132d1 → 4f37bc7
```python
pte_path = os.path.join(output_dir, "chunk_gated_delta.pte")
with open(pte_path, "wb") as f:
    f.write(et_program.buffer)
```
export_chunk_gated_delta() writes the program using et_program.buffer, which forces materializing the entire serialized program into a contiguous bytes object. ExecuTorchProgramManager explicitly recommends write_to_file() to avoid extra copies and reduce peak memory. Consider switching to et_program.write_to_file(f) here.
```diff
-    f.write(et_program.buffer)
+    et_program.write_to_file(f)
```
Force-push: 4f37bc7 → 97beeeb
Force-push: 97beeeb → 6ee0408
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Registers FLA's chunk_gated_delta_rule as a @triton_op, following the same pattern as the existing SDPA triton kernel. Six FLA triton kernels are launched via wrap_triton() so AOTInductor compiles them directly into the generated .so — no C++ shim needed.

Key trick: FLA kernels use @triton.heuristics, which wrap_triton doesn't support. We unwrap via kernel.fn to get the inner @triton.autotune kernel and pass heuristic values (USE_G, IS_VARLEN, etc.) explicitly.

Requires: pip install flash-linear-attention
Force-push: 6ee0408 → 21c5dd7
```python
    return o, final_state
```

```python
def _make_inputs_from_fla(
```
maybe we need to test NaN, inf, and -inf here, following what we learned before.
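A sketch of the kind of special-value coverage being suggested (hypothetical helper; in the actual test these values would be injected into the q/k/v input tensors before running the kernel):

```python
import math

# Non-finite values worth probing kernel numerics with.
SPECIAL_VALUES = [float("nan"), float("inf"), float("-inf")]

def with_special_value(values, index, special):
    """Return a copy of `values` with one entry replaced by a
    non-finite special value, for probing kernel edge cases."""
    out = list(values)
    out[index] = special
    return out

cases = [with_special_value([0.1, 0.2, 0.3], 0, s) for s in SPECIAL_VALUES]
assert math.isnan(cases[0][0])
assert math.isinf(cases[1][0]) and cases[1][0] > 0
assert math.isinf(cases[2][0]) and cases[2][0] < 0
```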