feat: support ring attention with arbitrary KV sequence lengths#13545
songh11 wants to merge 4 commits into huggingface:main
Conversation
Thanks for your PR. Could you also provide some visual output examples and comparison with other backends we support (latency, memory overhead, etc.)?
@sayakpaul Thanks for the review! I just pushed two follow-up commits:
FLUX.1-dev benchmarks on 4 × RTX 4090 (48GB)
Visual outputs
I'll attach side-by-side generated images (same prompt + seed: single-GPU baseline vs.
Thanks for the updates, they are very comprehensive. It seems like ulysses_anything is still the best, but users who prefer ring would want to use ring_anything for non-divisible shapes? Is that the motivation here?
Thanks for the review! You're spot on — the primary motivation is to provide a seamless path for users on a ring topology to handle non-divisible sequence lengths, similar to how ulysses_anything works. While ulysses/ulysses_anything is often the go-to, I believe ring_anything is a valuable addition for a few key reasons:
So ulysses/ulysses_anything remains the recommended choice for most users, while ring_anything covers head-constrained or ring-first setups. It focuses on correctness today, with clear headroom for P2P overlap in the next step.
Your reasoning is spot on! Let's include them in the docs too.
@claude could you do a review here?
Claude finished @sayakpaul's task in 2m 29s.
Review of PR #13545: Ring Anything Attention
Overall this is a well-structured PR that cleanly mirrors the existing patterns. The core algorithm (pad → all-gather → slice → online-softmax merge) is correct, and the config validation + hook routing are consistent with
Duplicate hook branches should be merged
    if self.parallel_config.ulysses_anything:
        return PartitionAnythingSharder.shard_anything(...)
    elif self.parallel_config.ring_anything:
        return PartitionAnythingSharder.shard_anything(...)
Since the bodies are identical, these should be merged:
    if self.parallel_config.ulysses_anything or self.parallel_config.ring_anything:
        return PartitionAnythingSharder.shard_anything(...)
Same for the
Incomplete docstring comment for
@sayakpaul Thanks! Addressed the feedback by merging duplicate hooks, adding the exclusivity check, and polishing comments. Pushed in the latest commit.
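For readers following along, here is a minimal sketch of what such validation could look like. It is not the code from this PR: `ToyParallelConfig` and `uses_anything_sharder` are hypothetical names, and the assumption that the exclusivity check is between `ring_anything` and `ulysses_anything` is mine. The degree constraint (`ring_degree > 1` and `ulysses_degree == 1`) is taken from the PR description.

```python
from dataclasses import dataclass


@dataclass
class ToyParallelConfig:
    """Hypothetical stand-in, NOT the diffusers ContextParallelConfig."""

    ring_degree: int = 1
    ulysses_degree: int = 1
    ring_anything: bool = False
    ulysses_anything: bool = False

    def __post_init__(self):
        # Assumed exclusivity check: only one "anything" mode at a time.
        if self.ring_anything and self.ulysses_anything:
            raise ValueError("ring_anything and ulysses_anything are mutually exclusive.")
        # Constraint stated in the PR description for ring_anything.
        if self.ring_anything and not (self.ring_degree > 1 and self.ulysses_degree == 1):
            raise ValueError("ring_anything requires ring_degree > 1 and ulysses_degree == 1.")

    @property
    def uses_anything_sharder(self) -> bool:
        # Mirrors the merged hook branch: either flag routes to the same sharder.
        return self.ring_anything or self.ulysses_anything


cfg = ToyParallelConfig(ring_degree=4, ring_anything=True)
print(cfg.uses_anything_sharder)
```

Putting the checks in `__post_init__` means an invalid combination fails at construction time rather than deep inside the attention dispatch.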
What does this PR do?
Adds a new "Ring Anything" context-parallel attention mode that supports arbitrary
(non-evenly divisible) KV sequence lengths across ring-degree workers.
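To make the idea concrete, the following is a rough single-process sketch of the flow the review summarizes as pad → all-gather → slice → online-softmax merge. It is an illustration under stated assumptions, not the implementation in this PR: it uses plain Python lists instead of tensors, a single query vector, a list of per-rank shards in place of a real all-gather, and the helper names (`attend`, `merge`, `pad`, `simulate_ring_anything`) are hypothetical.

```python
import math


def attend(q, keys, values):
    """Single-query attention over one KV chunk; returns (output, logsumexp)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(len(values[0]))]
    return out, m + math.log(total)


def merge(out_a, lse_a, out_b, lse_b):
    """Online-softmax merge of two partial results via their log-sum-exps."""
    m = max(lse_a, lse_b)
    lse = m + math.log(math.exp(lse_a - m) + math.exp(lse_b - m))
    wa, wb = math.exp(lse_a - lse), math.exp(lse_b - lse)
    return [wa * x + wb * y for x, y in zip(out_a, out_b)], lse


def pad(rows, max_len, dim):
    """Pad a shard with zero rows so every rank's buffer has the same shape."""
    return rows + [[0.0] * dim for _ in range(max_len - len(rows))]


def simulate_ring_anything(q, local_kv):
    """local_kv holds one (keys, values) shard per rank; lengths may differ."""
    dim = len(q)  # assumption: keys and values share the query's width
    lengths = [len(k) for k, _ in local_kv]
    max_len = max(lengths)
    # Pad: equal-sized buffers are what a real all-gather would exchange,
    # together with the true lengths.
    gathered = [(pad(k, max_len, dim), pad(v, max_len, dim)) for k, v in local_kv]
    out, lse = None, None
    for (k, v), n in zip(gathered, lengths):
        if n == 0:
            continue  # a rank may hold no tokens at all
        # Slice: drop the padding before attending to this rank's shard.
        o, l = attend(q, k[:n], v[:n])
        out, lse = (o, l) if out is None else merge(out, lse, o, l)
    return out


q = [0.3, -1.2]
keys = [[math.sin(i), math.cos(i)] for i in range(7)]
values = [[float(i), math.cos(2 * i)] for i in range(7)]
# 7 KV tokens over 3 "ranks": shard lengths 3, 2, 2 (not evenly divisible).
local_kv = [(keys[a:b], values[a:b]) for a, b in [(0, 3), (3, 5), (5, 7)]]

merged = simulate_ring_anything(q, local_kv)
reference, _ = attend(q, keys, values)
max_err = max(abs(a - b) for a, b in zip(merged, reference))
print(max_err)
```

The merged result should match attention over the unsplit KV up to floating-point error, because each partial output is reweighted by `exp(lse_i - lse)` before summing.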
Motivation
Existing TemplatedRingAttention requires KV to be equipartitioned across ranks, which is impractical for real-world workloads where per-rank sequence lengths can differ (e.g., variable-length prompts, packed batches, token pruning). This PR mirrors the existing ulysses_anything design but applies it to the ring path.
Changes
ContextParallelConfig: add ring_anything flag with validation (ring_degree > 1 and ulysses_degree == 1).
TemplatedRingAnythingAttention: new autograd Function that
_templated_context_parallel_attention: dispatch to the new class when ring_anything is enabled.
ContextParallelSplitHook: route through PartitionAnythingSharder.shard_anything when ring_anything is set.
Reproducible example
Launch
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@yiyixuxu @asomoza @sayakpaul