
Conversation

@kashif (Contributor) commented Jan 16, 2026

What does this PR do?

This PR unifies the Qwen prompt-embed handling so that no attention mask is created when there is no padding (keeping flash SDPA eligible and removing unnecessary masks). This fixes the inference speed regression.
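
A minimal sketch of the idea, reusing the `build_prompt_embeds_and_mask` helper name from the diff discussed below (the exact signature, mask dtype, and body here are assumptions, not the PR's actual code):

import torch

def build_prompt_embeds_and_mask(split_hidden_states):
    # split_hidden_states: one [seq_len_i, hidden_dim] tensor per prompt.
    max_seq_len = max(u.size(0) for u in split_hidden_states)
    prompt_embeds = torch.stack(
        [torch.cat([u, u.new_zeros(max_seq_len - u.size(0), u.size(1))]) for u in split_hidden_states]
    )
    # Decide about the mask in plain Python, before anything is compiled:
    # if nothing is padded, return None so SDPA keeps its flash fast path.
    if all(u.size(0) == max_seq_len for u in split_hidden_states):
        return prompt_embeds, None
    attn_mask_list = [u.new_ones(u.size(0), dtype=torch.bool) for u in split_hidden_states]
    encoder_attention_mask = torch.stack(
        [torch.cat([m, m.new_zeros(max_seq_len - m.size(0))]) for m in attn_mask_list]
    )
    return prompt_embeds, encoder_attention_mask

Because the None-or-tensor decision happens in eager Python before the transformer runs, the compiled graph never has to branch on tensor data.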

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@kashif kashif mentioned this pull request Jan 16, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available for 30 days after the last update.

@kashif kashif requested a review from sayakpaul January 16, 2026 15:10
@sayakpaul (Member) left a comment


Thanks!

Could you update with the following things?

  • Shed light on what caused the speed regression
  • Add a test with masks in the compilation tests here (a sketch of what that could look like follows this list)
  • Do a before/after comparison of the outputs with this PR
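
A hedged sketch of what such a compilation test could look like (the toy module, shapes, and test name are assumptions; the real test would exercise the Qwen transformer through the repo's existing compile-test setup):

import torch
import torch.nn.functional as F

class ToyAttention(torch.nn.Module):
    # Stand-in for the real transformer block; only the mask plumbing matters here.
    def forward(self, q, k, v, encoder_attention_mask=None):
        return F.scaled_dot_product_attention(q, k, v, attn_mask=encoder_attention_mask)

def test_compile_with_and_without_mask():
    compiled = torch.compile(ToyAttention(), fullgraph=True)
    q = k = v = torch.randn(1, 4, 8, 16)
    # No padding: mask stays None, so the SDPA flash fast path remains eligible.
    out_no_mask = compiled(q, k, v)
    # Padding present: a boolean mask must also compile cleanly (this triggers a
    # recompile, not a graph break, since None-vs-tensor is an input-signature guard).
    mask = torch.ones(1, 1, 8, 8, dtype=torch.bool)
    mask[..., -2:] = False  # mask out the last two (padded) key positions
    out_masked = compiled(q, k, v, encoder_attention_mask=mask)
    assert out_no_mask.shape == out_masked.shape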

@kashif (Contributor, Author) commented Jan 16, 2026

will do!

# before: the pipeline stacked its own padded mask unconditionally
encoder_attention_mask = torch.stack(
    [torch.cat([u, u.new_zeros(max_seq_len - u.size(0))]) for u in attn_mask_list]
)
# after (this PR): the unified helper returns the mask, or None when nothing is padded
prompt_embeds, encoder_attention_mask = build_prompt_embeds_and_mask(split_hidden_states)
A Collaborator left a comment

Curiously, can't we just do:

if encoder_attention_mask.all():
    encoder_attention_mask = None

@kashif (Contributor, Author) replied

the issue was that this was giving trouble with torch.compile... branching on a tensor's `.all()` value needs a concrete Python bool, which is data-dependent control flow that the compiler can't trace through.
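
For illustration only (not the PR's code), a minimal repro of that failure mode under `fullgraph=True`:

import torch

@torch.compile(fullgraph=True)  # fullgraph turns the graph break into a hard error
def maybe_drop_mask(mask):
    # `mask.all()` is data-dependent: Dynamo needs a concrete Python bool to
    # pick a branch, so it must break the graph (or error with fullgraph=True).
    if mask.all():
        return None
    return mask

maybe_drop_mask(torch.ones(2, 77, dtype=torch.bool))

The unified helper sidesteps this by deciding in plain Python, at mask-build time, whether any padding exists, so the compiled model only ever receives None or a real mask.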

