
Conversation

@jethroqti
Contributor

Qualcomm AI Engine Direct - Enable GA Static Gemma2-2B model

Summary:

  • e2e script for GA Static Gemma2-2B; perf: 16a4w block quant, token rate in kv mode ~= 34.86 tokens/sec (SM8650); accuracy: PPL on the wikitext dataset fp 9.608 -> htp 11.446
  • add Gemma2 2B instruct model params config
  • remove qk-norm related parts
  • add soft capping in two places: attention module and after model output
  • update README with Gemma2 End-to-End example
  • add unit test for Gemma2-2B
  • add parameters to the MultiScopeAwareLlamaModel class to support the static llama architecture required by Gemma2
  • used params: conv2d: 16a4w_block, atten_WV: 16a8w, block_size: 32, num_sharding=4

Test plan:

```shell
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SN} -H ${HOST} -m ${CHIPID} --temperature 0 --model_mode kv --max_seq_len 1024 --prefill_ar_len 128 --decoder_model gemma2-2b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
```
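
For reference, the `16a4w_block` scheme in the params above quantizes weights in fixed-size blocks (block_size 32 here), with one scale per block. Below is a simplified pure-Python sketch of symmetric per-block 4-bit weight quantization; it is illustrative only, and the actual scheme in the Qualcomm backend may differ in rounding and range handling:

```python
def quantize_block(block, n_bits=4):
    # Symmetric per-block quantization: a single scale covers one block
    # (e.g. 32 consecutive weights), so an outlier only inflates the scale
    # of its own block instead of the whole tensor.
    qmax = 2 ** (n_bits - 1) - 1                  # 7 for signed 4-bit
    scale = max(abs(w) for w in block) / qmax
    scale = scale if scale > 0 else 1.0           # guard for an all-zero block
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in block]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

weights = [0.02 * i - 0.3 for i in range(32)]     # toy block of 32 weights
q, s = quantize_block(weights)
recon = dequantize_block(q, s)
max_err = max(abs(a - b) for a, b in zip(weights, recon))
print(max_err <= s / 2 + 1e-12)                   # rounding error bounded by scale/2
```

Smaller blocks give tighter scales (better accuracy) at the cost of storing more scale values, which is the trade-off block_size: 32 is balancing.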

@pytorch-bot

pytorch-bot bot commented Jan 15, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16624

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 Cancelled Jobs, 1 Unrelated Failure

As of commit 34b58dd with merge base f680623:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

BROKEN TRUNK - The following job failed but was already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 15, 2026
@jethroqti
Contributor Author

@pytorchbot label "release notes: qualcomm"

@pytorch-bot pytorch-bot bot added the release notes: qualcomm Changes to the Qualcomm backend delegate label Jan 15, 2026
@jethroqti
Contributor Author

This PR enables the GA static Gemma2-2B instruct model. Please take a look. Thanks!

@cccclai @haowhsu-quic @DannyYuyang-quic

@meta-codesync
Contributor

meta-codesync bot commented Jan 16, 2026

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D90815003.

@cccclai cccclai merged commit e725fe4 into pytorch:main Jan 16, 2026
150 of 156 checks passed
@mergennachin
Contributor

mergennachin commented Jan 17, 2026

@jethroqti see #16684

mergennachin added a commit that referenced this pull request Jan 17, 2026
Follow-up to #16624, which didn't `git add` the required folder `examples/models/gemma2`. As a result, the config and conversion script were missing when the PR landed.

Our CI is timing out due to

```
  # test_qnn_delegate.py:6430-6433
  p = subprocess.Popen(cmds, stdout=subprocess.DEVNULL)
  with Listener((self.ip, self.port)) as listener:
      conn = listener.accept()  # blocks forever waiting for a connection from a
                                # subprocess that already died with a missing-module error
      p.communicate()
```


https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=test-static-llama-qnn-linux&mergeEphemeralLF=true
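
One way to avoid this class of hang is to poll the child before blocking on `accept()`, so a worker that dies on import is reported instead of leaving the listener waiting forever. A hypothetical sketch, not the actual harness code (`wait_ready` and the deadline value are illustrative):

```python
import subprocess
import sys
import time

def wait_ready(proc: subprocess.Popen, deadline_s: float) -> bool:
    # Poll the child until the deadline; return False as soon as it exits,
    # so the caller can skip listener.accept() instead of blocking forever.
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        if proc.poll() is not None:   # child already exited (e.g. ImportError)
            return False
        time.sleep(0.1)
    return True

p = subprocess.Popen(
    [sys.executable, "-c", "import nonexistent_module_xyz"],
    stderr=subprocess.DEVNULL,
)
if not wait_ready(p, deadline_s=5.0):
    print("child exited early; skipping accept()")
```

An alternative with the same effect is to put a timeout on the listening socket (e.g. `socket.setdefaulttimeout` before constructing the `Listener`), so `accept()` raises instead of hanging until the CI job is killed.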
@cccclai
Contributor

cccclai commented Jan 17, 2026

Oh sorry, I should have paid closer attention to the CI. I didn't catch the timeout issue. Thank you for the forward fix!

@jethroqti
Contributor Author

@jethroqti see #16684

Got it. I am handling it now. Sorry about that.
