Qualcomm AI Engine Direct - Enable GA Static Gemma2-2B model #16624
Conversation
Summary:
Enable one GA model: Static Gemma2-2B
- add e2e script for GA Static Gemma2-2B
  - perf: 16a4w block quant, token rate in kv mode ~= 34.864699 tokens/sec (SM8650)
  - acc: wikitext PPL ~= 9.608 (fp) -> 11.446 (htp)
- add Gemma2-2B instruct model params config
- remove qk-norm related parts
- add soft capping in two places: in the attention module and after the model output
- update README with a Gemma2 end-to-end example
- add unit test for Gemma2-2B
- add parameters in class MultiScopeAwareLlamaModel to support the static llama architecture required by Gemma2
- used params: conv2d: 16a4w_block, atten_WV: 16a8w, block_size: 32, num_sharding=4
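The soft capping added above follows the published Gemma2 recipe: logits are squashed through a scaled tanh so they stay in a bounded range, once on the attention scores and once on the final output logits. A minimal plain-Python sketch — the cap values (50.0 for attention, 30.0 for final logits) come from the public Gemma2 config, not from this PR:

```python
import math

# Cap values from the published Gemma2 config (an assumption here,
# not taken from this PR's code).
ATTN_LOGIT_SOFTCAP = 50.0
FINAL_LOGIT_SOFTCAP = 30.0

def soft_cap(logit: float, cap: float) -> float:
    """Smoothly bound a logit to (-cap, cap): cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)
```

In the model this is applied elementwise: to the attention scores before softmax, and to the output logits before sampling.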
Test plan:
```bash
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SN} -H ${HOST} -m ${CHIPID} --temperature 0 --model_mode kv --max_seq_len 1024 --prefill_ar_len 128 --decoder_model gemma2-2b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
```
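The `16a4w_block` scheme in the params above stores 4-bit signed weights with one scale per block (block_size 32 here). A minimal illustrative sketch of symmetric per-block quantization — plain Python for clarity, not the actual QNN quantizer:

```python
def quantize_block(weights, n_bits=4):
    """Symmetric per-block quantization: one shared scale per block of weights."""
    qmax = 2 ** (n_bits - 1) - 1  # 7 for 4-bit signed
    # Scale maps the largest-magnitude weight in the block to qmax.
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_block(q, scale):
    """Recover approximate float weights from quantized values and the block scale."""
    return [v * scale for v in q]
```

The per-block scale is why block quantization tolerates outliers better than per-tensor quantization: a single large weight only degrades the precision of its own 32-element block.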
🔗 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16624

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 Cancelled Jobs, 1 Unrelated Failure as of commit 34b58dd with merge base f680623.

CANCELLED JOBS - The following jobs were cancelled. Please retry.
BROKEN TRUNK - The following job failed but was present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "release notes: qualcomm"
This PR enables GA static Gemma2-2B instruct. Please have a look. Thanks!
@cccclai @jethroqti The test-static-llama-qnn-linux job started failing after this PR. cc @rascani
@jethroqti see #16684
Follow-up to #16624, which didn't `git add` the necessary folder `examples/models/gemma2`. As a result, the PR landed without the required config and conversion script. Our CI is timing out due to:

```python
# test_qnn_delegate.py:6430-6433
p = subprocess.Popen(cmds, stdout=subprocess.DEVNULL)
with Listener((self.ip, self.port)) as listener:
    conn = listener.accept()  # ← blocks forever waiting for a connection,
                              # even though the subprocess already failed
                              # with a module-not-found error
p.communicate()
```

https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=test-static-llama-qnn-linux&mergeEphemeralLF=true
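One way to keep a harness like this from wedging CI (a hypothetical sketch, not the actual fix in #16684): run `accept()` on a worker thread and close the listener if no client connects within a deadline, so a crashed subprocess surfaces as a `TimeoutError` instead of an infinite hang. The `accept_with_timeout` helper below is invented for illustration:

```python
import threading
from multiprocessing.connection import Client, Listener

def accept_with_timeout(listener, timeout):
    """Accept a connection, raising TimeoutError instead of blocking forever."""
    result = {}

    def _accept():
        try:
            result["conn"] = listener.accept()
        except OSError:
            pass  # listener was closed by the timeout path below

    t = threading.Thread(target=_accept, daemon=True)
    t.start()
    t.join(timeout)
    if "conn" not in result:
        listener.close()  # unblocks the pending accept() in the worker thread
        raise TimeoutError(f"no client connected within {timeout}s")
    return result["conn"]

# Demo: port 0 lets the OS pick a free port; listener.address has the real one.
listener = Listener(("127.0.0.1", 0))
client = Client(listener.address)
conn = accept_with_timeout(listener, timeout=5.0)
client.send("ready")
print(conn.recv())  # -> "ready"
conn.close(); client.close(); listener.close()
```

In the real test one would also check `p.poll()` before and after the accept, so a subprocess that died early is reported as the root cause rather than the timeout.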
Oh sorry, I should have paid closer attention to the CI - I didn't catch the timeout issue. Thank you for the forward fix!
Got it. I am handling it now. Sorry about this.