Conversation

Copilot AI commented Jan 15, 2026

Summary

Qwen tokenizer patterns contain byte sequences that aren't valid UTF-8, causing PCRE2 compilation to fail with "UTF-8 error: isolated byte with 0x80 bit set". This prevents loading tokenizers for Qwen2.5 and OuteTTS models.

Changes:

  • pcre2_regex.cpp: Retry compilation without the PCRE2_UTF flag when UTF-8 validation fails (checks all 21 PCRE2_ERROR_UTF8_ERR* codes)
  • std_regex.cpp: Catch regex_error exceptions in find_all() to prevent crashes from pattern complexity errors
  • test_utf8_patterns.cpp: Add tests validating invalid UTF-8 pattern handling and complex lookahead fallback

Pattern compilation now follows: RE2 → PCRE2 (with UTF) → PCRE2 (without UTF) → std::regex, with proper error handling at each stage.

Test plan

Built and ran tokenizers test suite:

cd extension/llm/tokenizers/test/build
cmake .. && make test_regex test_utf8_patterns
./test_regex        # 3/3 tests pass
./test_utf8_patterns # 2/2 tests pass

Log output confirms UTF-8 fallback works:

I tokenizers:pcre2_regex.cpp:58] PCRE2 UTF-8 validation failed at offset 1: UTF-8 error: overlong 2-byte sequence. Retrying without UTF flags.
Original prompt

This section details the original issue to resolve

<issue_title>Error [tokenizers:re2_regex.cpp:26] Failed to compile Regex for Qwen2.5 Model</issue_title>
<issue_description>### 🐛 Describe the bug

Similar issue to "Running Qwen3 in Android LlamaDemo App shows an error while loading tokenizer" (#11311), hit while trying to run the LLM model (Qwen-2.5-0.5B) of OuteTTS v0.2 500M.

The tokenizer is a tokenizer.json file. I am using the one from the OuteTTS-0.2-500M HuggingFace repo: https://huggingface.co/OuteAI/OuteTTS-0.2-500M/blob/main/tokenizer.json

Using Executorch git main branch (deaf37f)

Error as reported in #11311

I tokenizers:regex.cpp:27] Registering override fallback regex
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1756372464.977639 9813813 re2.cc:237] Error parsing '(\<\|2228\|\>|\<\|2229\|\>|\<\|2230\|\>|\<\|2231\|\>|\<\|2232\|\>|\<\|2233\|\>|\<\|2234\|\>|\<\|2235...': invalid UTF-8
E tokenizers:re2_regex.cpp:26] Failed to compile regex: (\<\|2228\|\>|\<\|2229\|\>|\<\|2230\|\>|\<\|2231\|\>|\<\|2232\|\>|\<\|2233\|\>|\<\|2234\|\>|\<\|2235\|\>|\<\|2236\|\>|\<\|2237\|\>|\<\|2238\|\>|\<\|2239\|\>|\<\|2240\|\>|\<\|2241\|\>|\<\|2242\|\>|\<\|2243\|\>|\<\|2244\|\>|\<\|224$
I tokenizers:regex_lookahead.cpp:27] Creating PCRE2 regex
E tokenizers:pcre2_regex.cpp:36] PCRE2 compilation failed at offset 275: UTF-8 error: isolated byte with 0x80 bit set
I tokenizers:regex_lookahead.cpp:40] PCRE2 failed to compile pattern, falling back to std::regex.
I tokenizers:hf_tokenizer.cpp:109] Setting up normalizer...
I tokenizers:normalizer.cpp:89] Using NFC normalizer. Please notice that our implementation may not handle all edge cases.
I tokenizers:hf_tokenizer.cpp:113] Normalizer set up
I tokenizers:hf_tokenizer.cpp:127] Setting up pretokenizer...
E0000 00:00:1756372465.044814 9813813 re2.cc:237] Error parsing '((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s...': invalid perl operator: (?!
E tokenizers:re2_regex.cpp:26] Failed to compile regex: ((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+), error: invalid perl operator: (?!

I tokenizers:regex_lookahead.cpp:27] Creating PCRE2 regex
I tokenizers:hf_tokenizer.cpp:131] Pretokenizer set up
I tokenizers:hf_tokenizer.cpp:147] Loading BPE merges...
I tokenizers:hf_tokenizer.cpp:185] Loaded 151387 BPE merge rules
I tokenizers:hf_tokenizer.cpp:193] Built merge ranks map with 151387 entries
libc++abi: terminating due to uncaught exception of type std::__1::regex_error: The complexity of an attempted match against a regular expression exceeded a pre-set level.
zsh: abort      ./cmake-out/examples/models/llama/llama_main --model_path=$OUTETTS_MODEL

Steps used to install ExecuTorch, following the models/llama/README.md instructions:

# Download ExecuTorch
git clone https://github.com/pytorch/executorch.git && cd executorch

# Create a python virtual environment (python3.10) and activate it. 
python3.10 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip

# Update and pull the submodules
git submodule sync
git submodule update --init --recursive

# Run the setup script to install executorch. Install llama dependencies.
./install_executorch.sh
./examples/models/llama/install_requirements.sh

# Build ExecuTorch 
cmake --preset llm -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=cmake-out
cmake --build cmake-out -j9 --target install --config Release

# Build the LLama-Runner with adding flag -DSUPPORT_REGEX_LOOKAHEAD=ON
cmake -DCMAKE_INSTALL_PREFIX=cmake-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DSUPPORT_REGEX_LOOKAHEAD=ON \
    -Bcmake-out/examples/models/llama \
    examples/models/llama
cmake --build cmake-out/examples/models/llama -j9 --config Release

Note: I do get a CMake Warning regarding the regex flag:

CMake Warning:
  Manually-specified variables were not used by the project:

    SUPPORT_REGEX_LOOKAHEAD

Command to test the OuteTTS LLM Model:

OUTETTS_MODEL="outetts_8da4wfp32_gs32_1024.pte"
TOKENIZER="OuteTTS-0.2-500M/tokenizer.json"
./cmake-out/examples/models/llama/llama_main --model_path=$OUTETTS_MODEL --tokenizer_path=$TOKENIZER --prompt="Welcome to Executorch framework."

Config file (.yaml) to convert the outetts llm model to .pte:

base:
  model_class: qwen2_5
  checkpoint: ../OuteTTS-0.2-500M/converted_weights.pth
  params: ../OuteTTS-0.2-500M/params.json
  metadata: '{"get_bos_id":151643, "get_eos_ids":151645}'

model:
  use_kv_cache: true
  use_sdpa_with_kv_cache: false
  dtype_override: fp32

export:
  max_seq_length: 1024
  max_context_length: 1024
  output_name: outetts_8da4wfp32_gs32_1024.pte

quantization:
  embedding_quantize: 8,0
  qmode: 8da4w
  group_size: 32

backend:
  xnnpack:
    enabled: true
    extended_ops: true

debug:
 ...

</details>




- Fixes pytorch/executorch#14432



pytorch-bot bot commented Jan 15, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16631

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3701213 with merge base 33974d5:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the "CLA Signed" label Jan 15, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI changed the title from "[WIP] Fix regex compilation error for Qwen2.5 model" to "Fix PCRE2 UTF-8 validation errors for Qwen tokenizers" Jan 15, 2026
Copilot AI requested a review from kirklandsign January 15, 2026 19:00