Conversation

Copilot AI commented Jan 15, 2026

Summary

Qwen tokenizer patterns contain byte sequences that aren't valid UTF-8, causing PCRE2 compilation to fail with "UTF-8 error: isolated byte with 0x80 bit set". This prevents loading tokenizers for Qwen2.5 and OuteTTS models.

Changes:

  • pcre2_regex.cpp: Retry compilation without the PCRE2_UTF flag when UTF-8 validation fails (checks all 21 PCRE2_ERROR_UTF8_ERR* codes)
  • std_regex.cpp: Catch regex_error exceptions in find_all() to prevent crashes from pattern complexity errors
  • test_utf8_patterns.cpp: Add tests validating invalid UTF-8 pattern handling and complex lookahead fallback

Pattern compilation now follows: RE2 → PCRE2 (with UTF) → PCRE2 (without UTF) → std::regex, with proper error handling at each stage.

Test plan

Built and ran tokenizers test suite:

cd extension/llm/tokenizers/test/build
cmake .. && make test_regex test_utf8_patterns
./test_regex        # 3/3 tests pass
./test_utf8_patterns # 2/2 tests pass

Log output confirms UTF-8 fallback works:

I tokenizers:pcre2_regex.cpp:58] PCRE2 UTF-8 validation failed at offset 1: UTF-8 error: overlong 2-byte sequence. Retrying without UTF flags.
Original prompt

This section details the original issue to resolve

<issue_title>Error [tokenizers:re2_regex.cpp:26] Failed to compile Regex for Qwen2.5 Model</issue_title>
<issue_description>### 🐛 Describe the bug

Similar issue to "Running Qwen3 in Android LlamaDemo App shows an error while loading tokenizer" (#11311), hit while trying to run the LLM model (Qwen-2.5-0.5B) of OuteTTS v0.2 500M.

The tokenizer is a tokenizer.json file. I am using the one from the OuteTTS-0.2-500M HuggingFace repo: https://huggingface.co/OuteAI/OuteTTS-0.2-500M/blob/main/tokenizer.json

Using Executorch git main branch (deaf37f)

Error as reported in #11311

I tokenizers:regex.cpp:27] Registering override fallback regex
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1756372464.977639 9813813 re2.cc:237] Error parsing '(\<\|2228\|\>|\<\|2229\|\>|\<\|2230\|\>|\<\|2231\|\>|\<\|2232\|\>|\<\|2233\|\>|\<\|2234\|\>|\<\|2235...': invalid UTF-8
E tokenizers:re2_regex.cpp:26] Failed to compile regex: (\<\|2228\|\>|\<\|2229\|\>|\<\|2230\|\>|\<\|2231\|\>|\<\|2232\|\>|\<\|2233\|\>|\<\|2234\|\>|\<\|2235\|\>|\<\|2236\|\>|\<\|2237\|\>|\<\|2238\|\>|\<\|2239\|\>|\<\|2240\|\>|\<\|2241\|\>|\<\|2242\|\>|\<\|2243\|\>|\<\|2244\|\>|\<\|224$
I tokenizers:regex_lookahead.cpp:27] Creating PCRE2 regex
E tokenizers:pcre2_regex.cpp:36] PCRE2 compilation failed at offset 275: UTF-8 error: isolated byte with 0x80 bit set
I tokenizers:regex_lookahead.cpp:40] PCRE2 failed to compile pattern, falling back to std::regex.
I tokenizers:hf_tokenizer.cpp:109] Setting up normalizer...
I tokenizers:normalizer.cpp:89] Using NFC normalizer. Please notice that our implementation may not handle all edge cases.
I tokenizers:hf_tokenizer.cpp:113] Normalizer set up
I tokenizers:hf_tokenizer.cpp:127] Setting up pretokenizer...
E0000 00:00:1756372465.044814 9813813 re2.cc:237] Error parsing '((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s...': invalid perl operator: (?!
E tokenizers:re2_regex.cpp:26] Failed to compile regex: ((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+), error: invalid perl operator: (?!

I tokenizers:regex_lookahead.cpp:27] Creating PCRE2 regex
I tokenizers:hf_tokenizer.cpp:131] Pretokenizer set up
I tokenizers:hf_tokenizer.cpp:147] Loading BPE merges...
I tokenizers:hf_tokenizer.cpp:185] Loaded 151387 BPE merge rules
I tokenizers:hf_tokenizer.cpp:193] Built merge ranks map with 151387 entries
libc++abi: terminating due to uncaught exception of type std::__1::regex_error: The complexity of an attempted match against a regular expression exceeded a pre-set level.
zsh: abort      ./cmake-out/examples/models/llama/llama_main --model_path=$OUTETTS_MODEL

Steps used to install ExecuTorch, following the models/llama/README.md instructions:

# Download ExecuTorch
git clone https://github.com/pytorch/executorch.git && cd executorch

# Create a python virtual environment (python3.10) and activate it. 
python3.10 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip

# Update and pull the submodules
git submodule sync
git submodule update --init --recursive

# Run the setup script to install executorch. Install llama dependencies.
./install_executorch.sh
./examples/models/llama/install_requirements.sh

# Build ExecuTorch 
cmake --preset llm -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=cmake-out
cmake --build cmake-out -j9 --target install --config Release

# Build the LLama-Runner with adding flag -DSUPPORT_REGEX_LOOKAHEAD=ON
cmake -DCMAKE_INSTALL_PREFIX=cmake-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DSUPPORT_REGEX_LOOKAHEAD=ON \
    -Bcmake-out/examples/models/llama \
    examples/models/llama
cmake --build cmake-out/examples/models/llama -j9 --config Release

Note: I do get a CMake Warning regarding the regex flag:

CMake Warning:
  Manually-specified variables were not used by the project:

    SUPPORT_REGEX_LOOKAHEAD

Command to test the OuteTTS LLM Model:

OUTETTS_MODEL="outetts_8da4wfp32_gs32_1024.pte"
TOKENIZER="OuteTTS-0.2-500M/tokenizer.json"
./cmake-out/examples/models/llama/llama_main --model_path=$OUTETTS_MODEL --tokenizer_path=$TOKENIZER --prompt="Welcome to Executorch framework."

Config file (.yaml) to convert the outetts llm model to .pte:

base:
  model_class: qwen2_5
  checkpoint: ../OuteTTS-0.2-500M/converted_weights.pth
  params: ../OuteTTS-0.2-500M/params.json
  metadata: '{"get_bos_id":151643, "get_eos_ids":151645}'

model:
  use_kv_cache: true
  use_sdpa_with_kv_cache: false
  dtype_override: fp32

export:
  max_seq_length: 1024
  max_context_length: 1024
  output_name: outetts_8da4wfp32_gs32_1024.pte

quantization:
  embedding_quantize: 8,0
  qmode: 8da4w
  group_size: 32

backend:
  xnnpack:
    enabled: true
    extended_ops: true

debug:
 ...

</details>




- Fixes pytorch/executorch#14432



pytorch-bot bot commented Jan 15, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16631

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3701213 with merge base 33974d5:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the "CLA Signed" label Jan 15, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI changed the title from "[WIP] Fix regex compilation error for Qwen2.5 model" to "Fix PCRE2 UTF-8 validation errors for Qwen tokenizers" Jan 15, 2026
Copilot AI requested a review from kirklandsign January 15, 2026 19:00