Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses potential numeric overflow/invalid-range handling when parsing numeric character references, and adds regression coverage to ensure boundary behavior is correct.
Changes:
- Add regression tests for numeric character references at/above the Unicode max (U+10FFFF / U+110000) and for extremely long hex entities.
- Harden
XMLUtil::GetCharacterRefto reject out-of-range numeric references during accumulation to prevent wrap/overflow-based bypasses. - Remove an outdated build/formatting comment block from
tinyxml2.h.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
xmltest.cpp |
Adds security regression tests covering max-code-point and overflow-sized numeric entities. |
tinyxml2.cpp |
Adds in-loop bounds checks in numeric entity parsing and fixes an incorrect length pointer check. |
tinyxml2.h |
Removes an obsolete comment block with build/formatting notes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Regression: U+10FFFF is the last valid Unicode code point and must | ||
| // parse correctly. Fix #2 must not reject the maximum valid value. | ||
| XMLDocument doc; | ||
| doc.Parse( "<t v=''/>" ); | ||
| XMLTest( "Numeric ref U+10FFFF: no error", false, doc.Error() ); | ||
| const char* v = doc.FirstChildElement()->Attribute( "v" ); | ||
| // U+10FFFF encodes to the 4-byte UTF-8 sequence F4 8F BF BF. | ||
| const char expected[] = { | ||
| static_cast<char>(0xF4), static_cast<char>(0x8F), | ||
| static_cast<char>(0xBF), static_cast<char>(0xBF), 0 | ||
| }; | ||
| XMLTest( "Numeric ref U+10FFFF: correct UTF-8 output", expected, v ); | ||
| } | ||
| { | ||
| // Fix #2 boundary: U+110000 is one above the maximum code point. | ||
| // The in-loop overflow guard must catch this before ucs is written, | ||
| // leaving the entity as a literal (starting with '&'). | ||
| XMLDocument doc; |
There was a problem hiding this comment.
The new test comments refer to "Fix #2", but there’s no local context explaining what #2 is (issue/PR/CVE), which makes the intent harder to understand later. Consider rewording these to reference the specific behavior being tested (e.g., “GetCharacterRef max code point handling”) or link to the relevant issue/CVE in the comment.
| const unsigned int digitScaled = mult * digit; | ||
| // Reject before adding: if digitScaled alone exceeds MAX_CODE_POINT, | ||
| // or if adding it to ucs would exceed it (checked without overflow by | ||
| // testing ucs > MAX_CODE_POINT - digitScaled, safe since digitScaled | ||
| // <= MAX_CODE_POINT at this point). | ||
| if (digitScaled > MAX_CODE_POINT || ucs > MAX_CODE_POINT - digitScaled) { | ||
| return 0; | ||
| } | ||
| ucs += digitScaled; |
There was a problem hiding this comment.
digitScaled is declared as unsigned int while mult, digit, ucs, and MAX_CODE_POINT are uint32_t. On platforms where unsigned int is narrower than 32 bits, digitScaled can truncate and bypass the overflow/range checks. Make digitScaled a uint32_t (or wider) to keep the math and comparisons in a consistent width.
Fix potential overflows.