
fix: RSS feed metadata leakage and PPTX table pipe escaping #1785

Open
JasonOA888 wants to merge 1 commit into microsoft:main from JasonOA888:fix/rss-child-scoping-pptx-pipe-escape


Summary

Two bugs found and fixed in the converter layer.

Bug 1: RSS/Atom metadata leakage

_get_data_by_tag_name() used getElementsByTagName(), which searches all descendants recursively. When a channel/feed lacked its own <title> or <description>, the method returned the first matching element from a nested <item>/<entry> instead, causing:

  • Item titles appearing as the feed title
  • Item descriptions appearing as the feed description
  • Incorrect title in DocumentConverterResult

Fix: Added _get_direct_child_data() that only searches immediate child elements, and used it for channel/feed-level metadata. Also fixed UnboundLocalError when channel_title is None and channel_description is not.
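
The difference between the two lookups can be sketched with xml.dom.minidom. This is a minimal illustration, not the code from this PR: the function names, signatures, and the text_of helper are assumptions made for the example, and the feed mirrors the reproduction case (a channel with a description but no title).

```python
from xml.dom import minidom

# Feed mirroring the reproduction: the channel has a description but no title.
RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
<description>My Channel</description>
<item><title>Item Title</title><description>Item Desc</description></item>
</channel></rss>"""


def text_of(element):
    """Concatenate the text nodes directly inside an element (illustrative helper)."""
    return "".join(
        node.data
        for node in element.childNodes
        if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE)
    ).strip()


def get_recursive_data(element, tag_name):
    """Recursive lookup: first match anywhere below `element`, including items."""
    matches = element.getElementsByTagName(tag_name)
    return text_of(matches[0]) if matches else None


def get_direct_child_data(element, tag_name):
    """Hypothetical direct-child lookup: only immediate children of `element`."""
    for child in element.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.tagName == tag_name:
            return text_of(child)
    return None


channel = minidom.parseString(RSS).getElementsByTagName("channel")[0]

recursive_title = get_recursive_data(channel, "title")        # leaks from <item>
direct_title = get_direct_child_data(channel, "title")        # channel has no title
direct_description = get_direct_child_data(channel, "description")
```

With this feed, the recursive lookup picks up the title of the nested <item>, while the direct-child lookup reports that the channel has no title and still finds the channel's own description.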

Bug 2: PPTX table pipe escaping

_convert_table_to_markdown() used html.escape() on cell text, which does not escape pipe characters (|). When cell content contained |, the resulting markdown table had extra columns.

Fix: Added .replace("|", "\|") after html.escape().
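
A minimal sketch of the escaping step, assuming cells are escaped one at a time and then joined into a row. escape_cell and to_markdown_row are hypothetical names for this example, not functions from the converter. The sketch writes the replacement as "\\|", which yields a backslash followed by a pipe without relying on an unrecognized escape sequence.

```python
import html


def escape_cell(text):
    """Escape HTML special characters, then pipes, for a Markdown table cell."""
    return html.escape(text).replace("|", "\\|")


def to_markdown_row(cells):
    """Join already-escaped cells into one Markdown table row."""
    return "| " + " | ".join(cells) + " |"


cells = ["Alice", "Has a | pipe"]

# html.escape() alone leaves the pipe untouched, so the row gains a column.
unescaped_row = to_markdown_row([html.escape(c) for c in cells])

# Escaping the pipe keeps the cell content inside a single column.
escaped_row = to_markdown_row([escape_cell(c) for c in cells])
```

Running html.escape() first and the pipe replacement second keeps the two concerns separate: html.escape() never produces a pipe, so the order does not change the result.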

Reproduction

RSS bug:

from markitdown import MarkItDown
import io

rss = b"""<?xml version="1.0"?>
<rss version="2.0"><channel>
<description>My Channel</description>
<item><title>Item Title</title><description>Item Desc</description></item>
</channel></rss>"""

md = MarkItDown()
result = md.convert_stream(io.BytesIO(rss), file_extension=".rss")
# Before: result.title == "Item Title" (wrong)
# After:  result.title is None (correct)

PPTX bug:

# PPTX with a table cell containing "|"
# Before: | Alice | Has a | pipe |  (3 columns, broken)
# After:  | Alice | Has a \| pipe |   (2 columns, correct)

Test plan

  • All 112 existing tests pass; the 6 remaining failures are pre-existing and caused by the missing optional Outlook dependency
  • Manually verified both bugs with reproduction scripts

Two bugs fixed:

1. RSS/Atom converter: _get_data_by_tag_name() used getElementsByTagName()
   which recursively searches all descendants. This caused item/entry-level
   metadata (title, description) to leak into channel/feed-level fields
   when the channel/feed lacked its own title or description. Added
   _get_direct_child_data() that only searches immediate child elements.
   Also fixed UnboundLocalError when channel has no title.

2. PPTX converter: table cells containing pipe characters (|) were not
   escaped, producing broken markdown tables with extra columns. Added
   pipe escaping in _convert_table_to_markdown().