feat: add 6 new converters, fix 7 bugs, add document stats utility #1769
Closed
AKIB473 wants to merge 2 commits into microsoft:main
added 2 commits
April 14, 2026 11:30
… metadata
New converters:
  - TomlConverter: converts pyproject.toml, Cargo.toml, config.toml etc. to structured Markdown
  - SitemapConverter: converts XML sitemaps (urlset + sitemapindex) to Markdown link lists
  - EnvConverter: converts .env files to redacted Markdown table (values masked by default)
Bug fixes:
  - CsvConverter: escape pipe chars and collapse newlines in cells to prevent broken Markdown tables
Enhancements:
  - HtmlConverter: add opt-in HTML metadata extraction (OG tags, author, description, keywords) via include_html_metadata=True kwarg
Tests: 16 new tests added covering all changes (all passing)

Bug fixes:
  - ipynb: render cell outputs (stdout, execute_result, errors)
  - xlsx: handle empty sheets gracefully
  - zip: surface FileConversionException as warning in output
  - pptx: fix chart table markdown separator format
  - audio: add .ogg/.flac/.aac support
  - wikipedia: strip infoboxes/navboxes by default (strip_wikipedia_infoboxes kwarg)
  - rss: extract and include item/entry link URLs
New features:
  - YamlConverter: convert .yaml/.yml to structured Markdown
  - get_document_stats(): word/char/heading/code/link/image counts
  - DocumentConverterResult.stats() method
  - RequirementsConverter: convert requirements.txt/Pipfile to Markdown table
Tests: all new tests passing
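The CsvConverter fix listed in the first commit (escaping pipe characters and collapsing newlines inside cells) is not shown on this page. A minimal sketch of the idea follows; the helper names `escape_md_cell` and `to_md_row` are hypothetical and are not taken from the PR diff.

```python
import re


def escape_md_cell(value: str) -> str:
    """Make one CSV cell safe to place inside a Markdown table row."""
    # Replace each line break (plus surrounding whitespace) with a single space,
    # because a Markdown table row must stay on one line.
    one_line = re.sub(r"\s*[\r\n]+\s*", " ", value)
    # Escape pipes so they are not parsed as column separators.
    return one_line.replace("|", "\\|")


def to_md_row(cells: list[str]) -> str:
    """Join already-parsed CSV cells into a single Markdown table row."""
    return "| " + " | ".join(escape_md_cell(c) for c in cells) + " |"


print(to_md_row(["name", "a|b", "first line\nsecond line"]))
```

Without the escaping step, a cell such as `a|b` would add an extra column and a cell containing a newline would end the row early.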
cfc281b to 7e7a98f
Author: @microsoft-github-policy-service agree
Author: Closing this PR to resubmit as smaller, focused pull requests (one per converter/fix) to make review easier. Will follow up with separate PRs shortly.
Summary
This PR adds significant new functionality to MarkItDown: 6 new file format converters, 7 bug fixes across existing converters, and a new document stats utility.
🐛 Bug Fixes (7)

1. `_ipynb_converter.py`: Cell outputs now rendered
   Cell `outputs` (stdout, execute_result, display_data, errors/tracebacks) were completely ignored. Now rendered as plain text or fenced code blocks.
2. `_xlsx_converter.py`: Empty sheets handled gracefully
   Empty sheets previously produced a broken heading with no table. Now outputs `_No data_` instead.
3. `_zip_converter.py`: Conversion failures surfaced as warnings
   `FileConversionException` was silently swallowed. Now outputs `> ⚠️ Could not convert file: {name} — {reason}` so users know what was skipped.
4. `_pptx_converter.py`: Chart table separator fixed
   Separator used `|---|---|` format which breaks some Markdown renderers. Changed to `| --- | --- |` (consistent with rest of codebase).
5. `_audio_converter.py`: Added .ogg, .flac, .aac support
   These common audio formats were silently rejected. Added to both `ACCEPTED_FILE_EXTENSIONS` and `ACCEPTED_MIME_TYPE_PREFIXES`.
6. `_wikipedia_converter.py`: Infoboxes stripped by default
   Wikipedia infoboxes, navboxes, and wikitables cluttered LLM-oriented output. Added `strip_wikipedia_infoboxes=True` kwarg (default: True) that removes `.infobox`, `.wikitable`, `.navbox`, `.metadata` elements.
7. `_rss_converter.py`: Item/entry links now extracted
   RSS `<link>` text nodes and Atom `<link href="...">` attributes were never included in output. Now rendered as `**Link:** [url](url)` per item/entry.

✨ New Converters (6)
`TomlConverter`: `.toml` files
Converts `pyproject.toml`, `Cargo.toml`, `config.toml` etc. to structured Markdown with section headers and key-value lists.

`SitemapConverter`: XML sitemaps
Converts `sitemap.xml` (urlset and sitemapindex) to Markdown link lists with lastmod/priority metadata.

`EnvConverter`: `.env` / dotenv files
Converts `.env` files to a Markdown table. Values are masked by default to prevent secrets leaking into LLM context. Pass `show_values=True` to reveal.

`YamlConverter`: `.yaml` / `.yml` files
Converts YAML files to structured Markdown: dicts as `## Section` headers, lists as bullet points, scalars as `- **key:** value`.

`RequirementsConverter`: `requirements.txt` / `Pipfile`
Converts Python dependency files to a Markdown table with columns: Package | Version Constraint | Notes (inline `# comments` extracted as notes).

`HtmlConverter` enhancement: OG/meta extraction
Added optional `include_html_metadata=True` kwarg to extract Open Graph tags, author, description, keywords as a YAML front-matter block.

🛠️ New Utility: `get_document_stats()`
Added `get_document_stats(markdown_text: str) -> dict` to the `converters` module.
Also available as the `DocumentConverterResult.stats()` method directly on any conversion result.

🧪 Tests
75 passed in 1.86s
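
The implementation of `get_document_stats()` is not shown on this page. As a rough illustration of the counts the commit message lists (words, characters, headings, code blocks, links, images), a minimal sketch might look like the following. The function name `document_stats_sketch`, the dictionary keys, and the regular expressions are all assumptions and are not taken from the PR diff.

```python
import re

FENCE = "`" * 3  # three backticks, built indirectly so this example nests cleanly

IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]*\)")        # ![alt](src)
LINK_RE = re.compile(r"(?<!!)\[[^\]]*\]\([^)]*\)")    # [text](url), not preceded by "!"
HEADING_RE = re.compile(r"#{1,6}\s+\S")               # ATX headings: "# Title"


def document_stats_sketch(markdown_text: str) -> dict:
    """Count simple structural features of a Markdown string (hypothetical keys)."""
    stats = {
        "words": len(markdown_text.split()),
        "chars": len(markdown_text),
        "headings": 0,
        "code_blocks": 0,
        "links": 0,
        "images": 0,
    }
    in_code = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            # An opening fence starts a new block; a closing fence ends it.
            if not in_code:
                stats["code_blocks"] += 1
            in_code = not in_code
            continue
        if in_code:
            # Skip code content so "# comment" lines are not counted as headings.
            continue
        if HEADING_RE.match(line):
            stats["headings"] += 1
        stats["images"] += len(IMAGE_RE.findall(line))
        stats["links"] += len(LINK_RE.findall(line))
    return stats
```

The sketch only recognises ATX headings, inline links and images, and backtick fences; reference-style links, setext headings, and tilde fences would need extra handling.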