feat: add 6 new converters, fix 7 bugs, add document stats utility by AKIB473 · Pull Request #1769 · microsoft/markitdown

AKIB473 · 2026-04-14T11:46:47Z

Summary

This PR extends MarkItDown in three areas: six additions under new converters (five new file-format converters plus an `HtmlConverter` enhancement), seven bug fixes across existing converters, and a new document stats utility.

🐛 Bug Fixes (7)

1. `_ipynb_converter.py` — Cell outputs now rendered

Cell outputs (stdout, execute_result, display_data, errors/tracebacks) were completely ignored. Now rendered as plain text or fenced code blocks.
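A minimal sketch of how the four output kinds could be rendered, assuming the standard nbformat cell layout. `render_outputs` is a hypothetical helper, not the function in this PR, and it fences every output rather than choosing between plain text and code blocks.

```python
FENCE = "`" * 3  # built at runtime so this example stays a single fenced block


def render_outputs(cell: dict) -> str:
    """Render a code cell's outputs as fenced text blocks (illustrative only)."""
    blocks = []
    for out in cell.get("outputs", []):
        kind = out.get("output_type")
        if kind == "stream":
            # stdout/stderr: "text" is a string or a list of strings
            text = "".join(out.get("text", []))
        elif kind in ("execute_result", "display_data"):
            # keep only the plain-text representation
            text = "".join(out.get("data", {}).get("text/plain", []))
        elif kind == "error":
            # "traceback" is a list of lines
            text = "\n".join(out.get("traceback", []))
        else:
            continue
        if text.strip():
            blocks.append(f"{FENCE}\n{text.rstrip()}\n{FENCE}")
    return "\n\n".join(blocks)
```

Real tracebacks usually carry ANSI color codes, which this sketch leaves untouched.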

2. `_xlsx_converter.py` — Empty sheets handled gracefully

Empty sheets previously produced a broken heading with no table. Now outputs _No data_ instead.
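For illustration, the empty-sheet guard might look like the sketch below. `sheet_to_markdown` and the `## {name}` heading are assumptions made for this example; the real converter builds its tables differently.

```python
def sheet_to_markdown(name: str, rows: list[list[str]]) -> str:
    """Render one sheet as a heading plus a table, or a placeholder if empty."""
    lines = [f"## {name}"]
    if not rows:
        # nothing to tabulate: emit the placeholder instead of a bare heading
        lines.append("_No data_")
        return "\n".join(lines)
    header, *body = rows
    lines.append("| " + " | ".join(header) + " |")
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    for row in body:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)
```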

3. `_zip_converter.py` — Conversion failures surfaced as warnings

`FileConversionException` was silently swallowed. The output now includes `> ⚠️ Could not convert file: {name} — {reason}` so users know what was skipped.
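The pattern can be sketched roughly as follows. The exception class here is a local stand-in, and `convert_members`, its `convert` callback, and the `## File:` heading are hypothetical names for this example; only the warning format is taken from the description above.

```python
class FileConversionException(Exception):
    """Local stand-in for MarkItDown's exception of the same name."""


def convert_members(members: dict, convert) -> str:
    """Convert each archive member, emitting a warning line on failure."""
    sections = []
    for name, data in members.items():
        try:
            sections.append(f"## File: {name}\n\n{convert(name, data)}")
        except FileConversionException as exc:
            # surface the failure in the output instead of dropping the file
            sections.append(f"> ⚠️ Could not convert file: {name} — {exc}")
    return "\n\n".join(sections)
```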

4. `_pptx_converter.py` — Chart table separator fixed

The separator used the `|---|---|` format, which breaks some Markdown renderers. Changed to `| --- | --- |` (consistent with the rest of the codebase).

5. `_audio_converter.py` — Added .ogg, .flac, .aac support

These common audio formats were silently rejected. Added to both ACCEPTED_FILE_EXTENSIONS and ACCEPTED_MIME_TYPE_PREFIXES.

6. `_wikipedia_converter.py` — Infoboxes stripped by default

Wikipedia infoboxes, navboxes, and wikitables cluttered LLM-oriented output. Added a `strip_wikipedia_infoboxes` kwarg (default: `True`) that removes `.infobox`, `.wikitable`, `.navbox`, and `.metadata` elements.

7. `_rss_converter.py` — Item/entry links now extracted

RSS `<link>` text nodes and Atom `<link href="...">` attributes were never included in output. Now rendered as `**Link:** [url](url)` per item/entry.
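The two lookups differ because RSS stores the URL as element text while Atom stores it in an attribute. A small sketch using the standard library's `xml.etree.ElementTree` (the helper names are hypothetical, and the PR's converter may use a different XML API):

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def rss_item_link(item: ET.Element):
    """RSS 2.0: the URL is the text of the <link> child."""
    link = item.find("link")
    if link is not None and link.text and link.text.strip():
        return link.text.strip()
    return None


def atom_entry_link(entry: ET.Element):
    """Atom: the URL is the href attribute of the namespaced <link> child."""
    link = entry.find(f"{ATOM_NS}link")
    return link.get("href") if link is not None else None


def link_line(url) -> str:
    """Format the Markdown line described above, or nothing if no URL."""
    return f"**Link:** [{url}]({url})" if url else ""
```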

✨ New Converters (6)

`TomlConverter` — `.toml` files

Converts pyproject.toml, Cargo.toml, config.toml etc. to structured Markdown with section headers and key-value lists.

`SitemapConverter` — XML sitemaps

Converts sitemap.xml (urlset and sitemapindex) to Markdown link lists with lastmod/priority metadata.
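A sketch of the idea, assuming the standard sitemap namespace and a `- [loc](loc) (metadata)` line format. The function name and output layout are illustrative guesses.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_to_markdown(xml_text: str) -> str:
    """Turn a urlset or sitemapindex document into a Markdown link list."""
    root = ET.fromstring(xml_text)
    # urlset holds <url> entries; sitemapindex holds <sitemap> entries
    tag = "sm:url" if root.tag.endswith("urlset") else "sm:sitemap"
    lines = []
    for entry in root.findall(tag, NS):
        loc = entry.findtext("sm:loc", default="", namespaces=NS).strip()
        meta = []
        for field in ("lastmod", "priority"):
            value = entry.findtext(f"sm:{field}", namespaces=NS)
            if value:
                meta.append(f"{field}: {value.strip()}")
        suffix = f" ({', '.join(meta)})" if meta else ""
        lines.append(f"- [{loc}]({loc}){suffix}")
    return "\n".join(lines)
```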

`EnvConverter` — `.env` / dotenv files

Converts .env files to a Markdown table. Values are masked by default to prevent secrets leaking into LLM context. Pass show_values=True to reveal.
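The masking behaviour could be approximated as below. Only the `show_values` flag comes from the description; the function name, the `********` mask, and the column headers are assumptions.

```python
def env_to_markdown(text: str, show_values: bool = False) -> str:
    """Render KEY=VALUE lines as a table, masking values unless asked not to."""
    rows = ["| Key | Value |", "| --- | --- |"]
    for raw in text.splitlines():
        line = raw.strip()
        # skip blanks, comments, and lines that are not assignments
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        value = value.strip().strip("'\"")
        shown = value if show_values else ("********" if value else "")
        rows.append(f"| {key.strip()} | {shown} |")
    return "\n".join(rows)
```

Using a fixed-width mask avoids leaking the length of a secret.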

`YamlConverter` — `.yaml` / `.yml` files

Converts YAML files to structured Markdown: dicts as `## Section` headers, lists as bullet points, scalars as `- **key:** value`.
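To make the mapping concrete, here is a sketch that works on already-parsed data (for example the result of a YAML loader), so it needs no third-party import. `render` is a hypothetical helper, and deepening the heading level per nesting step is an assumption.

```python
def render(data, depth: int = 2) -> list:
    """Map parsed YAML data to Markdown lines (illustrative only)."""
    lines = []
    if isinstance(data, dict):
        for key, value in data.items():
            if isinstance(value, (dict, list)):
                # nested containers get a heading, one level deeper each time
                lines.append(f"{'#' * depth} {key}")
                lines.extend(render(value, depth + 1))
            else:
                lines.append(f"- **{key}:** {value}")
    elif isinstance(data, list):
        for item in data:
            if isinstance(item, (dict, list)):
                lines.extend(render(item, depth))
            else:
                lines.append(f"- {item}")
    else:
        lines.append(str(data))
    return lines
```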

`RequirementsConverter` — `requirements.txt` / `Pipfile`

Converts Python dependency files to a Markdown table with columns: Package | Version Constraint | Notes (inline # comments extracted as notes).
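A simplified sketch of the table construction for `requirements.txt` only (Pipfile parsing is not attempted). The regex and function name are assumptions, and lines whose URLs contain `#` fragments would be split incorrectly by this naive comment handling.

```python
import re

# package name with optional extras, then whatever constraint follows
LINE_RE = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*(?:\[[^\]]*\])?)\s*(.*)$")


def requirements_to_markdown(text: str) -> str:
    """Render requirement lines as a Package / Version Constraint / Notes table."""
    rows = ["| Package | Version Constraint | Notes |", "| --- | --- | --- |"]
    for raw in text.splitlines():
        spec, _, note = raw.partition("#")  # inline comment becomes the note
        spec = spec.strip()
        if not spec or spec.startswith("-"):  # skip blanks and pip options
            continue
        match = LINE_RE.match(spec)
        if match:
            name, constraint = match.group(1), match.group(2).strip()
            rows.append(f"| {name} | {constraint} | {note.strip()} |")
    return "\n".join(rows)
```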

`HtmlConverter` enhancement — OG/meta extraction

Added an opt-in `include_html_metadata` kwarg. Passing `include_html_metadata=True` extracts Open Graph tags, author, description, and keywords as a YAML front-matter block.
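The extraction step might resemble the following sketch, built on the standard library's `html.parser` rather than whatever parser the converter itself uses. `MetaCollector` and `front_matter` are hypothetical names; values are quoted with `json.dumps` on the assumption that a JSON string is an acceptable YAML double-quoted scalar.

```python
import json
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect Open Graph and selected named <meta> tags."""

    WANTED = {"author", "description", "keywords"}

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name") or ""
        content = attr.get("content")
        if content and (key.startswith("og:") or key in self.WANTED):
            self.meta.setdefault(key, content)  # first occurrence wins


def front_matter(html: str) -> str:
    """Return a YAML front-matter block, or an empty string if nothing matched."""
    parser = MetaCollector()
    parser.feed(html)
    if not parser.meta:
        return ""
    body = "\n".join(
        f"{key}: {json.dumps(value, ensure_ascii=False)}"
        for key, value in parser.meta.items()
    )
    return f"---\n{body}\n---"
```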

🛠️ New Utility: `get_document_stats()`

Added `get_document_stats(markdown_text: str) -> dict` to the converters module:

from markitdown.converters import get_document_stats

stats = get_document_stats(result.markdown)
# {'word_count': 342, 'char_count': 2187, 'line_count': 45,
#  'heading_count': 5, 'code_block_count': 3, 'link_count': 8, 'image_count': 2}

Also available as DocumentConverterResult.stats() method directly on any conversion result.
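For a sense of how these counts could be derived, here is a regex-based approximation that returns the same keys as the example above. It is a sketch, not the PR's implementation: headings inside code blocks are counted, and only inline `[text](url)` links are recognized.

```python
import re

FENCE = "`" * 3  # built at runtime so this example stays a single fenced block


def get_document_stats(markdown_text: str) -> dict:
    """Approximate document statistics from raw Markdown text."""
    lines = markdown_text.splitlines()
    fence_lines = sum(1 for line in lines if line.lstrip().startswith(FENCE))
    images = re.findall(r"!\[[^\]]*\]\([^)]*\)", markdown_text)
    # negative lookbehind keeps image syntax out of the link count
    links = re.findall(r"(?<!!)\[[^\]]*\]\([^)]*\)", markdown_text)
    return {
        "word_count": len(markdown_text.split()),
        "char_count": len(markdown_text),
        "line_count": len(lines),
        "heading_count": sum(1 for line in lines if re.match(r"#{1,6}\s", line)),
        "code_block_count": fence_lines // 2,  # opening and closing fence per block
        "link_count": len(links),
        "image_count": len(images),
    }
```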

🧪 Tests

- 75 tests added across 2 test files
- All passing: 75 passed in 1.86s
- Full coverage of every new converter (accepts/rejects + output correctness) and every bug fix

… metadata

New converters:
- TomlConverter: converts pyproject.toml, Cargo.toml, config.toml etc. to structured Markdown
- SitemapConverter: converts XML sitemaps (urlset + sitemapindex) to Markdown link lists
- EnvConverter: converts .env files to redacted Markdown table (values masked by default)

Bug fixes:
- CsvConverter: escape pipe chars and collapse newlines in cells to prevent broken Markdown tables

Enhancements:
- HtmlConverter: add opt-in HTML metadata extraction (OG tags, author, description, keywords) via include_html_metadata=True kwarg

Tests: 16 new tests added covering all changes (all passing)

Bug fixes:
- ipynb: render cell outputs (stdout, execute_result, errors)
- xlsx: handle empty sheets gracefully
- zip: surface FileConversionException as warning in output
- pptx: fix chart table markdown separator format
- audio: add .ogg/.flac/.aac support
- wikipedia: strip infoboxes/navboxes by default (strip_wikipedia_infoboxes kwarg)
- rss: extract and include item/entry link URLs

New features:
- YamlConverter: convert .yaml/.yml to structured Markdown
- get_document_stats(): word/char/heading/code/link/image counts
- DocumentConverterResult.stats() method
- RequirementsConverter: convert requirements.txt/Pipfile to Markdown table

Tests: all new tests passing

AKIB473 · 2026-04-14T12:24:43Z

@microsoft-github-policy-service agree

AKIB473 · 2026-04-16T06:01:59Z

Closing this PR to resubmit as smaller, focused pull requests — one per converter/fix — to make review easier. Will follow up with separate PRs shortly.

Akibuzzaman Akib added 2 commits April 14, 2026 11:30

AKIB473 force-pushed the maishad/enhanced-features branch from cfc281b to 7e7a98f April 14, 2026 11:51

AKIB473 closed this Apr 16, 2026

AKIB473 wants to merge 2 commits into microsoft:main from AKIB473:maishad/enhanced-features
