feat: add 6 new converters, fix 7 bugs, add document stats utility #1769

Closed
AKIB473 wants to merge 2 commits into microsoft:main from AKIB473:maishad/enhanced-features

AKIB473 commented Apr 14, 2026

Summary

This PR adds significant new functionality to MarkItDown: 6 new file format converters, 7 bug fixes across existing converters, and a new document stats utility.


🐛 Bug Fixes (7)

1. _ipynb_converter.py — Cell outputs now rendered

Cell outputs (stdout, execute_result, display_data, errors/tracebacks) were completely ignored. Now rendered as plain text or fenced code blocks.
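A minimal sketch of how cell outputs could be rendered, assuming the standard notebook JSON layout (`outputs`, `output_type`, `text`, `data`, `traceback`). The function name `render_cell_outputs` is hypothetical and this is not the code from the PR:

```python
FENCE = "`" * 3  # built at runtime so this example nests cleanly in Markdown


def render_cell_outputs(cell):
    """Return a code cell's outputs as fenced blocks (illustrative only)."""
    blocks = []
    for output in cell.get("outputs", []):
        kind = output.get("output_type")
        if kind == "stream":
            # "text" may be a string or a list of strings; join handles both
            text = "".join(output.get("text", ""))
        elif kind in ("execute_result", "display_data"):
            text = "".join(output.get("data", {}).get("text/plain", ""))
        elif kind == "error":
            text = "\n".join(output.get("traceback", []))
        else:
            continue
        if text.strip():
            blocks.append(f"{FENCE}\n{text.rstrip()}\n{FENCE}")
    return "\n\n".join(blocks)
```

Rich outputs such as images are ignored here; only the `text/plain` representation is used.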

2. _xlsx_converter.py — Empty sheets handled gracefully

Empty sheets previously produced a broken heading with no table. Now outputs _No data_ instead.
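The empty-sheet behaviour can be illustrated without a spreadsheet library. The sketch below assumes a sheet arrives as a list of rows and that each sheet gets a `## name` heading; `sheet_to_markdown` is a hypothetical helper, not the converter's actual code:

```python
def sheet_to_markdown(name, rows):
    """Render one sheet; fall back to _No data_ when there are no cells."""
    heading = f"## {name}"
    # True for an empty list and for rows that hold only None / "" cells
    if all(cell in (None, "") for row in rows for cell in row):
        return heading + "\n\n_No data_"

    def fmt(cells):
        return "| " + " | ".join("" if c is None else str(c) for c in cells) + " |"

    header, *body = rows
    table = [fmt(header), fmt(["---"] * len(header)), *(fmt(r) for r in body)]
    return heading + "\n\n" + "\n".join(table)
```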

3. _zip_converter.py — Conversion failures surfaced as warnings

FileConversionException was silently swallowed. Now outputs > ⚠️ Could not convert file: {name} — {reason} so users know what was skipped.

4. _pptx_converter.py — Chart table separator fixed

Separator used |---|---| format which breaks some Markdown renderers. Changed to | --- | --- | (consistent with rest of codebase).
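For illustration, a small table builder that emits the spaced separator described above. The helper name `markdown_table` is an assumption made for this sketch:

```python
def markdown_table(rows):
    """Build a Markdown table whose separator row uses '| --- | --- |'."""
    header, *body = rows
    lines = ["| " + " | ".join(str(c) for c in header) + " |"]
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in body]
    return "\n".join(lines)
```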

5. _audio_converter.py — Added .ogg, .flac, .aac support

These common audio formats were silently rejected. Added to both ACCEPTED_FILE_EXTENSIONS and ACCEPTED_MIME_TYPE_PREFIXES.

6. _wikipedia_converter.py — Infoboxes stripped by default

Wikipedia infoboxes, navboxes, and wikitables cluttered LLM-oriented output. Added a strip_wikipedia_infoboxes kwarg (default: True) that removes .infobox, .wikitable, .navbox, and .metadata elements.

7. _rss_converter.py — Item/entry links now extracted

RSS <link> text nodes and Atom <link href="..."> attributes were never included in output. Now rendered as **Link:** [url](url) per item/entry.
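The two link styles differ in where the URL lives, which the following sketch makes concrete. It uses the standard library's ElementTree rather than whatever parser the converter uses, and the helper names are hypothetical:

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def rss_item_link(item):
    """RSS: the URL is the text content of <link>."""
    link = item.find("link")
    if link is None or not (link.text or "").strip():
        return None
    return link.text.strip()


def atom_entry_link(entry):
    """Atom: the URL is the href attribute of <link>."""
    for link in entry.findall(ATOM_NS + "link"):
        # rel defaults to "alternate" when the attribute is absent
        if link.get("rel", "alternate") == "alternate" and link.get("href"):
            return link.get("href")
    return None


def link_line(url):
    """Format the per-item line described in the PR text."""
    return f"**Link:** [{url}]({url})" if url else ""
```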


✨ New Converters (6)

TomlConverter: .toml files

Converts pyproject.toml, Cargo.toml, config.toml etc. to structured Markdown with section headers and key-value lists.

SitemapConverter — XML sitemaps

Converts sitemap.xml (urlset and sitemapindex) to Markdown link lists with lastmod/priority metadata.
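A rough sketch of the idea using only the standard library. It relies on the sitemaps.org 0.9 namespace; the exact output format and the name `sitemap_to_markdown` are assumptions, not taken from the PR:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_to_markdown(xml_text):
    """Turn a urlset or sitemapindex document into a Markdown link list."""
    root = ET.fromstring(xml_text)
    # urlset holds <url> children; sitemapindex holds <sitemap> children
    entries = root.findall("sm:url", NS) + root.findall("sm:sitemap", NS)
    lines = []
    for entry in entries:
        loc = entry.findtext("sm:loc", default="", namespaces=NS).strip()
        if not loc:
            continue
        meta = []
        for tag in ("lastmod", "priority"):
            value = entry.findtext(f"sm:{tag}", default="", namespaces=NS).strip()
            if value:
                meta.append(f"{tag}: {value}")
        suffix = " (" + ", ".join(meta) + ")" if meta else ""
        lines.append(f"- [{loc}]({loc}){suffix}")
    return "\n".join(lines)
```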

EnvConverter: .env / dotenv files

Converts .env files to a Markdown table. Values are masked by default to prevent secrets leaking into LLM context. Pass show_values=True to reveal.
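The masking default could work roughly as follows. Only the `show_values` parameter comes from the description above; the function name, column headers, and mask string are assumptions for this sketch:

```python
def env_to_markdown(text, show_values=False):
    """Render KEY=VALUE lines as a table, masking values unless asked not to."""
    rows = ["| Variable | Value |", "| --- | --- |"]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        value = value.strip().strip("'\"").replace("|", "\\|")
        shown = value if show_values else "********"
        rows.append(f"| {key.strip()} | {shown} |")
    return "\n".join(rows)
```

A fixed-length mask avoids leaking the length of the secret.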

YamlConverter: .yaml / .yml files

Converts YAML files to structured Markdown — dicts as ## Section headers, lists as bullet points, scalars as - **key:** value.
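To make the mapping concrete, here is a sketch that renders already-parsed YAML data (plain dicts, lists, and scalars), so it needs no YAML library. Deepening the heading level for nested dicts is an assumption, and `render_yaml_data` is a hypothetical name:

```python
def render_yaml_data(data, level=2):
    """Return Markdown lines for parsed YAML data (illustrative mapping)."""
    lines = []
    if isinstance(data, dict):
        for key, value in data.items():
            if isinstance(value, dict):
                # nested mapping becomes a section heading
                lines.append("#" * level + f" {key}")
                lines.extend(render_yaml_data(value, level + 1))
            elif isinstance(value, list):
                lines.append(f"- **{key}:**")
                lines.extend(f"  - {item}" for item in value)
            else:
                lines.append(f"- **{key}:** {value}")
    elif isinstance(data, list):
        lines.extend(f"- {item}" for item in data)
    else:
        lines.append(str(data))
    return lines
```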

RequirementsConverter: requirements.txt / Pipfile

Converts Python dependency files to a Markdown table with columns: Package | Version Constraint | Notes (inline # comments extracted as notes).
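A simplified sketch of the requirements.txt side only (Pipfile is TOML and would need separate handling). The regular expression and the name `requirements_to_markdown` are assumptions; option lines such as `-r base.txt` are skipped:

```python
import re

# package name with optional extras, followed by whatever constraint remains
LINE_RE = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*(?:\[[^\]]*\])?)\s*(.*)$")


def requirements_to_markdown(text):
    """Render requirement lines as a Package / Version Constraint / Notes table."""
    rows = ["| Package | Version Constraint | Notes |", "| --- | --- | --- |"]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "-")):
            continue  # blank lines, full-line comments, pip options
        spec, _, note = line.partition(" #")  # inline comment starts at " #"
        match = LINE_RE.match(spec.strip())
        if not match:
            continue
        name, constraint = match.group(1), match.group(2).strip()
        rows.append(f"| {name} | {constraint} | {note.strip()} |")
    return "\n".join(rows)
```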

HtmlConverter enhancement — OG/meta extraction

Added optional include_html_metadata=True kwarg to extract Open Graph tags, author, description, keywords as a YAML front-matter block.
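As an illustration of the extraction step, the sketch below collects Open Graph and a few named meta tags with the standard library's `html.parser` and emits a front-matter block. The class and function names are hypothetical, and the actual converter may use a different parser:

```python
import json
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect og:* properties plus author, description, and keywords."""

    WANTED = {"author", "description", "keywords"}

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key and content and (key.startswith("og:") or key in self.WANTED):
            self.meta.setdefault(key, content)  # first occurrence wins


def front_matter(html):
    """Return a YAML front-matter block, or an empty string if nothing matched."""
    parser = MetaCollector()
    parser.feed(html)
    if not parser.meta:
        return ""
    # json.dumps yields a double-quoted string that is also valid YAML
    body = "\n".join(
        f"{key}: {json.dumps(value, ensure_ascii=False)}"
        for key, value in parser.meta.items()
    )
    return f"---\n{body}\n---"
```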


🛠️ New Utility: get_document_stats()

Added get_document_stats(markdown_text: str) -> dict to the converters module:

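from markitdown import MarkItDown
from markitdown.converters import get_document_stats

result = MarkItDown().convert("example.docx")  # placeholder file name
stats = get_document_stats(result.markdown)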
# {'word_count': 342, 'char_count': 2187, 'line_count': 45,
#  'heading_count': 5, 'code_block_count': 3, 'link_count': 8, 'image_count': 2}

Also available as DocumentConverterResult.stats() method directly on any conversion result.
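For readers curious how such counts might be computed, here is a regex-based sketch that returns the same keys as the example output. It is not the PR's implementation; `sketch_document_stats` is a hypothetical name, and the heuristics are deliberately simple (for instance, a `#` line inside a code block still counts as a heading):

```python
import re

FENCE = "`" * 3  # built at runtime so this example nests cleanly in Markdown


def sketch_document_stats(markdown_text):
    """Approximate document statistics for a Markdown string."""
    lines = markdown_text.splitlines()
    fence_lines = [line for line in lines if line.lstrip().startswith(FENCE)]
    images = re.findall(r"!\[[^\]]*\]\([^)]*\)", markdown_text)
    # the lookbehind excludes image syntax from the link count
    links = re.findall(r"(?<!!)\[[^\]]*\]\([^)]*\)", markdown_text)
    headings = re.findall(r"^#{1,6}\s", markdown_text, flags=re.MULTILINE)
    return {
        "word_count": len(markdown_text.split()),
        "char_count": len(markdown_text),
        "line_count": len(lines),
        "heading_count": len(headings),
        "code_block_count": len(fence_lines) // 2,  # opening + closing fence
        "link_count": len(links),
        "image_count": len(images),
    }
```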


🧪 Tests

  • 75 tests added across 2 test files
  • All passing: 75 passed in 1.86s
  • Full coverage of every new converter (accepts/rejects + output correctness) and every bug fix

Akibuzzaman Akib added 2 commits April 14, 2026 11:30
… metadata

New converters:
- TomlConverter: converts pyproject.toml, Cargo.toml, config.toml etc. to structured Markdown
- SitemapConverter: converts XML sitemaps (urlset + sitemapindex) to Markdown link lists
- EnvConverter: converts .env files to redacted Markdown table (values masked by default)

Bug fixes:
- CsvConverter: escape pipe chars and collapse newlines in cells to prevent broken Markdown tables

Enhancements:
- HtmlConverter: add opt-in HTML metadata extraction (OG tags, author, description, keywords)
  via include_html_metadata=True kwarg

Tests: 16 new tests added covering all changes (all passing)

Bug fixes:
- ipynb: render cell outputs (stdout, execute_result, errors)
- xlsx: handle empty sheets gracefully
- zip: surface FileConversionException as warning in output
- pptx: fix chart table markdown separator format
- audio: add .ogg/.flac/.aac support
- wikipedia: strip infoboxes/navboxes by default (strip_wikipedia_infoboxes kwarg)
- rss: extract and include item/entry link URLs

New features:
- YamlConverter: convert .yaml/.yml to structured Markdown
- get_document_stats(): word/char/heading/code/link/image counts
- DocumentConverterResult.stats() method
- RequirementsConverter: convert requirements.txt/Pipfile to Markdown table

Tests: all new tests passing
AKIB473 force-pushed the maishad/enhanced-features branch from cfc281b to 7e7a98f April 14, 2026 11:51

AKIB473 commented Apr 14, 2026

@microsoft-github-policy-service agree

AKIB473 commented Apr 16, 2026

Closing this PR to resubmit it as smaller, focused pull requests, one per converter or fix, to make review easier. I will follow up with separate PRs shortly.

AKIB473 closed this Apr 16, 2026