
Use optional PyMuPDF when PDF text appears truncated#1883

Open
MukundaKatta wants to merge 1 commit into microsoft:main from MukundaKatta:codex/pdf-pymupdf-fallback


@MukundaKatta

## Summary
- add a lazy PyMuPDF extraction path for PDFs where pdfminer/pdfplumber return suspiciously short text
- only prefer PyMuPDF when it recovers substantially more text, preserving existing table/form formatting in normal cases
- add focused regression tests for fallback selection and stream position preservation

Fixes #1870.

## Tests
- `uv run --project packages/markitdown --extra all --with pytest pytest packages/markitdown/tests/test_pdf_memory.py -q`
- `uv run --project packages/markitdown --extra all --with pytest pytest packages/markitdown/tests/test_pdf_memory.py packages/markitdown/tests/test_pdf_masterformat.py packages/markitdown/tests/test_pdf_tables.py -q`
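
For readers who want a picture of the fallback described in the summary, here is a minimal sketch. It is not the code in this PR: the function names (`looks_truncated`, `pick_text`, `extract_with_pymupdf`) and both thresholds are hypothetical, chosen only to illustrate the three ideas (lazy import, prefer PyMuPDF only on a substantial gain, restore the stream position). The PyMuPDF calls used are `fitz.open(stream=..., filetype="pdf")` and `page.get_text()`.

```python
import io
from typing import BinaryIO, Optional

# Hypothetical thresholds for illustration; the PR may use different values.
MIN_EXPECTED_CHARS = 200   # below this, primary output is "suspiciously short"
MIN_GAIN_RATIO = 2.0       # fallback must recover at least this many times more text


def looks_truncated(text: str, min_chars: int = MIN_EXPECTED_CHARS) -> bool:
    """Return True when the primary extractor's output is suspiciously short."""
    return len(text.strip()) < min_chars


def pick_text(primary: str, fallback: Optional[str],
              min_gain_ratio: float = MIN_GAIN_RATIO) -> str:
    """Prefer the fallback only when it recovers substantially more text."""
    if not fallback:
        return primary
    primary_len = len(primary.strip())
    fallback_len = len(fallback.strip())
    # max(..., 1) avoids treating an empty primary as "any gain is infinite".
    if fallback_len > max(primary_len, 1) * min_gain_ratio:
        return fallback
    return primary


def extract_with_pymupdf(stream: BinaryIO) -> Optional[str]:
    """Try PyMuPDF on the stream; return None if it is unavailable or fails.

    The stream position is restored so later readers see it unchanged.
    """
    try:
        import fitz  # PyMuPDF, imported lazily so the dependency stays optional
    except ImportError:
        return None
    position = stream.tell()
    try:
        data = stream.read()
        with fitz.open(stream=data, filetype="pdf") as doc:
            return "\n".join(page.get_text() for page in doc)
    except Exception:
        # Any parsing failure just means "no fallback text available".
        return None
    finally:
        stream.seek(position)
```

A caller would run the existing pdfminer/pdfplumber path first, call `extract_with_pymupdf` only if `looks_truncated` is true, and pass both results to `pick_text`. Keeping the selection rule in a pure function makes it easy to test without any PDF fixtures.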
