PDF to Text, a Challenging Problem

https://news.ycombinator.com/rss Hits: 6
Summary

The search engine has recently gained the ability to index the PDF file format. The change will deploy over a few months.

Extracting text information from PDFs is a significantly bigger challenge than it might seem. The crux of the problem is that the file format isn’t a text format at all, but a graphical format. It doesn’t have text in the way you might think of it, but more of a mapping of glyphs to coordinates on “paper”. These glyphs may be rotated, overlap, and appear out of order, with very little semantic information attached to them.

You should probably be in awe at the fact that you can open a PDF file in your favorite viewer (or browser), hit ctrl+f, and search for text.

[Figure: Vertical, rotated text, next to horizontal text.]

Meanwhile the search engine preferably wants clean HTML as input.

The absolute best way of doing this these days is likely through a vision-based machine learning model, but that approach is very far from scaling to processing hundreds of gigabytes of PDF files off a single server with no GPU.

Thankfully this isn’t a completely unexplored problem. It was possible to start off with PDFBox’s PDFTextStripper class, which sort of solves the problem, but with a lot of limitations that mean it isn’t quite suitable for the search engine’s needs. It does what it says on the box: it extracts the text from a PDF with no regard for headings or other semantics, which are incredibly important relevance signals.

Following are some of the modifications made to provide a PDF-to-text extraction that is better suited to the search engine’s needs.

Identifying headings

A simple way we can look for headings is to seek a semibold or heavier line of text that is isolated from other text. This works when headings are bolded, but not all headings are bolded!

[Figure: Excerpt of the first page of "Can Education be Standardized? Evidence from Kenya", working version]

As we see in the example above, many headings rely on font size instead.

This poses a problem, as f...
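
The bold-and-isolated heuristic for headings described above could be sketched roughly as follows. This is only an illustration in plain Python, not the search engine's actual code and not PDFBox's API: the `TextLine` record, the `SEMIBOLD` threshold of 600, and the `min_gap_factor` parameter are all assumptions made for the example, and it presumes some earlier step has already grouped glyphs into lines with a known vertical position and font weight.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    """Hypothetical record for one extracted line of text."""
    text: str
    y: float          # vertical position in points, measured from the top of the page
    height: float     # line height in points
    font_weight: int  # CSS-style weight: 400 = regular, 600 = semibold, 700 = bold


SEMIBOLD = 600  # assumed cutoff for "semibold or heavier"


def find_headings(lines, min_gap_factor=1.5):
    """Return lines that are semibold or heavier and vertically isolated.

    A line counts as isolated when the distance to both its neighbours is
    at least min_gap_factor times its own height (an assumed rule of thumb).
    """
    ordered = sorted(lines, key=lambda line: line.y)
    headings = []
    for i, line in enumerate(ordered):
        if line.font_weight < SEMIBOLD:
            continue
        gap_above = line.y - ordered[i - 1].y if i > 0 else float("inf")
        gap_below = ordered[i + 1].y - line.y if i + 1 < len(ordered) else float("inf")
        min_gap = min_gap_factor * line.height
        if gap_above >= min_gap and gap_below >= min_gap:
            headings.append(line)
    return headings


sample = [
    TextLine("Introduction", y=100, height=12, font_weight=700),
    TextLine("Body text, first line", y=130, height=12, font_weight=400),
    TextLine("Body text, second line", y=144, height=12, font_weight=400),
    TextLine("A bold phrase inside the paragraph", y=158, height=12, font_weight=700),
    TextLine("Body text, last line", y=172, height=12, font_weight=400),
    TextLine("Results", y=220, height=16, font_weight=400),
]

for heading in find_headings(sample):
    print(heading.text)
```

On this made-up sample only "Introduction" is reported. The bold phrase inside the paragraph is rejected because it is not isolated, and "Results", which is set in a larger but regular-weight font, is missed entirely, which is the limitation noted above for headings that are not bolded.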

First seen: 2025-05-13 15:31

Last seen: 2025-05-13 20:32