Don't bother parsing: Just use images for RAG

https://news.ycombinator.com/rss Hits: 20
Summary

If you’ve ever tried to extract information from a complex PDF, one with charts, diagrams, and tables mixed with text, you know the pain. That invoice with a nested table showing quarterly breakdowns? The research paper whose intricate figures actually contain the key findings? The technical manual where the annotated diagrams explain more than the text ever could? Or maybe the IKEA manual with no text at all. We’ve all been there, watching our carefully crafted parsing pipeline mangle yet another document.

The industry’s dirty secret is that we’re spending enormous effort (and money) on OCR, layout detection, and parsing pipelines that still lose the information that matters most. It’s like trying to “watch” a movie by reading its script: you miss all the visual storytelling that makes it meaningful.

For example, let’s take a very simple all-text page (not scanned, no diagrams, etc.) like the one below:

Fig 1: Simple page showing Palantir financials

If I try to parse it with common OCR tools, the values might come out correct, but the headings and values all get jumbled up. Add standard chunking on top, and we might not send the correct information at retrieval time:

Q1 Financials US commercial continues to accelerate in Q1 2025 alongside AIP revolution +71% Y/Y +65% Y/Y US Commercial Revenue US Commercial Customer Count +19% Q/Q +13% Q/Q US Commercial Revenue US Commercial Customer Count +127% Y/Y $810M US Commercial Remaining Deal Value 2x Y/Y US Commercial Total Contract Value +30% Q/Q US Commercial Deals Closed of $1M or Greater +183% Y/Y US Commercial Remaining Deal Value US Commercial Total Contract Value 2025 Palantir Technologies Inc. We dene a customer as an organization from which we have recognized revenue during the trailing twelve months period.

This is possibly one of the simplest documents; we haven’t even come to complex or technical documents.
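To make the chunking problem concrete, here is a minimal sketch that splits the start of the OCR output above into fixed-size word chunks. The helper `chunk_words` and the chunk size of 17 words are assumptions chosen for illustration, not part of any particular RAG library; real pipelines typically chunk by tokens or characters, often with overlap, but the same failure mode applies.

```python
# First part of the flattened OCR output quoted above.
OCR_TEXT = (
    "Q1 Financials US commercial continues to accelerate in Q1 2025 "
    "alongside AIP revolution +71% Y/Y +65% Y/Y US Commercial Revenue "
    "US Commercial Customer Count +19% Q/Q +13% Q/Q US Commercial Revenue "
    "US Commercial Customer Count"
)


def chunk_words(text: str, size: int) -> list[str]:
    """Hypothetical naive chunker: split on whitespace, group `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Chunk size picked (an assumption) so the split is easy to see.
chunks = chunk_words(OCR_TEXT, 17)

for index, chunk in enumerate(chunks):
    print(f"chunk {index}: {chunk}")

# The "+71%" figure lands in one chunk and its "Revenue" label in the next,
# so a retriever returning only one of them loses the pairing.
value_chunk = next(c for c in chunks if "+71%" in c)
print("label travels with value:", "Revenue" in value_chunk)
```

Even before chunking, the flattened text lists both values ahead of both labels, so matching “+71% Y/Y” to “US Commercial Revenue” already depends on guessing the original layout.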
Note/Aside: You might still need to convert a document to text or a structured format; that’s essential for syncing inf...

First seen: 2025-07-21 19:38

Last seen: 2025-07-22 14:49