How OpenElections Uses LLMs

https://news.ycombinator.com/rss Hits: 19
Summary

In the 12-plus years that we’ve been turning official precinct election results into data at OpenElections, the single biggest problem has been converting pictures of results into CSV files. Many of the precinct results files we get are image PDFs, and for those there are essentially two options: data entry or Optical Character Recognition. The former has some advantages, but not many. While most people are not great at manual repetitive tasks, you can improve with lots of practice, to the point where the results are very accurate. In the past we did pay for data entry services, and while we developed working relationships with two individuals in particular, the results almost always contained some mistakes and the cost could run into the hundreds of dollars pretty quickly. For a volunteer project, it just didn’t make sense. We also used commercial OCR software, most often Able2Extract, which did pretty well, but had a harder time with PDFs that had markings or were otherwise difficult to parse. Thankfully, most election results PDFs are in one of a small handful of formats, which makes things a bit less complicated, but commercial OCR has too many restrictions. For parsing image PDFs into CSV files, Google’s Gemini is my model of choice, for two main reasons. First, the results are usually very, very accurate (with a few caveats I’ll detail below), and second, Gemini’s large context window means it’s possible to work with PDF files that can be multiple MBs in size. Here are some examples using image PDFs from Texas counties of how OpenElections uses Gemini for its work. Limestone County The Limestone County file containing its 2024 general election results isn’t too bad for an image PDF: It has clear black text on a white background without markings. But two big issues make it hard for most OCR software to deal with: the two-column layout, with results from races on the left and the right; and those annoying dots between the end of candidate values and the vote tot...

First seen: 2025-06-19 17:08

Last seen: 2025-06-20 11:23