A crude visual of the pipeline we used to classify all of Hacker News: >40M stories, 10.7B input tokens!

A Quick Primer

As an engineer and private pilot, I often find myself enjoying the surprisingly high volume of aviation-related content on Hacker News. About a year and a half ago, I started wondering whether language models could be used to answer the question of just how much of Hacker News pertains to aviation.

SLMs

I started pondering this question around the same time that LLMs became popular (2023). To me as a data practitioner, they seemed like a great tool for nuanced classification tasks in scenarios lacking a labeled ground-truth dataset. But the larger models were too expensive, and the tooling too unwieldy, to use them for offline data processing tasks.

In general, language models are an abstraction for data practitioners, allowing us to easily perform on-the-fly unstructured data analysis without the pains that come with traditional ML, even if that comes at a higher cost (a data scientist's or ML engineer's time is almost inarguably more valuable!). Increasingly, smaller pre-trained models are getting more performant and customizable, and the compute costs to run them keep decreasing. Inference cost viability is not nearly as much of a concern as appropriate tooling.

Down the Rabbit Hole

At Skysight, we've spent much of the past year optimizing an end-to-end tooling layer to solve these types of data-and-compute-intensive problems. We will have more specifics on that later, but feel free to reach out if you are looking for infrastructure to solve similar problems.

The Pipeline

Without further ado, we'll explain the pipeline we used to perform our analysis (visualized at the top).

Data Gathering and LLM Pre-Processing: Hacker News offers a free API that can be used to gather all historical posts. We heavily parallelized the fetching of this data and stored the contents in a Cloudflare R2 bucket, distributed across >900 Parquet files.
This left us with abo...
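
To make the fetching step concrete, here is a minimal sketch of how a parallel pull from the public Hacker News API might look. It is not the code we ran: the function names, chunk size, worker count, and the choice to keep only items of type "story" are assumptions for illustration. Writing each chunk to Parquet and uploading it to R2 is left out.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Public Hacker News API base URL (items live at /item/<id>.json).
BASE_URL = "https://hacker-news.firebaseio.com/v0"


def fetch_item(item_id):
    """Fetch a single item; the API returns JSON null (None) for missing ids."""
    with urlopen(f"{BASE_URL}/item/{item_id}.json", timeout=10) as resp:
        return json.load(resp)


def chunk_ids(first_id, last_id, chunk_size):
    """Split an inclusive id range into contiguous chunks (e.g. one per output file)."""
    return [
        range(start, min(start + chunk_size, last_id + 1))
        for start in range(first_id, last_id + 1, chunk_size)
    ]


def fetch_chunk(ids, fetch=fetch_item, max_workers=32):
    """Fetch every id in a chunk concurrently and keep only stories.

    `fetch` is injectable so the function can be exercised without network access.
    The worker count is an arbitrary illustrative default.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        items = list(pool.map(fetch, ids))
    # Assumption: drop missing items and non-story types (comments, jobs, polls).
    return [item for item in items if item is not None and item.get("type") == "story"]
```

Each chunk returned by `chunk_ids` can be handed to `fetch_chunk` independently, so the work spreads naturally across processes or machines, with one output file per chunk.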