650GB of Data (Delta Lake on S3). Polars vs. DuckDB vs. Daft vs. Spark

https://news.ycombinator.com/rss Hits: 18
Summary

I recently tried to light the tinder for what I hoped would be a revolt, the Single Node Rebellion, but, of course, it sputtered out immediately. Truth be told, it was one of the most popular articles I’ve written in some time, purely based on the stats.

The fact that I even sold t-shirts tells me I have birthed a few acolytes into this troubled Lake House world.

Without rehashing the entire article, it’s clear that there is what I would call “cluster fatigue.” We all know it, but never talk about it … much … running SaaS Lake Houses is expensive, emotionally and financially. All well and good during the peak Covid days when we had our mini dot-com bubble, but the air has gone out of that one.

Not only is it not cheap to crunch 650 GB of data on a Spark cluster (piling up DBUs, truth be told), but it’s not complicated either; they’ve made it easy to spend money. Especially when you simply don’t need a cluster anymore for most datasets and workloads.

Sure, in the days of Pandas, when that was our only non-Spark option, we didn’t have a choice, but DuckDB, Polars, and Daft (also known as D.P.D. because why not) … have laid that argument to rest in a shallow grave.

Sometimes I feel like I must overcome skepticism with a little bit of show-and-tell; the proof is in the pudding, as they say. If you want proof, I will provide it.

Look, it ain’t always easy, but it’s always rewarding.

Thanks for reading Data Engineering Central! This post is public so feel free to share it.

We have two options on the table: Distributed and Not-Distributed. Like Neo, you have to choose which pill to take. Ok, maybe you can take both pills, but whatever.

Our minds have been overrun by so much marketing hype that we are like Neo stuck in The Matrix. We just need some help to escape.

I’m going to shove that red pill down your throat. Open up, buttercup.

Into the breach, my friends, let’s get to it.

First seen: 2025-11-13 23:49

Last seen: 2025-11-14 16:52