Reproducing the Deep Double Descent paper

Summary

reproducing deep double descent
04 Jun, 2025

This summer, I've been at the Recurse Center intensively trying to catch up to the current state of the machine learning world. I don't have any prior background in ML, so I've been taking some classes and reading a lot of papers. Two weeks in, I now have some basic working knowledge and wanted to get my hands dirty. After reading the Deep Double Descent paper, I wanted to see if I understood it well enough to reproduce the results. In a previous post, I went over some notes about doing the training for this on a rental GPU, but I figured I'd go into detail about the project itself. Please note that my understanding here is still that of a student - if you spot something wrong, please send me a message!

double descent background

For a long time, the ML community thought that models could only get so big before they started degrading in accuracy. Around the start of the GPT era, folks realized that you could get better test-time results from a model just by training it for much, much longer. In 2019, folks at OpenAI and Harvard wrote a paper that tries to formalize this effect and also goes into how model size can impact results, i.e. model-wise double descent, where bigger models are eventually better. The phrase "double descent" refers to this behavior where test error gets better at first, then peaks much worse, then eventually comes back down again.

intuitions

Models are trained on a training set and then evaluated against a separate test set. When a model is really good on the training set but really bad on the test set, we say it generalizes poorly. Imagine: the model memorized the multiple-choice answers from the homework, but that doesn't help it take the final exam. It's not super clear why this happens, but here's the rough intuition I came away with. With smaller models, the model can do its best to approximate the right behavior for test time but just doesn't have enough "brain cells" (parameters) to fit the whole problem in its head…
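To make the model-wise version of this concrete, here is a minimal sketch of the kind of sweep involved: train one model per width for a fixed budget, then record train and test error for each. This is not the paper's exact setup - `make_resnet18(width)` is a hypothetical helper for building a width-scaled ResNet18, and the data loaders (e.g. CIFAR-10 with some label noise) are assumed to exist.

```python
# Sketch of a model-wise double descent sweep (assumptions noted above, not the paper's exact recipe).
import torch
import torch.nn.functional as F

def error_rate(model, loader, device):
    """Fraction of examples the model gets wrong on a data loader."""
    model.eval()
    wrong, total = 0, 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            preds = model(x).argmax(dim=1)
            wrong += (preds != y).sum().item()
            total += y.numel()
    return wrong / total

def sweep_widths(widths, epochs, train_loader, test_loader, device="cuda"):
    """Train one model per width for a fixed budget and record (width, train error, test error)."""
    results = []
    for k in widths:
        model = make_resnet18(width=k).to(device)  # hypothetical width-scaled ResNet18 factory
        opt = torch.optim.Adam(model.parameters(), lr=1e-4)
        for _ in range(epochs):
            model.train()
            for x, y in train_loader:
                x, y = x.to(device), y.to(device)
                loss = F.cross_entropy(model(x), y)
                opt.zero_grad()
                loss.backward()
                opt.step()
        results.append((k,
                        error_rate(model, train_loader, device),
                        error_rate(model, test_loader, device)))
    return results
```

Plotting test error against width from `results` is what should show the double descent shape: error falling, spiking near the point where the model can just barely fit the training set, then falling again as the models get bigger.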

First seen: 2025-06-05 22:03

Last seen: 2025-06-05 22:03