We collected 10k hours of neuro-language data in our basement

Over the last 6 months, we collected ~10k hours of data across thousands of unique individuals. As far as we know, this is the largest neuro-language dataset in the world. See here, here, here, here, and here (discussion only, no data available) for some of the larger datasets. See recent papers discussing the problem of small datasets here, here, and here.

Why did we do this? We train thought-to-text models. That is, we train models to decode semantic content from noninvasive neural data. Here are some entirely zero-shot examples. The neural data is taken from the seconds leading up to, but not including, the time when the subject typed or spoke, meaning that the model detects an idea before the subject even compiles that idea down into words.

Ground truth                           | Model prediction (based ONLY on neural data)
the room seemed colder                 | there was a breeze, even a gentle gust
do you have a favorite app or website  | do you have any favorite robot
then she smiled faintly and nodded     | she shrugged, hoping to look indifferent

All examples are zero-shot to new subjects, whom the model has never seen before. We'll write about the model in a future post.

But before you can train a model that generalizes to new people, you need many thousands of hours of data. When we started, the existing datasets were either inapplicable or tiny. Most were in the low hundreds of hours (if that), and most had tens or, at a stretch, hundreds of subjects. So we got thousands of people to come wear headsets in our basement.

This post is about how we collected our dataset: what participants do, the hardware and software involved, and what we learned about operations and ML when we scaled it up.

What participants actually do

A participant comes in, signs a consent form, and sits down in a booth. A session manager fits a headset onto them and starts the session. Then the participant has a freeform conversation with an LLM for two hours. Sessions vary.
Some are listening and speaking with an LLM, and some are ...
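The windowing described earlier (neural data from the seconds leading up to, but not including, the moment the subject typed or spoke) can be sketched as below. This is a minimal illustration, not the authors' actual pipeline: the 5-second window length, sampling rate, and (channels, samples) array layout are all assumptions.

```python
import numpy as np

def pre_event_window(signal, event_sample, sr_hz, window_s=5.0):
    """Slice the neural signal covering the seconds leading up to,
    but not including, the event (e.g. the onset of typing or speech).

    signal: (channels, samples) array of neural data (assumed layout).
    event_sample: sample index at which the event begins.
    sr_hz: sampling rate in Hz.
    window_s: window length in seconds (illustrative choice).
    """
    start = max(0, event_sample - int(window_s * sr_hz))
    # Slice stops just before event_sample, so the event itself is excluded.
    return signal[:, start:event_sample]

# Example: 8 channels, 10 s of data at 256 Hz, event onset at t = 6 s.
sr = 256
sig = np.random.randn(8, 10 * sr)
win = pre_event_window(sig, event_sample=6 * sr, sr_hz=sr)
print(win.shape)  # (8, 1280): the 5 s of data immediately before the event
```

Pairing each such window with the text the subject then produced yields (neural window, text) training examples.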
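"Zero-shot to new subjects" means held-out subjects contribute no data to training, so the split must be made at the subject level rather than the session level. A minimal sketch of such a split, assuming a hypothetical record schema with a "subject_id" key:

```python
import random

def subject_level_split(records, held_out_frac=0.2, seed=0):
    """Split session records so that no subject appears in both
    train and test; the test set is therefore zero-shot to new people.

    records: list of dicts with a "subject_id" key (hypothetical schema).
    held_out_frac: fraction of *subjects* (not sessions) held out.
    """
    subjects = sorted({r["subject_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * held_out_frac))
    test_subjects = set(subjects[:n_test])
    train = [r for r in records if r["subject_id"] not in test_subjects]
    test = [r for r in records if r["subject_id"] in test_subjects]
    return train, test

# 50 sessions spread across 10 subjects.
sessions = [{"subject_id": f"s{i % 10}", "hours": 2} for i in range(50)]
train, test = subject_level_split(sessions)
# No subject overlaps between the two splits:
assert {r["subject_id"] for r in train}.isdisjoint({r["subject_id"] for r in test})
```

Splitting by subject rather than by session is what prevents the model from "recognizing" a person it saw during training and makes the reported examples genuinely zero-shot.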