Neural audio codecs: how to get audio into LLMs

Summary

Václav Volhejn

Thank you for the valuable feedback on the drafts: Chung-Ming Chien, Moritz Boehle, Richard Hladík, Eugene Kharitonov, Patrick Perez, and Tom Sláma. I'd also like to thank the rest of the Kyutai team for the research discussions without which this article could not exist.

The plan: sandwich a language model in an audio encoder/decoder pair (= neural audio codec), allowing it to predict audio continuations.

As of October 2025, speech LLMs suck. Many LLMs have voice interfaces, but they usually work by transcribing your speech, generating the answer in text, and using text-to-speech to read the response out loud. That's perfectly fine in many cases (see Unmute), but it's a wrapper, not real speech understanding. The model can't hear the frustration in your voice and respond with empathy, it can't emphasize important words in its answer, it can't sense sarcasm, and so on.

Yes, there are LLMs (Gemini, ChatGPT's Advanced Voice Mode, Qwen, Moshi) that understand and generate speech natively. But in practice, they're either not as smart, or they behave like text model wrappers. Try asking any of them "Am I speaking in a low voice or a high voice?" in a high-pitched voice, and they won't be able to tell you.

Clearly, speech LLMs lag behind text LLMs. But why? For text, we found out a few years ago that if you take a lot of text data, a big Transformer, and a lot of GPUs, you'll get some pretty damn good text continuation models. Why can't we just replace text with audio and get pretty damn good speech continuation models? As a teaser, here's what happens when you try to do that naively (warning, loud): [audio example]

We'll have a look at why audio is harder to model than text and how we can make it easier with neural audio codecs, the de facto standard way of getting audio into and out of LLMs. With a codec, we can turn audio into larger discrete tokens, train models to predict continuations for these tokens, and then decode the predicted tokens back into audio.
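To get a feel for why naive sample-level modeling fails, here is a back-of-the-envelope comparison of sequence lengths. The rates are rough assumptions for illustration (24 kHz audio, about 3 text tokens per second of speech, a low-bitrate neural codec around 12.5 frames per second), not numbers from the article:

```python
# Back-of-the-envelope sketch: sequence length for one minute of content.
# All rates below are rough assumptions, not measurements.

SAMPLE_RATE = 24_000          # audio samples per second (common for speech)
TEXT_TOKENS_PER_SEC = 3       # ~180 words/min of speech, ~1 token/word (rough)
CODEC_FRAMES_PER_SEC = 12.5   # assumed frame rate of a low-bitrate neural codec

seconds = 60
raw_samples = SAMPLE_RATE * seconds            # 1,440,000 positions if we model raw samples
text_tokens = TEXT_TOKENS_PER_SEC * seconds    # ~180 positions for the same minute as text
codec_frames = CODEC_FRAMES_PER_SEC * seconds  # 750 positions after a neural codec

print(f"raw waveform : {raw_samples:,} positions")
print(f"text         : {text_tokens:,} positions")
print(f"codec tokens : {codec_frames:,.0f} positions")
```

Under these assumptions, a minute of raw waveform is roughly four orders of magnitude longer than the same minute as text, which is what the codec is there to close.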

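And here is a minimal sketch of the sandwich itself, with toy stand-ins for the real components: the "codec" is just downsampling plus 8-bit quantization, and the "LM" repeats its last token. The names (codec_encode, codec_decode, lm_generate, continue_audio) are hypothetical; a real system would use a learned codec such as EnCodec or Mimi and a Transformer:

```python
import numpy as np

HOP = 320      # toy "codec" keeps 1 sample in 320 (75 frames/sec at 24 kHz)
VOCAB = 256    # toy token vocabulary: 8-bit quantization levels

def codec_encode(waveform: np.ndarray) -> np.ndarray:
    """Toy stand-in for a neural encoder: downsample and quantize to token ids."""
    frames = waveform[::HOP]
    return np.clip((frames + 1.0) * (VOCAB / 2), 0, VOCAB - 1).astype(np.int64)

def codec_decode(tokens: np.ndarray) -> np.ndarray:
    """Toy stand-in for a neural decoder: dequantize and upsample back to audio."""
    frames = tokens.astype(np.float32) / (VOCAB / 2) - 1.0
    return np.repeat(frames, HOP)

def lm_generate(prompt: np.ndarray, n_new: int) -> np.ndarray:
    """Toy stand-in for an autoregressive LM: just repeats the last token."""
    return np.concatenate([prompt, np.full(n_new, prompt[-1], dtype=prompt.dtype)])

def continue_audio(waveform: np.ndarray, n_new_tokens: int) -> np.ndarray:
    tokens = codec_encode(waveform)                   # audio -> discrete tokens
    continuation = lm_generate(tokens, n_new_tokens)  # predict in token space
    return codec_decode(continuation)                 # tokens -> audio

# One second of a 440 Hz sine at 24 kHz, extended by ~0.5 s of predicted tokens.
t = np.arange(24_000) / 24_000
audio = np.sin(2 * np.pi * 440 * t).astype(np.float32)
longer = continue_audio(audio, n_new_tokens=38)       # 38 frames ~ 0.5 s at 75 Hz
```

The point of the structure, not the toy internals: the language model never touches waveforms, only discrete tokens, so the same next-token machinery that works for text can be reused for audio.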