## Voice: the original UI

Voice was humanity's first interface. Long before writing or typing, it let us share ideas, coordinate work, and build relationships. As digital systems become more capable, voice is returning as our most natural form of human-computer interaction.

Yet today's systems remain limited: unreliable, proprietary, and too brittle for real-world use. Closing this gap demands tools with exceptional transcription, deep understanding, multilingual fluency, and open, flexible deployment.

We release the Voxtral models to accelerate this future. These state-of-the-art speech understanding models come in two sizes: a 24B variant for production-scale applications and a 3B variant for local and edge deployments. Both are released under the Apache 2.0 license. Both models are also available on our API, alongside a highly optimized transcription-only endpoint that delivers unparalleled cost-efficiency (hedged usage sketches follow at the end of this section).

## Open, affordable, and production-ready speech understanding for everyone

Until recently, getting truly usable speech intelligence into production meant choosing between two trade-offs:

- Open-source ASR systems with high word error rates and limited semantic understanding
- Closed, proprietary APIs that combine strong transcription with language understanding, but at significantly higher cost and with less control over deployment

Voxtral bridges this gap. It offers state-of-the-art accuracy and native semantic understanding in the open, at less than half the price of comparable APIs. This makes high-quality speech intelligence accessible and controllable at scale.

Both Voxtral models go beyond transcription, with capabilities that include:

- Long-form context: with a 32k-token context length, Voxtral handles audio up to 30 minutes for transcription, or 40 minutes for understanding (a back-of-envelope sketch of this token budget follows the list)
- Built-in Q&A and summarization: ask questions directly about the audio content, or generate structured summaries, without the need to chain separate ASR and language models (a hedged API sketch follows below)
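As a rough sanity check on the long-form numbers above, the stated limits imply an effective audio token rate. The assumption that a maximum-length clip fills the whole 32k window is ours, not the announcement's:

```python
# Back-of-envelope: how the 32k context maps to audio length.
# Assumes (not stated in this post) that a maximum-length clip
# roughly fills the context window.

CONTEXT_TOKENS = 32_000
TRANSCRIPTION_MINUTES = 30
UNDERSTANDING_MINUTES = 40

# Transcription limit: ~17.8 audio tokens per second of audio
print(f"~{CONTEXT_TOKENS / (TRANSCRIPTION_MINUTES * 60):.1f} tokens/sec (transcription)")

# Understanding mode stretches the same window over 40 minutes,
# implying a lower effective rate of ~13.3 tokens per second
print(f"~{CONTEXT_TOKENS / (UNDERSTANDING_MINUTES * 60):.1f} tokens/sec (understanding)")
```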
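To illustrate the built-in Q&A and summarization flow, here is a minimal sketch of sending audio plus a question in one chat request, instead of chaining a separate ASR model into an LLM. The endpoint path, model id, and `input_audio` content-part schema are assumptions modeled on common chat-completion conventions; consult the official API documentation for the exact shapes.

```python
# Sketch: ask a question about an audio clip in a single chat call.
# Endpoint, model id, request schema, and response shape are assumptions.
import base64
import os

import requests

API_KEY = os.environ["MISTRAL_API_KEY"]

with open("meeting.mp3", "rb") as f:  # hypothetical local file
    audio_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "voxtral-small-latest",  # assumed model id
    "messages": [{
        "role": "user",
        "content": [
            {"type": "input_audio", "input_audio": audio_b64},  # assumed schema
            {"type": "text", "text": "Summarize the key decisions in this meeting."},
        ],
    }],
}
resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",  # assumed path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])  # assumed response shape
```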
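For the cost-optimized transcription-only endpoint, a similarly hedged sketch using the standard `requests` library; the URL, model name, and response field are assumptions, not confirmed API details.

```python
# Sketch: call a hosted transcription endpoint with multipart upload.
# URL, model name, and response shape are assumptions; check the API docs.
import os

import requests

API_KEY = os.environ["MISTRAL_API_KEY"]

with open("meeting.mp3", "rb") as f:  # hypothetical local file
    resp = requests.post(
        "https://api.mistral.ai/v1/audio/transcriptions",  # assumed path
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": f},
        data={"model": "voxtral-mini-latest"},  # assumed model id
        timeout=300,
    )
resp.raise_for_status()
print(resp.json().get("text", resp.json()))  # assumed 'text' field
```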
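And because both checkpoints ship under Apache 2.0, pulling the 3B variant for local or edge experiments can be as simple as a Hugging Face snapshot download; the repo id below is an assumption, so verify it on the model hub.

```python
# Sketch: fetch the open weights for local use with huggingface_hub
# (a real library); the repo id is an assumption.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="mistralai/Voxtral-Mini-3B-2507")
print(local_dir)  # path to the downloaded checkpoint
```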