Cerebras achieves 2,500T/s on Llama 4 Maverick (400B)

https://news.ycombinator.com/rss Hits: 6

Summary

Cerebras Breaks the 2,500 Tokens Per Second Barrier with Llama 4 Maverick 400B

SUNNYVALE, CA – May 28, 2025 – Last week, Nvidia announced that 8 Blackwell GPUs in a DGX B200 could demonstrate 1,000 tokens per second (TPS) per user on Meta’s Llama 4 Maverick. Today, the same independent benchmark firm, Artificial Analysis, measured Cerebras at more than 2,500 TPS/user, more than doubling the performance of Nvidia’s flagship solution.

“Cerebras has beaten the Llama 4 Maverick inference speed record set by NVIDIA last week,” said Micah Hill-Smith, Co-Founder and CEO of Artificial Analysis. “Artificial Analysis has benchmarked Cerebras’ Llama 4 Maverick endpoint at 2,522 tokens per second, compared to NVIDIA Blackwell’s 1,038 tokens per second for the same model. We’ve tested dozens of vendors, and Cerebras is the only inference solution that outperforms Blackwell for Meta’s flagship model.”

With today’s results, Cerebras has set a world record for LLM inference speed on the 400B parameter Llama 4 Maverick model, the largest and most powerful in the Llama 4 family. Artificial Analysis tested multiple other vendors, and the results were as follows: SambaNova 794 t/s, Amazon 290 t/s, Groq 549 t/s, Google 125 t/s, and Microsoft Azure 54 t/s.

Andrew Feldman, CEO of Cerebras Systems, said, “The most important AI applications being deployed in enterprise today – agents, code generation, and complex reasoning – are bottlenecked by inference latency. These use cases often involve multi-step chains of thought or large-scale retrieval and planning, with generation speeds as low as 100 tokens per second on GPUs, causing wait times of minutes and making production deployment impractical. Cerebras has led the charge in redefining inference performance across models like Llama, DeepSeek, and Qwen, regularly delivering over 2,500 TPS/user.”

With its world record performance, Cerebras is the optimal solution for Llama 4 in any deployment scenario.
Not only is Cerebras Inference the first and onl...
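
To put the quoted figures in perspective, the following is a minimal sketch that computes the throughput ratios from the numbers reported above and estimates how long a response would take to generate at a given speed. The throughput values are taken from the press release; the 10,000-token response length is a hypothetical assumption chosen only for illustration, and the estimate ignores time to first token, queuing, and network latency.

```python
# Per-user throughput in tokens/s, as quoted in the press release above.
RESULTS = {
    "Cerebras": 2522,
    "NVIDIA Blackwell": 1038,
    "SambaNova": 794,
    "Groq": 549,
    "Amazon": 290,
    "Google": 125,
    "Microsoft Azure": 54,
}


def speedup(fast: str, slow: str) -> float:
    """Ratio of the two vendors' reported tokens/s figures."""
    return RESULTS[fast] / RESULTS[slow]


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Pure generation time; ignores time to first token and network latency."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    # Hypothetical response length (an assumption, not from the article).
    response_tokens = 10_000

    for vendor in RESULTS:
        ratio = speedup("Cerebras", vendor)
        seconds = generation_seconds(response_tokens, RESULTS[vendor])
        print(f"{vendor:18s} {RESULTS[vendor]:5d} t/s  "
              f"Cerebras is {ratio:5.2f}x  {seconds:7.1f} s per response")
```

Under these assumptions, the Cerebras figure is about 2.43 times the Blackwell figure, which is consistent with the "more than doubling" claim. At the 100 tokens per second mentioned in the Feldman quote, the hypothetical 10,000-token response would take 100 seconds of generation time.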

First seen: 2025-05-31 07:26

Last seen: 2025-05-31 12:27
