The day an open source model like OpenAI's new gpt-oss-120b is released, we race to make it as performant as possible for our customers. As a launch partner for OpenAI's first open-source LLM since 2019, we wanted to give developers a great experience with the new models. By the end of launch day, we were the clear leader among providers running on NVIDIA GPUs for both latency and throughput, according to public data from real-world usage on OpenRouter.

What matters is having the inference optimization muscle to immediately push on latency and throughput

Optimizing performance on a new model is a substantial engineering challenge. Thanks to our flexible inference stack and the collective expertise of our model performance engineering team, we are able to roll out performance improvements by the hour on new models. In fact, in the time it took to write this blog post, we added another 100 tokens per second while maintaining 100% uptime.

[Figure: OpenRouter performance for GPT OSS, 6:45 PM August 6, 2025]

Our model performance efforts included:

- Testing and benchmarking across inference frameworks (TensorRT-LLM, vLLM, and SGLang)
- Ensuring compatibility with Hopper and Blackwell GPU architectures
- Integrating with key pieces of our inference stack, including NVIDIA Dynamo
- Layering in our favorite performance optimizations, like KV cache-aware routing and speculative decoding with Eagle

Below are the steps we took to achieve our goal of SOTA performance with full context window support.

Step 1: Running first inference

The first step is getting baseline inference running however possible. Running inference on a model requires support at the inference framework, hardware architecture, and model server levels.

Inspired by GPUs, we parallelized this effort across multiple engineers: one tried vLLM, another SGLang, and a third worked on TensorRT-LLM. We quickly got TensorRT-LLM working, which was fortunate, as it is usually the most performant inference framework for LLMs.

NVIDIA cut a dev release of TensorRT-LL...
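To make "running first inference" concrete, here is a minimal smoke-test sketch using vLLM's offline engine, one of the three frameworks we benchmarked. The model id, GPU count, and sampling settings are illustrative assumptions, not our production configuration.

```python
# Minimal first-inference smoke test with vLLM.
# Assumed: the "openai/gpt-oss-120b" Hugging Face id and an
# 8-way tensor-parallel layout; neither is confirmed by this post.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",  # assumed model id
    tensor_parallel_size=8,       # assumed GPU sharding for a 120B model
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Say hello in one sentence."], params)

# Each RequestOutput holds the generated completions for one prompt.
print(outputs[0].outputs[0].text)
```

A first pass with SGLang or TensorRT-LLM follows the same shape: load the checkpoint, send a trivial prompt, and confirm coherent tokens come back before any tuning begins.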