Tokasaurus: An LLM Inference Engine for High-Throughput Workloads

https://news.ycombinator.com/rss Hits: 21
Summary

TL;DR

We’re releasing Tokasaurus, a new LLM inference engine optimized for throughput-intensive workloads. With small models, Tokasaurus benefits from very low CPU overhead and dynamic Hydragen grouping to exploit shared prefixes. For larger models, Tokasaurus supports async tensor parallelism for GPUs with NVLink and a fast implementation of pipeline parallelism for GPUs without it. On throughput-focused benchmarks, Tokasaurus can outperform vLLM and SGLang by up to 3x or more.

Intro

As LLMs get smarter, faster, and cheaper, the community keeps finding new ways to use them. Our own recent work has explored using models to scan every file in a codebase, sample 10,000 attempts for math and code problems, and collaborate with other models to minimize cloud costs. Inference is now also an important part of the training process, where we use models to generate synthetic data or as part of RL pipelines that generate and train on model completions.

Crucially, these new inference workloads look quite different from the original LLM use case of serving a chatbot. Here, we care primarily about the total time and cost required to complete a large batch of sequences, and much less (if at all) about the latency of any individual generation. In other words, we want high throughput!

Open-source inference engines (i.e. dedicated systems for running efficient LLM inference) like FlexGen, vLLM, and SGLang have been enormously valuable to the community. Inspired by and learning from these projects, we built a new engine, Tokasaurus, designed from the ground up to handle throughput-focused workloads. We’ve optimized Tokasaurus to serve large and small models alike efficiently, allowing it to outperform existing engines on throughput benchmarks. In the rest of this blog, we’ll walk through some of these optimizations and show off a few settings where Tokasaurus really shines.

Optimizing Small Mode...
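As a rough illustration of the shared-prefix idea behind Hydragen-style grouping (not Tokasaurus’s actual implementation), the NumPy sketch below splits attention into a shared-prefix part, which can be computed once per group of sequences, and a per-sequence suffix part, then merges the two with log-sum-exp weights. The helper names attn_with_lse and combine are hypothetical.

import numpy as np

def attn_with_lse(q, k, v):
    # Single-query attention that also returns the log-sum-exp of the scores.
    # q: (d,), k: (n, d), v: (n, d)
    scores = k @ q / np.sqrt(q.shape[-1])        # (n,)
    m = scores.max()
    w = np.exp(scores - m)
    lse = m + np.log(w.sum())
    out = (w / w.sum()) @ v                      # (d,)
    return out, lse

def combine(out_a, lse_a, out_b, lse_b):
    # Merge two partial attention results using their log-sum-exp weights;
    # this equals attention over the concatenated key/value sets.
    lse = np.logaddexp(lse_a, lse_b)
    return np.exp(lse_a - lse) * out_a + np.exp(lse_b - lse) * out_b

# Illustrative sketch: shared prefix KV (reusable across a group of sequences)
# plus a small per-sequence suffix KV.
rng = np.random.default_rng(0)
d, prefix_len, suffix_len = 64, 128, 16
k_pre, v_pre = rng.normal(size=(prefix_len, d)), rng.normal(size=(prefix_len, d))

for _ in range(4):  # four decoding sequences that share the same prefix
    q = rng.normal(size=d)
    k_suf, v_suf = rng.normal(size=(suffix_len, d)), rng.normal(size=(suffix_len, d))
    out_pre, lse_pre = attn_with_lse(q, k_pre, v_pre)   # shared-prefix part
    out_suf, lse_suf = attn_with_lse(q, k_suf, v_suf)   # per-sequence part
    out = combine(out_pre, lse_pre, out_suf, lse_suf)

    # Reference: ordinary attention over the concatenated prefix + suffix KV.
    ref, _ = attn_with_lse(q, np.vstack([k_pre, k_suf]), np.vstack([v_pre, v_suf]))
    assert np.allclose(out, ref, atol=1e-6)

Because the prefix part is identical for every sequence in the group, an engine can batch it as one large matrix product instead of recomputing it per sequence, which is where the throughput win comes from.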

First seen: 2025-06-05 22:03

Last seen: 2025-06-06 18:08