Show HN: Tokenflood – simulate arbitrary loads on instruction-tuned LLMs

Source: https://news.ycombinator.com/rss
Summary

Tokenflood is a load testing tool for instruction-tuned LLMs that lets you run arbitrary load profiles without needing specific prompt and response data. Define the desired prompt lengths, prefix lengths, output lengths, and request rates, and Tokenflood simulates that workload for you. This makes it easy to explore how latency changes when using different providers, hardware, quantizations, or prompt and output lengths. Tokenflood uses litellm under the hood and supports all providers that litellm covers.

Caution: Tokenflood can generate high costs if configured poorly and used with pay-per-token services. Make sure you only test workloads that are within a reasonable budget. See the safety section for more information.

Common Usage Scenarios

- Load testing self-hosted LLMs.
- Assessing the effects of hardware, quantization, and prompt optimizations on latency, throughput, and costs.
- Assessing the intraday latency variations of hosted LLM providers for your load types.
- Assessing and choosing a hosted LLM provider before going into production with them.

Example: Assessing the effects of prompt optimizations upfront

Here is an example of exploring how prompt parameters affect latency and throughput. The following graphs depict different load scenarios; together they show the impact of hypothetical improvements to the prompt parameters. The first graph represents the base case, our current prompt parameters: ~3000 input tokens, of which ~1000 are a common prefix that can be cached, and ~60 output tokens. The graphs show the mean latency along with the 50th, 90th, and 99th percentile latencies; these percentile lines indicate the latency below which 50%, 90%, and 99% of LLM requests completed. When designing latency-sensitive systems, it is important to understand the distribution, not just the average.
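Percentile latencies of this kind can be computed from raw per-request timings. A minimal sketch using Python's standard library (the latency samples are illustrative values, not Tokenflood output):

```python
import statistics

# Per-request latencies in milliseconds (made-up sample data).
latencies_ms = [1650, 1700, 1720, 1800, 1950, 2100, 2400, 2700, 3100, 3900]

# quantiles(..., n=100) returns the 99 percentile cut points;
# index k-1 holds the k-th percentile. 'inclusive' treats the
# samples as covering the full observed range.
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p90, p99 = cuts[49], cuts[89], cuts[98]
print(f"p50={p50:.0f}ms p90={p90:.0f}ms p99={p99:.0f}ms")
```

Comparing p50 against p99 across runs is what reveals the tail behavior the graphs are meant to surface.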
At 3 requests per second, our system shows a latency of around 1720 ms at the 50th percentile and 2700 ms at the...
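The scenario above boils down to four knobs: prompt length, cacheable prefix length, output length, and request rate. A hypothetical sketch of such a profile in Python (all field names are invented for illustration; consult the Tokenflood README for the tool's actual configuration schema):

```python
# Hypothetical load-profile description for the base-case scenario.
profile = {
    "prompt_tokens": 3000,      # total input length per request
    "prefix_tokens": 1000,      # shared prefix that a provider may cache
    "output_tokens": 60,        # expected completion length
    "requests_per_second": 3,   # sustained request rate
}

# Rough per-minute token budget: a quick sanity check before pointing
# a load like this at a pay-per-token provider (see the Caution above).
tokens_per_request = profile["prompt_tokens"] + profile["output_tokens"]
tokens_per_minute = tokens_per_request * profile["requests_per_second"] * 60
print(tokens_per_minute)
```

Multiplying that figure by a provider's per-token price gives a ballpark cost per minute of load testing, which is worth checking before any run.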

First seen: 2025-11-18 20:51

Last seen: 2025-11-18 22:51