We Hit 100% GPU Utilization – and Then Made It 3× Faster by Not Using It

## Summary

We recently used Qwen3-Embedding-0.6B to embed millions of text documents while sustaining near-100% GPU utilization the whole way. That's usually the gold standard that machine learning engineers aim for… but here's the twist: in the time it took to write this blog post, we found a way to make the same workload 3× faster, and it didn't involve maxing out GPU utilization at all. That story's for another post, but first, here's the recipe that got us to near-100%.

## The workload

Here at the Daft kitchen, the same order keeps coming in: "One fast, painless pipeline to get my documents into a vector database for retrieval!"

Heard.

We whipped up a sample workload that:

1. Reads millions of text documents from S3
2. Chunks them into sentences using spaCy
3. Computes embeddings with the state-of-the-art model Qwen3-Embedding-0.6B
4. Writes the embeddings to a vector database (turbopuffer)

## Mise en place

Before starting, let's install the required dependencies:

```shell
pip install "daft[ray]" turbopuffer torch sentence-transformers spacy accelerate transformers
python -m spacy download en_core_web_sm
```

You'll also need to configure access for the object store where you'll read data from. We prepared a sample dataset on AWS S3.

## Import Dependencies and Configure Constants

We'll then set the workload parameters:

```python
import torch
import daft
from daft import col

NUM_GPU_NODES = 8
NLP_MODEL_NAME = "en_core_web_sm"
CHUNKING_PARALLELISM = 8
EMBEDDING_MODEL_NAME = "Qwen/Qwen3-Embedding-0.6B"
ENCODING_DIM = 1024
BATCH_SIZE = 512
SENTENCE_TRANSFORMER_BATCH_SIZE = 16
```

These parameters control resource allocation and processing efficiency. Adjust NUM_GPU_NODES based on your cluster size, and modify batch sizes based on your data and available GPU memory.

## Step 1: Chunk Text

When creating embeddings, it's useful to split your text into meaningful chunks. Text is hierarchical and can be broken down at different levels: Document → Sections → Paragraphs → Sentences → Words → Characters.
The chunking strategy to use depends on your use case.

### Chunking Strategies

- Sentence-lev...
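To make the trade-off concrete, here is a minimal, dependency-free sketch contrasting two common strategies: sentence-level chunks versus a fixed-size sliding window of sentences with overlap. The actual pipeline uses spaCy's en_core_web_sm for sentence splitting; the regex splitter and the helper names (`sentence_chunks`, `windowed_chunks`) below are illustrative stand-ins, not the pipeline's code.

```python
import re


def sentence_chunks(text: str) -> list[str]:
    # Naive sentence splitter: break on ., !, or ? followed by whitespace.
    # The real pipeline uses spaCy (en_core_web_sm) for this step; a regex
    # is only a lightweight stand-in for illustration.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]


def windowed_chunks(sentences: list[str], size: int = 3, overlap: int = 1) -> list[str]:
    # Fixed-size sliding window over sentences. Overlapping windows ensure
    # that context spanning a chunk boundary appears in at least one chunk.
    step = size - overlap
    return [
        " ".join(sentences[i:i + size])
        for i in range(0, max(len(sentences) - overlap, 1), step)
    ]


text = (
    "Daft reads documents. spaCy splits them. "
    "Qwen embeds them. The results land in a vector store."
)
sents = sentence_chunks(text)
print(sents)          # four sentence-level chunks
print(windowed_chunks(sents, size=2, overlap=1))  # overlapping 2-sentence chunks
```

Sentence-level chunks keep each embedding tightly scoped, while windowed chunks trade some redundancy for cross-sentence context; which wins depends on how your retrieval queries are phrased.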
