Hey HN! We're Shreyash and Bhavnick. We're building Chonkie (https://chonkie.ai), an open-source library for chunking and embedding data.Python: https://github.com/chonkie-inc/chonkieTypeScript: https://github.com/chonkie-inc/chonkie-tsHere's a video showing our code chunker: https://youtu.be/Xclkh6bU1P0.Bhavnick and I have been building personal projects with LLMs for a few years. For much of this time, we found ourselves writing our own chunking logic to support RAG applications. We often hesitated to use existing libraries because they either had only basic features or felt too bloated (some are 80MB+).We built Chonkie to be lightweight, fast, extensible, and easy. The space is evolving rapidly, and we wanted Chonkie to be able to quickly support the newest strategies. We currently support: Token Chunking, Sentence Chunking, Recursive Chunking, Semantic Chunking, plus:- Semantic Double Pass Chunking: Chunks text semantically first, then merges closely related chunks.- Code Chunking: Chunks code files by creating an AST and finding ideal split points.- Late Chunking: Based on the paper (https://arxiv.org/abs/2409.04701), where chunk embeddings are derived from embedding a longer document.- Slumber Chunking: Based on the "Lumber Chunking" paper (https://arxiv.org/abs/2406.17526). It uses recursive chunking, then an LLM verifies split points, aiming for high-quality chunks with reduced token usage and LLM costs.You can see how Chonkie compares to LangChain and LlamaIndex in our benchmarks: https://github.com/chonkie-inc/chonkie/blob/main/BENCHMARKS....Some technical details about the Chonkie package: - ~15MB default install vs. ~80-170MB for some alternatives. - Up to 33x faster token chunking compared to LangChain and LlamaIndex in our tests. - Works with major tokenizers (transformers, tokenizers, tiktoken). - Zero external dependencies for basic functionality. - Implements aggressive caching and precomputation. - Uses running mean pooling for efficient semantic c...
First seen: 2025-06-09 16:18
Last seen: 2025-06-09 20:20