Load Test GlassFlow for ClickHouse: Real-Time Dedup at Scale

Summary

By Ashish Bagri, Co-founder & CTO of GlassFlow

TL;DR

We tested GlassFlow on a real-world deduplication pipeline with Kafka and ClickHouse. It handled 55,000 records/sec published to Kafka and processed 9,000+ records/sec on a MacBook Pro, with sub-0.12ms latency. No crashes, no message loss, no reordering. Even with 20M records and 12 concurrent publishers, it remained robust.

Want to try it yourself? The full test setup is open source at https://github.com/glassflow/clickhouse-etl-loadtest and the docs are at https://docs.glassflow.dev/load-test/setup

Why this test?

ClickHouse is incredible at fast analytics. But when building real-time pipelines from Kafka to ClickHouse, many teams run into the same issues: analytics results are incorrect or too delayed to support real-time use cases. The root cause? Duplicate data and slow joins. Duplicates are often introduced by retries, offset reprocessing, or downstream enrichment. These problems can affect both correctness and performance.

That’s why we built GlassFlow: a real-time streaming ETL engine designed to process Kafka streams before data hits ClickHouse. After launching the product, we were often asked, “How does it perform at high loads?” With this post, we want to give a clear and reproducible answer. This article walks through what we tested, how we set it up, and what we found when testing deduplication with GlassFlow.

What is GlassFlow?

GlassFlow is an open-source streaming ETL service developed specifically for ClickHouse. It is a real-time stream processing solution designed to simplify creating and managing data pipelines between Kafka and ClickHouse. It supports real-time deduplication (configurable window, event ID based), stream joins between topics, exactly-once semantics, and a native ClickHouse sink with efficient batching and buffering. GlassFlow handles the hard parts: state, ordering, retries, and batching.
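To make "configurable window, event ID based" deduplication concrete, here is a minimal sketch of the general technique in Python. It is not GlassFlow's implementation: the class name `WindowedDeduplicator`, its `accept` method, and the in-memory state are all hypothetical, and the sketch assumes events arrive with non-decreasing timestamps.

```python
from collections import OrderedDict
from typing import Hashable


class WindowedDeduplicator:
    """Hypothetical helper: drops events whose ID was already seen
    within the last `window_seconds`. Illustrative only."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        # Maps event ID -> timestamp of its first sighting, oldest first.
        # Assumes timestamps are non-decreasing, so insertion order
        # matches time order.
        self._seen: "OrderedDict[Hashable, float]" = OrderedDict()

    def _evict(self, now: float) -> None:
        # Forget IDs whose first sighting has fallen out of the window.
        while self._seen:
            _, first_seen = next(iter(self._seen.items()))
            if now - first_seen < self.window_seconds:
                break
            self._seen.popitem(last=False)

    def accept(self, event_id: Hashable, timestamp: float) -> bool:
        """Return True if the event should be forwarded, False if it is
        a duplicate inside the current window."""
        self._evict(timestamp)
        if event_id in self._seen:
            return False
        self._seen[event_id] = timestamp
        return True


# Example: a 10-second window over (event_id, timestamp) pairs.
events = [("a", 0.0), ("b", 1.0), ("a", 2.0), ("a", 11.0)]
dedup = WindowedDeduplicator(window_seconds=10.0)
kept = [event_id for event_id, ts in events if dedup.accept(event_id, ts)]
```

The second "a" is dropped because it falls inside the window. The third is kept because the first sighting has expired by then. A production system would also need durable state and handling for out-of-order events, which this sketch leaves out.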
More about GlassFlow at ...

First seen: 2025-06-22 15:54

Last seen: 2025-06-22 19:56