The Concurrency Trap: How an Atomic Counter Stalled a Pipeline

https://news.ycombinator.com/rss Hits: 4
Summary

On February 2nd, Conviva’s streaming analytics platform suddenly slowed to a crawl, but only for one customer. P99 latency spiked without a clear reason, pushing our DAG engine to its limits. What started as a puzzling slowdown soon became a deep dive into concurrency pitfalls.

Conviva’s platform is built to handle 5 trillion daily events, powered by a DAG (directed acyclic graph) based analytics engine. Each customer’s logic is compiled into a DAG, running concurrently on a custom actor model built atop Tokio.

This post unpacks how a seemingly innocuous atomic counter in a shared type registry became the bottleneck, and what we learned about concurrency, cache lines, and the right data structures for the job. If you use Rust at scale, or plan to, you’ll enjoy this.

Setting The Stage

We initially tried debugging the issue by eliminating the obvious causes: watermarking, inaccurate metrics, and so on.

[Figure: Traffic from gateway showing the P99 latency spike]

There was some spirited discussion about whether the way the Tokio runtime schedules its tasks across physical threads was causing issues. That seemed improbable: we use an actor system, each DAG processing task runs independently on a specific actor, and it was unlikely that multiple actors were being scheduled onto the same underlying physical thread. Another line of inquiry asked whether HDFS writes were causing lag to build up and eventually creating backpressure throughout the system. Further analysis of more graphs showed increased context switching during the incident, but still no clear evidence of the cause.

Analyzing The Evidence

We were able to reproduce the issue by saving the event data to GCS buckets and replaying it in a perf-enabled environment. This was a relief because at least the issue wasn’t tied to the prod environment, which would have been a nightmare to debug.
We track active sessions across our system, so we have a reasonable measure of how muc...
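The introduction names a shared atomic counter and cache lines as the heart of the problem. As a rough illustration of that general pattern, and not a reconstruction of Conviva’s type registry, the sketch below contrasts many threads incrementing one `AtomicU64` with each thread incrementing its own padded slot. The names `PaddedCounter`, `shared_count`, and `sharded_count` are hypothetical, and the 64-byte alignment assumes a 64-byte cache line.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical helper: align each counter to 64 bytes so that, on hardware
/// with 64-byte cache lines (an assumption), no two counters share a line.
#[repr(align(64))]
struct PaddedCounter(AtomicU64);

/// Every thread increments the same atomic, so the cache line that holds it
/// has to move between cores on each update.
fn shared_count(threads: usize, iters: u64) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iters {
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

/// Each thread increments only its own padded slot; the slots are summed
/// once after all threads have finished.
fn sharded_count(threads: usize, iters: u64) -> u64 {
    let shards: Arc<Vec<PaddedCounter>> = Arc::new(
        (0..threads)
            .map(|_| PaddedCounter(AtomicU64::new(0)))
            .collect(),
    );
    let handles: Vec<_> = (0..threads)
        .map(|i| {
            let s = Arc::clone(&shards);
            thread::spawn(move || {
                for _ in 0..iters {
                    s[i].0.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    shards.iter().map(|c| c.0.load(Ordering::Relaxed)).sum()
}
```

Both functions return `threads * iters`; what differs is how much cache-line traffic the increments generate. Timing the two calls from a `main` function under a profiler such as perf is one way to see whether the shared version costs more on a given machine.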

First seen: 2025-06-10 18:24

Last seen: 2025-06-10 21:24