Beyond 50,000 Partitions: How Heroku Operates and Pushes the Limits of Kafka at Scale
Recorded at DataEngConf SF17 in April 2017.

At Heroku, we offer Apache Kafka as a service to a large number of distinct users, each with varying use cases. Our goal is to provide this service while removing many of the operational headaches that come with running infrastructure like this at scale. While we run a robust service today, it took a lot of effort to get here.

Early on, we uncovered a variety of failure scenarios. One such problem surfaced when a large number of concurrent admin commands, such as topic or consumer group creation, would cause a cascading broker failure. Our users would then observe not only enormous latencies around these admin commands but also a considerable drop in throughput until our infrastructure automatically recovered. It was important for us to address this because it would only get worse as our user base grew.

Once we overcame this hurdle, we discovered another problem: brokers would fail when a large number of users constantly exceeded their quotas. This was, in part, a result of our decision to leverage quotas to ensure our users could coexist without interfering with one another. However, it turned out to be another case of cascading broker failures, since the underlying quota implementation, based on traffic shaping, had a memory leak that caused out-of-memory errors one broker at a time.

In this talk, we will expand on the stories above in greater detail and share more anecdotes of how we discovered and addressed interesting behaviors and gotchas as we scaled our infrastructure to ensure a robust and trustworthy service for our growing user base.
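For readers less familiar with Kafka admin operations, the sketch below shows the kind of topic-creation request the abstract refers to, using the Java AdminClient that ships with recent Kafka releases. The class name, topic name, partition count, and bootstrap address are illustrative assumptions, not Heroku's actual tooling or configuration; each such request triggers controller work and metadata propagation across brokers, which is why large bursts of concurrent admin commands can be expensive.

```java
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        // Placeholder broker address for illustration only.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // One topic with 32 partitions and replication factor 3. Each request like
            // this adds metadata and replication work for the controller and brokers,
            // so many of them issued concurrently can pile up into significant load.
            NewTopic topic = new NewTopic("example-events", 32, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```

The quotas mentioned in the abstract are Kafka's built-in client quotas (for example, producer_byte_rate and consumer_byte_rate, set per client ID or user), which brokers enforce by throttling traffic; that throttling, or traffic shaping, is the mechanism whose memory leak the talk describes.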