Andrew Bennett - Erlang Dist Filtering and the WhatsApp Runtime System
https://2023.elixirconf.com/presenters#speaker-andrew-bennett-elixirconf-us-2023

In this talk, we will shed light on WhatsApp’s journey of managing one of the world’s largest Erlang/OTP clusters and the challenges posed by the inherent limitations of the Erlang Distribution Protocol. The protocol is the bedrock for clustering between Erlang and Elixir nodes, enabling the seamless communication and operation we enjoy today. However, its original design is not inherently secure, so it can pose significant risks in larger-scale environments, where trust between servers becomes a question not only of capability but of security.

We’ve seen this trust question evolve at WhatsApp. What started as “Should these few hundred servers trust each other?” has, with our growth, turned into “Should these tens of thousands of servers trust each other?” At that scale, even unintentional mistakes can lead to massive outages, because the mesh network formed by the nodes gives any single mistake a large blast radius.

To address this challenge, over the past year we’ve developed a Native Implemented Function (NIF) named “erldist_filter_nif”. It decodes dist packets on the receiving end, letting us allow them, drop them, or redirect them for further inspection. This level of control will enable us to significantly reduce the blast radius a particular dist operation might have across the cluster.

Join me in this session as I share our experiences and lessons learned and reveal how our innovations are pushing the boundaries of Erlang and Elixir in massive-scale production environments.