Debugging Distributed Systems by Donny Nadolny
Despite our best efforts, our systems fail. Sometimes it’s our fault: code that we wrote or bugs that we caused. But sometimes the fault lies with systems that we rely on. ZooKeeper is a very useful distributed system that is often used as a building block for other distributed systems, like Kafka and Spark. PagerDuty uses it for many critical systems, and for five months it failed on us a lot. We will walk through the process of finding and fixing one cause of many of these failures. You will learn how to use various tools to stress test the network, some intricate details of how ZooKeeper works, and possibly more than you wanted to know about TCP, including an example of machines having different views of the state of a TCP stream.

Donny Nadolny is a developer at PagerDuty. He has been using Java for many years, becoming a Sun Certified Java Programmer (for Java 1.4) even before getting his driver’s license, and is always interested in talking about distributed systems. [FWD-0632]