Beginner's guide to balance your data across Apache Kafka partitions (Olena KUTSENKO)
Voxxed Days Luxembourg 2023
Room: Linux
Type: Conference
[Microphone issues, sadly]

Apache Kafka is a distributed system. At its heart is a set of brokers that hold topics, and each topic is split into partitions. Dividing topics into smaller pieces lets us work with data in parallel and achieve higher throughput. This parallelism is the key to a performant cluster, but it comes at a price.

First, reading from multiple partitions will eventually mess up the order of records: the order in which they are consumed will differ from the order in which they were pushed into the cluster. Another big challenge is uneven distribution of data across partitions. Overloaded partitions are a serious performance problem for every party involved, but especially for brokers and consumers.

Therefore, when designing our product architecture, we should carefully weigh how many partitions we need, how to ensure proper message ordering, and how to balance records across partitions, without forgetting how the data load is distributed over time. And we need to do all of this while still maintaining good cluster performance.

If you're new to Apache Kafka, or looking for good practices to design your partitions and avoid common pitfalls, you'll find this session useful!
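
To make the uneven-distribution problem concrete, here is a minimal Python sketch. It is not taken from the talk. It assumes records are assigned to partitions by hashing the record key modulo the partition count, which is the general idea behind key-based partitioning in Kafka. `zlib.crc32` is only a stand-in hash (Kafka's Java client uses murmur2), and the function names, key names, and workload are hypothetical.

```python
import zlib
from collections import Counter


def assign_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition as hash(key) mod partition count.

    Assumption: zlib.crc32 is a stand-in hash used for illustration only;
    it will not reproduce the partition numbers a real Kafka client picks.
    """
    return zlib.crc32(key) % num_partitions


def partition_load(keys, num_partitions: int) -> list:
    """Count how many records land on each partition."""
    counts = Counter(assign_partition(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical workload: one "hot" customer produces 70% of all records,
# the remaining 30% are spread over 300 distinct customers.
keys = [b"customer-hot"] * 700 + [f"customer-{i}".encode() for i in range(300)]

load = partition_load(keys, num_partitions=6)
skew = max(load) / (sum(load) / len(load))  # busiest partition vs. average

print("records per partition:", load)
print("busiest partition carries", round(skew, 2), "times the average load")
```

Because every record with the same key maps to the same partition, per-key ordering is preserved, but the partition that receives the hot key ends up carrying most of the traffic. That is the trade-off between ordering and balance that the abstract describes.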