
Apache Spark? If only it worked. by Marcin Szymaniuk

Do you plan to start working with Apache Spark? Are you already working with Spark but not getting the performance and stability you expected, and not sure where to look for a fix? Spark has a very nice API and promises high performance for crunching large datasets. It’s really easy to write an app in Spark. Unfortunately, it’s also easy to write one that doesn’t perform the way you would expect or just fails for no obvious reason.

This talk will cover several common problems one might face when running Spark at full scale and, of course, their solutions. Each problem will come with well-described background and examples, so it can be understood by people with no Spark experience. However, people who already work with Spark are the main audience. The ultimate objective is to give the audience a practical framework for solving the most common problems with Spark applications.

Classes of problems covered in the presentation:

* Dealing with skewed data
* Spark on YARN and its memory model
* Caching
* Sizing executors
* Locality

Marcin Szymaniuk

Data developer, data infrastructure administrator, consultant. Companies I have worked for include VRBO, Spotify, TrueCaller and, most recently, Apple.

[KJD-8834]

November 7, 2016