
Beyond Shuffling: Scaling Apache Spark by Holden Karau

This video was recorded at Scala Days Berlin 2016. Follow us on Twitter @ScalaDays or visit our website for more information: http://scaladays.org

Abstract: This session will cover our and the community's experiences scaling Spark jobs to large datasets, along with the resulting best practices and code snippets to illustrate them. The planned topics are:

* Using Spark counters for performance investigation: Spark collects a large number of statistics about our code, but how often do we really look at them? We will cover how to investigate performance issues and figure out where to best spend our time using both counters and the UI (see the counter sketch after this list).
* Working with key/value data: replacing groupByKey for awesomeness. groupByKey makes it too easy to accidentally collect individual records that are too large to process. We will talk about how to replace it in different common cases with more memory-efficient operations (see the aggregateByKey sketch below).
* Effective caching and checkpointing: being able to reuse previously computed RDDs without recomputing them can substantially reduce execution time. Choosing when to cache, when to checkpoint, and which storage level to use can have a huge performance impact (see the caching sketch below).
* Considerations for noisy clusters.
* Functional transformations with Spark Datasets: how to have some of the benefits of Spark's DataFrames while still being able to work with arbitrary Scala code (see the Dataset sketch below).
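
For the counters topic, here is a minimal sketch of using a Spark accumulator to track bad input while a job runs. It assumes the Spark 2.x accumulator API (sc.longAccumulator); the application name and input path are placeholders, not from the talk.

```scala
import org.apache.spark.sql.SparkSession

object CounterSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("counter-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Placeholder input path: one record per line.
    val lines = sc.textFile("hdfs:///data/input")

    // Accumulators ("counters") let tasks count events inside transformations;
    // the totals are readable on the driver once an action has run.
    val emptyLines = sc.longAccumulator("emptyLines")
    val lengths = lines.flatMap { line =>
      if (line.trim.isEmpty) { emptyLines.add(1); None }
      else Some(line.length)
    }

    // Updates made inside transformations can be counted more than once if a
    // task is retried, so treat these values as diagnostics, not results.
    println(s"total chars: ${lengths.sum()}, empty lines: ${emptyLines.value}")
    spark.stop()
  }
}
```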
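
For the groupByKey discussion, a common illustration is the per-key average. The sketch below contrasts a groupByKey version, which materialises every value for a key at once, with an aggregateByKey version that only shuffles a (sum, count) pair per key. The object and method names are illustrative, not taken from the talk.

```scala
import org.apache.spark.rdd.RDD

object GroupByKeySketch {
  // Per-key average via groupByKey: every value for a key is pulled into
  // memory at once, which fails or slows badly on skewed keys.
  def averageWithGroupByKey(pairs: RDD[(String, Double)]): RDD[(String, Double)] =
    pairs.groupByKey().mapValues(vs => vs.sum / vs.size)

  // Per-key average via aggregateByKey: values are combined map-side, so only
  // a small (sum, count) accumulator per key crosses the shuffle.
  def averageWithAggregateByKey(pairs: RDD[(String, Double)]): RDD[(String, Double)] =
    pairs
      .aggregateByKey((0.0, 0L))(
        (acc, v) => (acc._1 + v, acc._2 + 1L),  // fold a value into the partition-local accumulator
        (a, b) => (a._1 + b._1, a._2 + b._2))   // merge accumulators from different partitions
      .mapValues { case (sum, count) => sum / count }
}
```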
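
For caching and checkpointing, a minimal sketch of the usual pattern: persist an expensively computed RDD so that multiple actions reuse it, and checkpoint it to truncate lineage. The storage level choice, the paths, and the parse helper are assumptions made for this sketch.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object CacheCheckpointSketch {
  def run(sc: SparkContext): Unit = {
    // Placeholder directory for checkpoint files on reliable storage.
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    // Keep only the records the hypothetical parser accepts.
    val expensive = sc.textFile("hdfs:///data/events")
      .flatMap(line => parse(line))

    // MEMORY_AND_DISK spills partitions that do not fit in memory rather than
    // dropping and recomputing them, a reasonable default for reused RDDs.
    expensive.persist(StorageLevel.MEMORY_AND_DISK)

    // checkpoint() writes the data out and truncates the lineage graph, which
    // helps long or iterative jobs whose lineage would otherwise keep growing.
    expensive.checkpoint()

    // Both actions below reuse the persisted data instead of re-reading input.
    println(expensive.count())
    expensive.take(5).foreach(println)
  }

  // Hypothetical parser for the sketch.
  private def parse(line: String): Option[String] =
    if (line.nonEmpty) Some(line.trim) else None
}
```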
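
For the Datasets topic, a small sketch of running ordinary Scala functions over a typed Dataset, assuming the Spark 2.x SparkSession API; the Purchase case class and the sample data are made up for illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object DatasetSketch {
  // A typed record; Encoders for case classes come from spark.implicits.
  case class Purchase(user: String, amount: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-sketch").getOrCreate()
    import spark.implicits._

    val purchases: Dataset[Purchase] = Seq(
      Purchase("ada", 10.0), Purchase("grace", 25.0), Purchase("ada", 5.0)
    ).toDS()

    // filter and map take ordinary Scala functions over typed objects, so
    // arbitrary code can run while the data stays in the Dataset's compact
    // Tungsten encoding between operations.
    val bigSpenders: Dataset[String] =
      purchases.filter(p => p.amount > 7.5).map(p => p.user)

    bigSpenders.show()
    spark.stop()
  }
}
```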

June 13, 2016