Go to content

Scala and the JVM as a Big Data Platform Lessons from Apache Spark by Dean Wampler

This video was recorded at Scala Days Berlin 2016 follow us on Twitter @ScalaDays or visit our website for more information http://scaladays.org Abstract: Spark is implemented in Scala and its user-facing Scala API is very similar to Scala's own Collections API. The power and concision of this API are bringing many developers to Scala. The core abstractions in Spark have created a flexible, extensible platform for applications like streaming, SQL queries, machine learning, and more. Scala's uptake reflects the following advantages over Java: A pragmatic balance of object-oriented and functional programming. An interpreter mode, which allows the same sort of exploratory programming that Data Scientists have enjoyed with Python and other languages. Scala-centric "Notebooks" are also now available. A rich Collections library that enables composition of operations for concise, powerful code. Tuples are naturally expressed in Scala and very convenient for working with data. Pattern Matching makes data "deconstruction" fast and intuitive. Type inference provides safety, feedback to the developer, yet minimal, manual typing of actual type signatures. Scala idioms lend themselves to the construction of small domain specific languages, which are useful for building libraries that are concise and intuitive for domain experts. There are disadvantages, too, which we'll discuss. Spark, like almost all open-source, Big Data tools, leverages the JVM, which is an excellent, general-purpose platform for scalable computing. However, its management of objects is suboptimal for high-performance data crunching. The way objects are organized in memory and the subsequent impact that has on garbage collection can be improved for the special case of Big Data. Hence, the Spark project has recently started a project called "Tungsten" to build internal optimizations using the following techniques: * Custom data layouts that use memory very efficiently with cache-awareness. * Manual memory management, both on-heap and off-heap, to minimize garbage and GC pressure. * Code generation to create optimal implementations of certain, heavily-used expressions from user code. This talk discusses the strengths and weaknesses of Scala and the JVM for Big Data, Spark in particular, and how we might improve both to make them better tools for our needs.

June 13, 2016