Data-centric Metaprogramming by Vlad Ureche
This video was recorded at Scala Days Berlin 2016. Follow us on Twitter @ScalaDays or visit our website for more information: http://scaladays.org

Abstract: We can compose data structures like LEGO bricks: a relational employee table can be modelled as a `Vector[Employee]`, where we use the standard `Vector` collection. Yet few programmers know just how inefficient this is: iterating requires dereferencing a pointer for each employee, and a good part of the memory is occupied by redundant bookkeeping information.

Data-centric metaprogramming is a technique that allows developers to tweak how their data structures are stored in memory, thus improving performance. For example, we can use `Vector[Employee]` throughout the program, despite its inefficiency. Then, when performance starts to matter, we simply instruct the compiler how to store the `Vector[Employee]` more efficiently, using separate arrays (or vectors) for each component, as sketched below. In turn, the compiler uses this information to optimize our code, automatically switching to the improved memory layout for the `Vector[Employee]`. This makes premature optimization unnecessary: we write the code using our favorite abstractions and, only when necessary, tune them after the fact.

There are many use cases for data-centric metaprogramming. For example, applied to Spark, it can produce 40% speedups. The Scala compiler plugin that enables data-centric metaprogramming is developed at github.com/miniboxing/ildl-plugin and is documented at scala-ildl.org. This also avoids the misinterpretation that the newly introduced whole-world Dataset optimisation (URL: https://databricks.com/blog/2016/05/11/apache-spark-2-0-technical-preview-easier-faster-and-smarter.html) relies on the data-centric metaprogramming approach, which is not the case.
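To give a rough idea of the layout change described above, here is a minimal, hand-written sketch in plain Scala. The names `Employee`, `EmployeeColumns`, `toColumns` and `averageSalary` are illustrative only and are not part of the ildl-plugin API; the plugin's point is that it derives this kind of rewrite automatically, so the program can keep using `Vector[Employee]`.

```scala
object ColumnarSketch {
  // Row-oriented layout: one heap object (plus header and pointers) per employee.
  final case class Employee(id: Int, name: String, salary: Double)

  // Column-oriented layout ("separate arrays for each component"): numeric
  // fields are stored unboxed and contiguously, with no per-row objects.
  final class EmployeeColumns(
    val ids: Array[Int],
    val names: Array[String],
    val salaries: Array[Double]
  )

  // Hand-written conversion; the compiler plugin generates code along these
  // lines from a transformation description supplied by the programmer.
  def toColumns(rows: Vector[Employee]): EmployeeColumns =
    new EmployeeColumns(
      rows.map(_.id).toArray,
      rows.map(_.name).toArray,
      rows.map(_.salary).toArray
    )

  // Iteration now scans a flat Array[Double] instead of chasing one pointer per row.
  def averageSalary(cols: EmployeeColumns): Double =
    if (cols.salaries.isEmpty) 0.0
    else cols.salaries.sum / cols.salaries.length

  def main(args: Array[String]): Unit = {
    val employees = Vector(Employee(1, "Ada", 90000.0), Employee(2, "Grace", 95000.0))
    println(averageSalary(toColumns(employees))) // prints 92500.0
  }
}
```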