
Scala: The Unpredicted Lingua Franca for Data Science - by Andy Petrella & Dean Wampler

This talk was recorded at Scala Days New York, 2016. Follow along on Twitter @scaladays and visit the website for more information: http://scaladays.org/. Abstract: Until fairly recently, the languages of choice for Data Scientists to manipulate and make sense of data were mainly Python, R, and Matlab. This led to a split in the communities and to duplicated effort across languages offering similar functionality. Although it was foreseen that Julia (for instance) could bring parts of these communities together, something unexpected happened: the growth in the amount of available data and the rise of distributed technologies to handle it. These distributed technologies were built, seemingly out of the blue, by data engineers, and most of them run on a convenient and easy-to-deploy platform, the JVM. In this talk, we’ll show how Data Scientists are now part of a heterogeneous team that has to face many problems and work together towards a global solution. This includes a new responsibility to be productive and agile so that their work can be integrated into the platform. This is why technologies like Apache Spark are so important nowadays and are gaining traction in different communities. And even though some bindings to legacy languages are available there, all the creativity in new ways to analyse the data has to happen in Scala. The second part of this talk will therefore introduce and summarize the new methodologies and scientific advances in machine learning that have been developed with Scala as the main language, rather than others. We’ll demonstrate all of this using the right tooling for Data Scientists, which enables interactivity, live reactivity, charting capabilities, and robustness in Scala, something that was still missing from the legacy languages. The examples will be provided and shown in a fully productive and reproducible environment combining the Spark Notebook and Docker.

May 9, 2016