Lessons Learned Optimizing NoSQL for Apache Spark
How do you take a platform designed for large-scale storage of unstructured key-value data and optimize it for the structured world of Spark? In this talk we'll look at real-world lessons learned integrating Riak, the distributed key-value NoSQL database, with Spark, covering both the challenges and the solutions. We'll also dive into more advanced topics we encountered while building the open source Spark-Riak connector, including:

- How to handle traditionally schema-less data across widely divergent use cases
- Using dynamic data mapping to efficiently bridge NoSQL data into the Spark world of RDDs and DataFrames
- How to optimize performance with advanced techniques such as parallel data extraction and the cluster's coverage plan
- Real-world examples using Spark SQL and Spark Streaming for time series use cases
- Leveraging Riak's built-in leader election service (LES) for Spark Master high availability (HA), removing the need for Apache Zookeeper
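
To give a rough sense of what the Spark side of this bridging looks like, below is a minimal Scala sketch. It assumes schema-less Riak values have already been mapped into a typed case class (the SensorReading class, the bucket name, and the riakBucket call mentioned in the comments are illustrative assumptions, not the connector's confirmed API) and shows the resulting rows being queried with Spark SQL.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record shape after dynamic data mapping has turned
// schema-less Riak values into typed rows.
case class SensorReading(deviceId: String, ts: Long, temperature: Double)

object RiakSparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("riak-spark-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // In a real job the rows would come from Riak via the Spark-Riak
    // connector (e.g. something like sc.riakBucket[SensorReading]("sensor-data"),
    // call name assumed here); the data is stubbed so the sketch runs standalone.
    val readings = Seq(
      SensorReading("dev-1", 1000L, 21.5),
      SensorReading("dev-1", 2000L, 22.1),
      SensorReading("dev-2", 1000L, 19.8)
    ).toDS()

    // Expose the mapped data to Spark SQL as a temporary view and query it.
    readings.createOrReplaceTempView("readings")
    spark.sql(
      "SELECT deviceId, avg(temperature) AS avg_temp FROM readings GROUP BY deviceId"
    ).show()

    spark.stop()
  }
}
```

The same pattern extends to the streaming and time series examples covered in the talk: once the connector has mapped Riak data into DataFrames, standard Spark SQL and Spark Streaming APIs apply unchanged.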