
Scaling Up Spark at Facebook – a 60TB Production Use Case

Recorded at DataEngConf SF17 in April 2017. This talk is a deep dive into how Facebook converted a gigantic Hive batch processing job that used 6,000 CPU days into a Spark job that runs with a quarter of the CPU at a quarter of the latency. To accomplish this, we made numerous stability and performance improvements to Apache Spark, tuned configurations, and optimized our business logic. Based on material named among the top 10 Apache Spark blog posts of 2016, this talk describes the experiences and lessons learned while scaling Spark to replace one of Facebook's Hive workloads. Examples include migrating one of the existing pipelines to Spark to enable fresher feature data and improved manageability. This led to major reliability improvements, including making PipedRDD handle fetch failures gracefully and enabling less disruptive cluster restarts. In addition, performance optimizations were made as part of the migration to Spark, such as reducing shuffle write latency, which yielded CPU improvements of up to 50% for jobs writing a large number of shuffle partitions.
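As a rough illustration of the kind of configuration tuning the talk touches on, the sketch below adjusts a few shuffle-related Spark settings for a job that writes many shuffle partitions. The keys are standard Spark configuration options, but the values and the overall setup are illustrative assumptions only, not the settings Facebook actually used.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical shuffle-related tuning for a job with a large number of
// shuffle partitions. Values are placeholders; real settings depend on
// cluster size and workload and are not taken from the talk.
val spark = SparkSession.builder()
  .appName("large-shuffle-job")
  // More fetch retries and a longer wait make shuffle fetches more tolerant
  // of transient node or network failures.
  .config("spark.shuffle.io.maxRetries", "10")
  .config("spark.shuffle.io.retryWait", "30s")
  // Number of partitions used for Spark SQL shuffles; very large jobs often
  // need far more than the default of 200.
  .config("spark.sql.shuffle.partitions", "10000")
  .getOrCreate()
```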

April 26, 2017