Rockhopper: A Robust Optimizer for Spark Configuration Tuning in Production Environment
- Yiwen Zhu
- Rathijit Sen
- Brian Kroth
- Sergiy Matusevych
- Andreas Mueller
- Tengfei Huang
- Rahul Challapalli
- Weihan Tang
- Xin He
- Mo Liu
- Estera Kot
- Sule Kahraman
- Arshdeep Sekhon
- Dario Bernal
- Aditya Lakra
- Shaily Fozdar
- Dhruv Relwani
- Rui Fang
- Long Tian
- Karuna Sagar Krishna
- Ashit Gosalia
- Carlo Curino
- Subru Krishnan
Companion of the 2025 International Conference on Management of Data
Apache Spark, renowned for its scalability and ease of use, has become the standard for big data processing. However, optimizing Spark performance in production environments poses significant challenges. Traditional machine learning-based configuration tuning methods often require extensive resources and lengthy experimentation, and they risk performance regressions. Observational noise in production environments further complicates the tuning process, leading to suboptimal results. This paper presents an adaptive, robust learning approach that leverages insights from benchmark workloads to improve production tuning strategies. We propose a Centroid Learning algorithm that is resilient to noise, minimizes regressions, and prioritizes promising configurations, combined with a workload embedding technique for context-aware adaptation and transfer learning. Evaluations using benchmark and customer workloads show consistent performance gains. The system was released in June 2024 as part of the Microsoft Fabric Spark offering. Even with dynamic and evolving workloads, it delivers approximately a 20% performance improvement in production for customer workloads by tuning only three query-level configurations.