Managing Thousands of Spark Workers in Cloud Environment

At DataVisor, we fight online fraud, abuse, and money laundering using an unsupervised machine learning approach that clusters millions of users. To support this computationally intensive workload, DataVisor uses Spark as the mainstay of its computation infrastructure. The scalability and portability of our Spark infrastructure are critical to the company as we expand our business. In this talk, we will present how we manage our Spark infrastructure at scale.

At peak time, we have 2000+ Spark workers online, grouped into ~50 clusters of various sizes. This grouping has two benefits. First, it provides data isolation, which is critical to DataVisor because we process data from multiple customers. Second, it lets us balance cost and performance by giving each Spark application just enough resources. When a cluster is under-provisioned, Spark applications fail with out-of-memory or out-of-disk errors. At the same time, we want to avoid unnecessary over-provisioning, as it dramatically increases our cloud cost.
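To make the "just enough resources" idea concrete, here is a minimal sketch of sizing a cluster from an application profile. It is an illustration only, not DataVisor's actual logic: the `AppProfile` fields, the `workers_needed` function, and the 20% headroom factor are all assumptions introduced for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class AppProfile:
    """Hypothetical resource profile of one Spark application."""
    name: str
    peak_memory_gb: float  # assumed estimate of peak memory across the job
    peak_disk_gb: float    # assumed estimate of peak shuffle/spill disk usage


def workers_needed(profile, worker_memory_gb, worker_disk_gb, headroom=1.2):
    """Return the smallest worker count that covers both memory and disk.

    `headroom` (an assumed safety factor) pads the estimates so the job is
    less likely to fail with out-of-memory or out-of-disk errors.
    """
    by_memory = math.ceil(profile.peak_memory_gb * headroom / worker_memory_gb)
    by_disk = math.ceil(profile.peak_disk_gb * headroom / worker_disk_gb)
    # The tighter constraint decides the cluster size; always keep one worker.
    return max(1, by_memory, by_disk)


profile = AppProfile("example-app", peak_memory_gb=100, peak_disk_gb=500)
print(workers_needed(profile, worker_memory_gb=32, worker_disk_gb=200))
```

Sizing by the tighter of the two constraints avoids under-provisioning, while rounding up only to the next whole worker keeps over-provisioning small.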

Next, we will present DataVisor SparkGenerator (DSG), which is designed to manage our Spark infrastructure automatically. The responsibilities of DSG include (a) launching and shutting down Spark clusters to maximize concurrency and minimize cost, (b) assigning each Spark application to a suitable cluster according to the application's profile, (c) managing the dependencies among Spark applications so that our pipeline runs smoothly and efficiently, and (d) running all Spark workers on Spot instances, which reduces cloud computation cost by over 80% compared with on-demand instances.
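Responsibilities (b) and (c) can be sketched in a few lines. The abstract does not describe how DSG implements them, so the following is a hypothetical illustration: `run_order` orders applications with a standard topological sort, and `pick_cluster` uses a simple best-fit rule. Both function names, the application names, and the cluster names are invented for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_order(dependencies):
    """Return an execution order in which every app runs after its dependencies.

    `dependencies` maps an app name to the set of apps it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


def pick_cluster(required_workers, free_workers_by_cluster):
    """Best-fit assignment: the smallest cluster that still has enough workers.

    Returns None when no cluster fits, which a manager could treat as a
    signal to launch a new cluster.
    """
    fitting = [
        (free, name)
        for name, free in free_workers_by_cluster.items()
        if free >= required_workers
    ]
    return min(fitting)[1] if fitting else None


# Hypothetical three-stage pipeline: ingest -> features -> detect.
deps = {"detect": {"features"}, "features": {"ingest"}, "ingest": set()}
print(run_order(deps))

clusters = {"small": 2, "medium": 8, "large": 32}
print(pick_cluster(4, clusters))
```

Best fit keeps large clusters free for large applications; a real scheduler would also need to account for data isolation between customers and for Spot instance interruptions.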

Session hashtag: #HWCSAIS14

About Yuhao Zheng

Yuhao Zheng currently works at DataVisor as a Tech Lead Manager on the Infrastructure team. His work includes building a reliable, scalable, and efficient storage and computing infrastructure, as well as building the real-time detection service. Prior to DataVisor, he worked on the Android Connectivity team at Google. Dr. Zheng received his Ph.D. from the University of Illinois at Urbana-Champaign.

About Boduo Li

Boduo Li works at DataVisor as a Senior Research Scientist, Infrastructure. His work focuses on improving the scalability and efficiency of DataVisor’s Spark computation platform. Prior to DataVisor, Boduo worked as a Researcher at NEC Labs America. He received his Ph.D. from the University of Massachusetts Amherst.