Victor Cuevas-Vicenttin is a postdoctoral researcher at the Barcelona Supercomputing Center, where he works on benchmarking and performance analysis of big data analytics systems in cloud environments. His previous research addressed data provenance, query processing in dynamic environments, workflow systems, and logic programming. He worked on the DataONE project at the University of California, Davis, and has been a full-time professor at two well-known Mexican universities. He holds a PhD in Computer Science from the University of Grenoble, France.
Today, users have multiple options for big data analytics, spanning open-source and proprietary systems as well as cloud computing service providers. To obtain the best value for their money in a SaaS cloud environment, users need to be aware of the performance of each service and its associated costs, while also taking into account usability together with monitoring, interoperability, and administration capabilities. We present an independent analysis of two mature and well-known data analytics systems, Apache Spark and Presto. We evaluate both running on the Amazon EMR platform; for Apache Spark, we also analyze the Databricks Unified Analytics Platform and its associated runtime and optimization capabilities. Our analysis is based on running the TPC-DS benchmark and thus focuses on SQL performance, which remains indispensable for data scientists and engineers. In our talk we will present quantitative results that we expect to be valuable for end users, accompanied by an in-depth look at the advantages and disadvantages of each alternative. Attendees will thus be better informed about the current big data analytics landscape and better positioned to avoid common pitfalls in deploying data analytics at scale.