A month ago, we shared our entry to the 2014 Gray Sort competition, a third-party benchmark measuring how fast a system can sort 100 TB of data (1 trillion records). Today, we are happy to announce that our entry has been reviewed by the benchmark committee and we have officially won the Daytona GraySort contest!

In case you missed our earlier blog post: using Spark on 206 EC2 machines, we sorted 100 TB of data on disk in 23 minutes. In comparison, the previous world record, set by Hadoop MapReduce, used 2100 machines and took 72 minutes. This means that Apache Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache. Our entry tied with one from a UCSD research team building high-performance systems, and we jointly set a new world record.

|  | Hadoop MR Record | Spark Record | Spark 1 PB |
| --- | --- | --- | --- |
| Data Size | 102.5 TB | 100 TB | 1000 TB |
| Elapsed Time | 72 mins | 23 mins | 234 mins |
| # Nodes | 2100 | 206 | 190 |
| # Cores | 50400 physical | 6592 virtualized | 6080 virtualized |
| Cluster disk throughput | 3150 GB/s (est.) | 618 GB/s | 570 GB/s |
| Sort Benchmark Daytona Rules | Yes | Yes | No |
| Network | dedicated data center, 10Gbps | virtualized (EC2) 10Gbps network | virtualized (EC2) 10Gbps network |
| Sort rate | 1.42 TB/min | 4.27 TB/min | 4.27 TB/min |
| Sort rate/node | 0.67 GB/min | 20.7 GB/min | 22.5 GB/min |
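
For readers curious what a disk-to-disk sort looks like in Spark, here is a minimal sketch of the shape of such a job. It is illustrative only and not the benchmark code: the HDFS paths, the text-based record parsing, and the 10-character key are our assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair/ordered RDD implicits (needed on Spark 1.x)

// Minimal sketch of a disk-to-disk sort in Spark (not the benchmark code).
// Paths and record format are illustrative assumptions.
object SortSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sort-sketch"))

    sc.textFile("hdfs:///sort-input")          // read input records from HDFS
      .map(line => (line.take(10), line))      // key on the first 10 characters (assumption)
      .sortByKey()                             // shuffle and sort across the cluster
      .map { case (_, record) => record }      // keep only the full record
      .saveAsTextFile("hdfs:///sort-output")   // write sorted output back to HDFS

    sc.stop()
  }
}
```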

Named after Jim Gray, the benchmark workload is resource intensive by any measure: sorting 100 TB of data following the strict rules generates 500 TB of disk I/O and 200 TB of network I/O. Organizations from around the world often build dedicated sort machines (specialized software and sometimes specialized hardware) to compete in this benchmark.
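
One way those figures can add up, under our assumption (not stated in the original rules summary above) that the sorted output is written with 2× replication:

```latex
% Plausible accounting, in TB; 2x output replication is our assumption.
\begin{align*}
\text{Disk I/O}    &= 100\ (\text{read input}) + 100\ (\text{write shuffle}) + 100\ (\text{read shuffle}) + 200\ (\text{write replicated output}) = 500\ \text{TB}\\
\text{Network I/O} &= 100\ (\text{shuffle transfer}) + 100\ (\text{output replication}) = 200\ \text{TB}
\end{align*}
```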

Winning this benchmark as a general, fault-tolerant system marks an important milestone for the Spark project. It demonstrates that Spark is fulfilling its promise to serve as a faster and more scalable engine for data processing of all sizes, from GBs to TBs to PBs. In addition, it validates the work that we and others have been contributing to Spark over the past few years.

Since the inception of Databricks, we have devoted much effort to improving the scalability, stability, and performance of Spark. This benchmark builds on some of our major recent work in Spark, including sort-based shuffle (SPARK-2045), the new Netty-based transport module (SPARK-2468), and the external shuffle service (SPARK-3796). The first was released in Apache Spark 1.1, and the latter two will be part of the upcoming Apache Spark 1.2 release.
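
As a rough illustration of how these features are switched on in user code, the sketch below sets the relevant configuration keys as they exist in Spark 1.1/1.2. The values shown are the generic opt-ins, not the exact tuning used in the record run.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: enabling the features discussed above via SparkConf.
// These are the standard configuration keys in Spark 1.1/1.2; the actual
// benchmark settings (parallelism, memory, etc.) are not reproduced here.
val conf = new SparkConf()
  .setAppName("graysort-config-sketch")
  .set("spark.shuffle.manager", "sort")                // sort-based shuffle (SPARK-2045)
  .set("spark.shuffle.blockTransferService", "netty")  // Netty-based transport (SPARK-2468)
  .set("spark.shuffle.service.enabled", "true")        // external shuffle service (SPARK-3796)

val sc = new SparkContext(conf)
```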

You can read our earlier blog post to learn more about our winning entry to the competition. Also expect future blog posts on these major new Spark features.

Finally, we thank Aaron Davidson, Norman Maurer, Andrew Wang, Min Zhou, the EC2 and EBS teams from Amazon Web Services, and the Spark community for their help along the way. We also thank the benchmark committee members Chris Nyberg, Mehul Shah, and Naga Govindaraju for their support.
