
A month ago, we shared with you our entry to the 2014 Gray Sort competition, a 3rd-party benchmark measuring how fast a system can sort 100 TB of data (1 trillion records). Today, we are happy to announce that our entry has been reviewed by the benchmark committee and we have officially won the Daytona GraySort contest!

In case you missed our earlier blog post, using Spark on 206 EC2 machines, we sorted 100 TB of data on disk in 23 minutes. In comparison, the previous world record set by Hadoop MapReduce used 2100 machines and took 72 minutes. This means that Apache Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark's in-memory cache. Our entry tied with that of a UCSD research team building high-performance systems, and we jointly set a new world record.
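For readers curious what a disk-to-disk sort looks like in Spark, here is a minimal sketch. It is not our benchmark entry (which used tuned record handling and partitioning); it simply assumes GraySort-style records with a 10-byte key prefix stored one per line on HDFS, and the paths and partition count are hypothetical.

```scala
// Minimal disk-to-disk sort sketch (NOT the actual benchmark code).
// Assumes 100-byte records with a 10-byte key prefix, one record per line on HDFS.
import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on older Spark versions)
import org.apache.spark.{SparkConf, SparkContext}

object DiskSortSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("disk-sort-sketch"))

    sc.textFile("hdfs:///graysort/input")                // hypothetical input path on HDFS
      .map(line => (line.take(10), line.drop(10)))       // split into (key, payload)
      .sortByKey(numPartitions = 1000)                   // range-partitioned, shuffle-based sort on disk
      .map { case (key, payload) => key + payload }      // reassemble the record
      .saveAsTextFile("hdfs:///graysort/output")         // hypothetical output path

    sc.stop()
  }
}
```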

|  | Hadoop MR Record | Spark Record | Spark 1 PB |
|---|---|---|---|
| Data Size | 102.5 TB | 100 TB | 1000 TB |
| Elapsed Time | 72 mins | 23 mins | 234 mins |
| # Nodes | 2100 | 206 | 190 |
| # Cores | 50400 physical | 6592 virtualized | 6080 virtualized |
| Cluster disk throughput | 3150 GB/s (est.) | 618 GB/s | 570 GB/s |
| Sort Benchmark Daytona Rules | Yes | Yes | No |
| Network | dedicated data center, 10Gbps | virtualized (EC2) 10Gbps network | virtualized (EC2) 10Gbps network |
| Sort rate | 1.42 TB/min | 4.27 TB/min | 4.27 TB/min |
| Sort rate/node | 0.67 GB/min | 20.7 GB/min | 22.5 GB/min |
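The sort-rate rows follow directly from the sizes, times, and node counts above. A quick back-of-the-envelope check is below; note that the published elapsed times are rounded, so recomputing the Spark record row from a flat 23 minutes lands slightly above the official 4.27 TB/min.

```scala
// Rough sanity check of the sort-rate rows in the table above.
// Elapsed times are rounded, so small discrepancies vs. the official figures are expected.
val runs = Seq(
  ("Hadoop MR record", 102.5, 72.0, 2100),   // (label, size in TB, time in minutes, nodes)
  ("Spark record",     100.0, 23.0,  206),
  ("Spark 1 PB",      1000.0, 234.0, 190)
)

runs.foreach { case (label, sizeTB, minutes, nodes) =>
  val rateTBPerMin  = sizeTB / minutes               // cluster sort rate, TB/min
  val perNodeGBPerMin = rateTBPerMin * 1000 / nodes  // per-node sort rate, GB/min
  println(f"$label%-18s ${rateTBPerMin}%.2f TB/min, ${perNodeGBPerMin}%.1f GB/min per node")
}
```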

Named after Jim Gray, the benchmark workload is resource intensive by any measure: sorting 100 TB of data following the strict rules generates 500 TB of disk I/O and 200 TB of network I/O. Organizations from around the world often build dedicated sort machines (specialized software and sometimes specialized hardware) to compete in this benchmark.

Winning this benchmark as a general, fault-tolerant system marks an important milestone for the Spark project. It demonstrates that Spark is fulfilling its promise to serve as a faster and more scalable engine for data processing of all sizes, from GBs to TBs to PBs. In addition, it validates the work that we and others have been contributing to Spark over the past few years.

Since the inception of Databricks, we have devoted much effort to improving the scalability, stability, and performance of Spark. This benchmark builds upon some of our major recent work in Spark, including sort-based shuffle (SPARK-2045), the new Netty-based transport module (SPARK-2468), and the external shuffle service (SPARK-3796). The first has already been released in Apache Spark 1.1, and the latter two will be part of the upcoming Apache Spark 1.2 release.
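These features are exposed through Spark configuration. The sketch below shows the relevant properties as they were named in the Spark 1.1/1.2 era; defaults and availability vary by release, so treat it as an illustration rather than a recommended configuration.

```scala
// Configuration knobs behind the features mentioned above (Spark 1.1/1.2-era property names).
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("sort-shuffle-config-sketch")
  // Sort-based shuffle (SPARK-2045): available in 1.1, later became the default.
  .set("spark.shuffle.manager", "sort")
  // Netty-based block transfer (SPARK-2468): part of the 1.2 release.
  .set("spark.shuffle.blockTransferService", "netty")
  // External shuffle service (SPARK-3796): serves shuffle files outside executors;
  // also requires the shuffle service to be started on each worker.
  .set("spark.shuffle.service.enabled", "true")
```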

You can read our earlier blog post to learn more about our winning entry to the competition. Also expect future blog posts on these major new Spark features.

Finally, we thank Aaron Davidson, Norman Maurer, Andrew Wang, Min Zhou, the EC2 and EBS teams from Amazon Web Services, and the Spark community for their help along the way. We also thank the benchmark committee members Chris Nyberg, Mehul Shah, and Naga Govindaraju for their support.
