Introducing Apache Spark 2.3
Today we are happy to announce the availability of Apache Spark 2.3.0 on Databricks as part of Databricks Runtime 4.0. We want to thank the Apache Spark community for all their valuable contributions to the Spark 2.3 release. Continuing with the objectives to make Spark faster, easier, and smarter, Spark 2.3 marks a major milestone...
Meltdown and Spectre: Exploits and Mitigation Strategies
In an earlier blog post, we analyzed the performance impact of Meltdown and Spectre on big data workloads in the cloud. In this blog post, we explain these exploits, their mitigation strategies and how they impact Databricks from a security and performance perspective. Meltdown Meltdown breaks a fundamental assumption in operating system security: an...
Meltdown and Spectre’s Performance Impact on Big Data Workloads in the Cloud
Last week, the details of two industry-wide security vulnerabilities, known as Meltdown and Spectre, were released. These exploits enable cross-VM and cross-process attacks by allowing untrusted programs to scan other programs’ memory. On Databricks, the only place where users can execute arbitrary code is in the virtual machines that run Apache Spark clusters. There,...
Databricks Cache Boosts Apache Spark Performance
We are excited to announce the general availability of Databricks Cache, a Databricks Runtime feature as part of the Unified Analytics Platform that can improve the scan speed of your Apache Spark workloads by up to 10x, without any application code changes. In this blog, we introduce the two primary focuses of this new feature: ease-of-use...
Benchmarking Big Data SQL Platforms in the Cloud
For a deeper dive on these benchmarks, watch the webinar featuring Reynold Xin. Performance is often a key factor in choosing big data platforms. Given that SQL is the lingua franca for big data analysis, we wanted to make sure we are offering one of the most performant SQL platforms in our Unified Analytics Platform. In...
A Vision for Making Deep Learning Simple
When MapReduce was introduced 15 years ago, it showed the world a glimpse into the future. For the first time, engineers at Silicon Valley tech companies could analyze the entire Internet. MapReduce, however, provided low-level APIs that were incredibly difficult to use, and as a result, this "superpower" was a luxury — only a small...
Top 5 Reasons for Choosing S3 over HDFS
At Databricks, our engineers guide thousands of organizations to define their big data and cloud strategies. When migrating big data workloads to the cloud, one of the most commonly asked questions is how to evaluate HDFS versus the storage systems provided by cloud providers, such as Amazon’s S3, Microsoft’s Azure Blob Storage, and Google’s Cloud...
Databricks Runtime 3.0 Beta Delivers Cloud Optimized Apache Spark
A major value Databricks provides is the automatic provisioning, configuration, and tuning of clusters of machines that process data. Running on these machines are the Databricks Runtime artifacts, which include Apache Spark and additional software such as Scala, Python, DBIO, and DBES. For customers these artifacts provide value: they relieve them from the onus of...
Processing a Trillion Rows Per Second on a Single Machine: How Can Nested Loop Joins be this Fast?
This blog post describes our experience debugging a failing test case caused by a cross join query running “too fast.” Because the root cause of the failing test case spans multiple layers, from Apache Spark to the JVM JIT compiler, we wanted to share our analysis in this post. Spark as a compiler The vast majority...
Databricks and Apache Spark 2016 Year in Review
In 2016, Apache Spark released its second major version 2.0 and outgrew our wildest expectations: 4X growth in meetup members reaching 240,000 globally, and 2X growth in code contributors reaching 1000. In addition to contributing to the success of Spark, Databricks also had a phenomenal year. We have rolled out a large number of features...