Delta Lake Now Hosted by the Linux Foundation to Become the Open Standard for Data Lakes
At today’s Spark + AI Summit Europe in Amsterdam, we announced that Delta Lake is becoming a Linux Foundation project. Together with the community, the project aims to establish an open standard for managing large amounts of data in data lakes. The Apache 2.0 software license remains unchanged. Delta Lake focuses on improving the reliability...
Introducing Brickchain: Planet-scale Unified Analytics
Today we are excited to announce Brickchain, the next-generation technology for zettabyte-scale analytics that harnesses all the compute power on the planet. Brickchain is the most scalable, secure, and collaborative data technology ever invented. As you may know, Databricks was founded by the original creators of Apache Spark, a unified analytics engine that uses...
Introducing Apache Spark 2.4
UPDATED: 11/19/2018 We are excited to announce the availability of Apache Spark 2.4 on Databricks as part of the Databricks Runtime 5.0. We want to thank the Apache Spark community for all their valuable contributions to the Spark 2.4 release. Continuing with the objectives to make Spark faster, easier, and smarter, Spark 2.4 extends its...
Benchmarking Apache Spark on a Single Node Machine
Apache Spark has become the de facto unified analytics engine for big data processing in a distributed environment. Yet we are seeing more users choose to run Spark on a single machine, often their laptops, to process small to large data sets, rather than opt for a large Spark cluster. This choice is primarily because of the...
Introducing Apache Spark 2.3
Today we are happy to announce the availability of Apache Spark 2.3.0 on Databricks as part of Databricks Runtime 4.0. We want to thank the Apache Spark community for all their valuable contributions to the Spark 2.3 release. Continuing with the objectives to make Spark faster, easier, and smarter, Spark 2.3 marks a major milestone...
Meltdown and Spectre: Exploits and Mitigation Strategies
In an earlier blog post, we analyzed the performance impact of Meltdown and Spectre on big data workloads in the cloud. In this blog post, we explain these exploits, their mitigation strategies, and how they impact Databricks from a security and performance perspective. Meltdown: Meltdown breaks a fundamental assumption in operating system security: an...
Meltdown and Spectre’s Performance Impact on Big Data Workloads in the Cloud
Last week, the details of two industry-wide security vulnerabilities, known as Meltdown and Spectre, were released. These exploits enable cross-VM and cross-process attacks by allowing untrusted programs to scan other programs’ memory. On Databricks, the only place where users can execute arbitrary code is in the virtual machines that run Apache Spark clusters. There,...
Databricks Cache Boosts Apache Spark Performance
We are excited to announce the general availability of Databricks Cache, a Databricks Runtime feature that is part of the Unified Analytics Platform and can improve the scan speed of your Apache Spark workloads by up to 10x, without any application code changes. In this blog, we introduce the two primary focuses of this new feature: ease-of-use...
Benchmarking Big Data SQL Platforms in the Cloud
For a deeper dive on these benchmarks, watch the webinar featuring Reynold Xin. Performance is often a key factor in choosing big data platforms. Given SQL is the lingua franca for big data analysis, we wanted to make sure we are offering one of the most performant SQL platforms in our Unified Analytics Platform. In...
A Vision for Making Deep Learning Simple
When MapReduce was introduced 15 years ago, it showed the world a glimpse into the future. For the first time, engineers at Silicon Valley tech companies could analyze the entire Internet. MapReduce, however, provided low-level APIs that were incredibly difficult to use, and as a result, this "superpower" was a luxury — only a small...