Introducing Low-latency Continuous Processing Mode in Structured Streaming in Apache Spark 2.3
Structured Streaming in Apache Spark 2.0 decoupled micro-batch processing from its high-level APIs for a couple of reasons. First, it made the developer experience with the APIs simpler: the APIs did not have to account for micro-batches. Second, it allowed developers to treat a stream as an infinite table to which they could issue queries as...
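As a minimal PySpark sketch of what switching to continuous mode looks like (the Kafka server, topic, and checkpoint names below are placeholders, and continuous mode in Spark 2.3 supports map-like operations with the Kafka connector):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-sketch").getOrCreate()

# Read a stream from Kafka (server and topic names are placeholders).
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092")
    .option("subscribe", "events")
    .load())

# Swapping the default micro-batch trigger for a continuous one is the
# only change on the write path: records are processed as they arrive,
# with progress checkpointed every second.
query = (events.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092")
    .option("topic", "events-out")
    .option("checkpointLocation", "/tmp/continuous-checkpoint")
    .trigger(continuous="1 second")
    .start())
```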
Introducing Stream-Stream Joins in Apache Spark 2.3
Since we introduced Structured Streaming in Apache Spark 2.0, it has supported joins (inner joins and some types of outer joins) between a streaming and a static DataFrame/Dataset. With the release of Apache Spark 2.3.0, now available in Databricks Runtime 4.0 as part of the Databricks Unified Analytics Platform, we also support stream-stream joins. In this...
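A minimal sketch of the shape a stream-stream join takes (the rate sources, column names, and watermark bounds below are illustrative): each stream gets a watermark, and the join condition bounds how far apart matching events may be, so Spark can age out buffered state.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("stream-stream-join").getOrCreate()

# Two illustrative streams; the rate source stands in for real feeds
# of ad impressions and clicks.
impressions = (spark.readStream.format("rate").load()
    .selectExpr("value AS impressionAdId", "timestamp AS impressionTime")
    .withWatermark("impressionTime", "2 hours"))

clicks = (spark.readStream.format("rate").load()
    .selectExpr("value AS clickAdId", "timestamp AS clickTime")
    .withWatermark("clickTime", "3 hours"))

# Join clicks to the impressions they followed within one hour; the
# watermarks plus the time bound tell Spark when buffered rows can
# never match again and may be dropped.
joined = impressions.join(
    clicks,
    expr("""
        clickAdId = impressionAdId AND
        clickTime >= impressionTime AND
        clickTime <= impressionTime + interval 1 hour
    """))
```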
Announcing Machine Learning Model Export in Databricks
In recent years, machine learning has become ubiquitous in industry and production environments. Both academic and industry institutions had previously focused on training and producing models, but the focus has shifted to productionizing the trained models. Now we hear from more and more machine learning practitioners trying to find the right model deployment options. In...
Apache Spark 2.3 with Native Kubernetes Support
This is a community blog from Anirudh Ramanathan and Palak Bhatia, software engineer and product manager respectively at Google, working on the Kubernetes team. They are part of the group of companies that contributed native Kubernetes support to Apache Spark 2.3. This post is cross-posted on blog.kubernetes.io. Kubernetes and Big Data The open...
Introducing Apache Spark 2.3
Today we are happy to announce the availability of Apache Spark 2.3.0 on Databricks as part of its Databricks Runtime 4.0. We want to thank the Apache Spark community for all their valuable contributions to the Spark 2.3 release. Continuing with the objectives to make Spark faster, easier, and smarter, Spark 2.3 marks a major milestone...
Meltdown and Spectre: Exploits and Mitigation Strategies
In an earlier blog post, we analyzed the performance impact of Meltdown and Spectre on big data workloads in the cloud. In this blog post, we explain these exploits, their mitigation strategies, and how they impact Databricks from a security and performance perspective. Meltdown Meltdown breaks a fundamental assumption in operating system security: an...
Meltdown and Spectre’s Performance Impact on Big Data Workloads in the Cloud
Last week, the details of two industry-wide security vulnerabilities, known as Meltdown and Spectre, were released. These exploits enable cross-VM and cross-process attacks by allowing untrusted programs to scan other programs’ memory. On Databricks, the only place where users can execute arbitrary code is in the virtual machines that run Apache Spark clusters. There,...
Databricks Cache Boosts Apache Spark Performance
We are excited to announce the general availability of Databricks Cache, a Databricks Runtime feature as part of the Unified Analytics Platform that can improve the scan speed of your Apache Spark workloads by up to 10x, without any application code change. In this blog, we introduce the two primary focuses of this new feature: ease-of-use...
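Because the feature requires no application changes, usage reduces to turning it on. A hedged sketch, assuming the cache is toggled through a Spark configuration flag (the spark.databricks.io.cache.enabled key below is our assumption of the relevant setting):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed configuration key: once enabled, subsequent scans on
# supported instance types can be served from the cache with no
# changes to application code.
spark.conf.set("spark.databricks.io.cache.enabled", "true")
```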
The Architecture of the Next CERN Accelerator Logging Service
This is a community guest blog from Jakub Wozniak, a software engineer and project technical lead at the CERN physics laboratory, expounding on and complementing his keynote at Spark Summit EU in Dublin. CERN is a physics laboratory founded in 1954, focused on research, technology, and education in the domain of Fundamental Physics and the Standard Model...
Introducing Pandas UDF for PySpark
This is a guest community post from Li Jin, a software engineer at Two Sigma Investments, LP in New York. This blog is also posted on Two Sigma's website. UPDATE: This blog was updated on Feb 22, 2018, to include some changes. This blog post introduces the Pandas UDFs (a.k.a. Vectorized UDFs) feature in the upcoming...
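As a minimal sketch of the scalar flavor (the function and column names are illustrative): a Pandas UDF receives and returns pandas.Series, so Spark can move data in Arrow batches rather than one row at a time.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()

# A scalar Pandas UDF operates on whole pandas.Series batches,
# avoiding the per-row serialization cost of ordinary Python UDFs.
@pandas_udf('double', PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1

df = spark.range(0, 1000).withColumn('v', plus_one('id'))
df.show(3)
```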