A Comprehensive Look at Dates and Timestamps in Apache Spark™ 3.0
Apache Spark is a very popular tool for processing structured and unstructured data. When it comes to processing structured data, it supports many basic data types, like integer, long, double, string, etc. Spark also supports more complex data types, like Date and Timestamp, which are often difficult for developers to understand. In this blog...
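To make the type discussion concrete, here is a minimal PySpark sketch (not taken from the post itself) that casts string columns to DATE and TIMESTAMP and derives a few calendar fields; the column names and sample values are purely illustrative.

```python
# Minimal sketch: working with DATE and TIMESTAMP columns in PySpark.
# Column names and sample values are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dates-and-timestamps").getOrCreate()

df = spark.createDataFrame(
    [("2020-06-18", "2020-06-18 10:15:00")],
    ["d_str", "ts_str"],
)

# Cast strings to DATE and TIMESTAMP and derive a few common fields.
result = (
    df.select(
        F.to_date("d_str").alias("d"),
        F.to_timestamp("ts_str").alias("ts"),
    )
    .withColumn("year", F.year("d"))
    .withColumn("hour", F.hour("ts"))
)
result.show(truncate=False)
```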
Introducing Apache Spark 3.0
We’re excited to announce that the Apache Spark™ 3.0.0 release is available on Databricks as part of our new Databricks Runtime 7.0. The 3.0.0 release includes over 3,400 patches and is the culmination of tremendous contributions from the open-source community, bringing major advances in Python and SQL capabilities and a focus on ease of use...
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
This is a joint engineering effort between the Databricks Apache Spark engineering team (Wenchen Fan, Herman van Hovell and MaryAnn Xue) and the Intel engineering team (Ke Jia, Haifeng Chen and Carson Wang). See the AQE notebook for a demo of the solution covered below. Over the years, there’s been an extensive and continuous...
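As a rough sketch of what such a demo exercises (not the notebook itself), the snippet below turns on the standard Spark 3.0 AQE flags and runs a join whose plan AQE can re-optimize at runtime; the data and column names are made up.

```python
# Minimal sketch: enabling Adaptive Query Execution in Spark 3.0.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe-demo").getOrCreate()

# AQE re-optimizes query plans at runtime using statistics from completed stages.
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Coalesce shuffle partitions based on the data sizes actually observed at runtime.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

left = spark.range(0, 1000000).withColumnRenamed("id", "key")
right = spark.range(0, 100).withColumnRenamed("id", "key")

# With AQE on, Spark may convert this sort-merge join into a broadcast join
# once it sees how small the right side really is.
joined = left.join(right, "key")
joined.explain()
```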
Now on Databricks: A Technical Preview of Databricks Runtime 7 Including a Preview of Apache Spark 3.0
Introducing Databricks Runtime 7.0 Beta We’re excited to announce that the Apache Spark™ 3.0.0-preview2 release is available on Databricks as part of our new Databricks Runtime 7.0 Beta. The 3.0.0-preview2 release is the culmination of tremendous contributions from the open-source community to deliver new capabilities, performance gains and expanded compatibility for the Spark ecosystem. Using...
How to Work with Avro, Kafka, and Schema Registry in Databricks
In the previous blog post, we introduced the new built-in Apache Avro data source in Apache Spark and explained how you can use it to build streaming data pipelines with the from_avro and to_avro functions. Apache Kafka and Apache Avro are commonly used to build a scalable and near-real-time data pipeline. In this blog post,...
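For orientation, here is a minimal PySpark sketch of the from_avro decoding step, assuming Spark 3.0+ (where from_avro is exposed in pyspark.sql.avro.functions) and the Kafka connector on the classpath; the broker address, topic name, and schema are illustrative, and the Schema Registry integration the post covers is omitted.

```python
# Minimal sketch: decoding Avro-encoded Kafka messages with from_avro.
# Broker address, topic, and schema are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.avro.functions import from_avro
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-avro").getOrCreate()

# Avro schema for the record payload (normally fetched from Schema Registry).
json_format_schema = """
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "value", "type": "string"}
  ]
}
"""

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
)

# Decode the binary Kafka value column into typed columns.
decoded = stream.select(from_avro(col("value"), json_format_schema).alias("event"))
decoded.select("event.id", "event.value").printSchema()
```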
Apache Avro as a Built-in Data Source in Apache Spark 2.4
Apache Avro is a popular data serialization format. It is widely used in the Apache Spark and Apache Hadoop ecosystem, especially for Kafka-based data pipelines. Starting from Apache Spark 2.4 release, Spark provides built-in support for reading and writing Avro data. The new built-in spark-avro module is originally from Databricks’ open source project Avro Data...
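As a quick illustration (not from the post), the built-in data source is used through the short format name "avro"; this assumes the spark-avro module is on the classpath (it ships as a separate artifact, e.g. via --packages, and is included in Databricks Runtime), and the paths below are made up.

```python
# Minimal sketch: writing and reading Avro files with the built-in data source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-io").getOrCreate()

df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "name"])

# Round-trip the DataFrame through Avro files using the short format name "avro".
df.write.format("avro").mode("overwrite").save("/tmp/events_avro")
roundtrip = spark.read.format("avro").load("/tmp/events_avro")
roundtrip.show()
```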
Introducing Apache Spark 2.4
UPDATED: 11/19/2018 We are excited to announce the availability of Apache Spark 2.4 on Databricks as part of the Databricks Runtime 5.0. We want to thank the Apache Spark community for all their valuable contributions to the Spark 2.4 release. Continuing with the objectives to make Spark faster, easier, and smarter, Spark 2.4 extends its...
Learn about Apache Spark’s Memory Model and Spark’s State in the Cloud
Since Apache Spark 1.6, as part of Project Tungsten, we have made an ongoing effort to substantially improve the memory and CPU efficiency of Apache Spark’s backend execution and to push performance closer to the limits of modern hardware. This effort culminated in Apache Spark 2.0 with the Catalyst optimizer and whole-stage code generation. Because Spark...
Cost Based Optimizer in Apache Spark 2.2
This is a joint engineering effort between Databricks’ Apache Spark engineering team (Sameer Agarwal and Wenchen Fan) and Huawei’s engineering team (Ron Hu and Zhenhua Wang). Apache Spark 2.2 recently shipped with a state-of-the-art cost-based optimization framework that collects and leverages a variety of per-column data statistics (e.g., cardinality, number of distinct values, NULL values,...
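A minimal sketch of the statistics-collection side, using an illustrative table name and columns: the ANALYZE TABLE commands below gather the table- and column-level statistics the cost model consumes, and spark.sql.cbo.enabled turns the optimizer on (it is off by default).

```python
# Minimal sketch: collecting statistics for the cost-based optimizer (Spark 2.2+).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cbo-demo").getOrCreate()

# The cost-based optimizer is off by default; enable it explicitly.
spark.conf.set("spark.sql.cbo.enabled", "true")

# An illustrative table to collect statistics on.
spark.range(0, 10000).selectExpr("id", "id % 100 AS bucket") \
    .write.mode("overwrite").saveAsTable("events")

# Table-level statistics (row count, size) and per-column statistics
# (distinct values, NULL counts, min/max) feed the optimizer's cost model.
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS id, bucket")

# Inspect the collected table-level statistics.
spark.sql("DESCRIBE EXTENDED events").show(truncate=False)
```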
Scalable Partition Handling for Cloud-Native Architecture in Apache Spark 2.1
Apache Spark 2.1 is just around the corner: the community is going through the voting process for the release candidates. This blog post discusses one of the most important features in the upcoming release: scalable partition handling. Spark SQL lets you query terabytes of data with a single job. Oftentimes, though, users only want to...
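As a hypothetical illustration of the scenario (not code from the post), the sketch below writes a table partitioned by a day column and runs a query that touches only one partition; the table and column names are made up.

```python
# Minimal sketch: a partitioned table where a query reads only a subset of
# partitions, the case that scalable partition handling targets.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-pruning").getOrCreate()

events = spark.range(0, 1000).selectExpr("id", "CAST(id % 30 AS STRING) AS day")

# Write a table partitioned by `day`; the catalog tracks one entry per partition.
events.write.mode("overwrite").partitionBy("day").saveAsTable("events_by_day")

# Only the matching partitions need to be listed and scanned,
# rather than the whole table.
spark.sql("SELECT count(*) FROM events_by_day WHERE day = '7'").show()
```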