Enabling Spark SQL DDL and DML in Delta Lake on Apache Spark 3.0
Last week, we had a fun Delta Lake 0.7.0 + Apache Spark 3.0 AMA where Burak Yavuz, Tathagata Das, and Denny Lee provided a recap of Delta Lake 0.7.0 and answered your Delta Lake questions. The theme for this AMA was that the release of Delta Lake 0.7.0 coincided with the release of Apache Spark 3.0...
Interoperability between Koalas and Apache Spark
Koalas is an open source project which provides a drop-in replacement for pandas, enabling efficient scaling out to hundreds of worker nodes for everyday data science and machine learning. After more than a year of development since it was first introduced, Koalas 1.0 was released. pandas is a Python package commonly used among data...
A look at the new Structured Streaming UI in Apache Spark 3.0
This is a guest community post from Genmao Yu, a software engineer at Alibaba. Structured Streaming was initially introduced in Apache Spark 2.0. It has proven to be the best platform for building distributed stream processing applications. The unification of SQL/Dataset/DataFrame APIs and Spark’s built-in functions makes it easy for developers to achieve their complex...
Allow Simple Cluster Creation with Full Admin Control Using Cluster Policies
What is a Databricks cluster policy? A Databricks cluster policy is a template that restricts the way users interact with cluster configuration. Today, any user with cluster creation permissions is able to launch an Apache Spark™ cluster with any configuration. This leads to a few issues: Administrators are forced to choose between control and flexibility....
Time Traveling with Delta Lake: A Retrospective of the Last Year
Try out Delta Lake 0.7.0 with Spark 3.0 today! It has been a little more than a year since Delta Lake became an open-source project under the Linux Foundation. While a lot has changed over the last year, the challenges for most data lakes remain stubbornly the same - the inherent unreliability of data...
Customer Lifetime Value Part 1: Estimating Customer Lifetimes
Download the Customer Lifetimes Part 1 notebook to demo the solution covered below, and watch the on-demand virtual workshop to learn more. You can also go to Part 2 to learn how to estimate future customer spend. The biggest challenge every marketer faces is how to best spend money to profitably grow their brand....
Vectorized R I/O in Upcoming Apache Spark 3.0
R is one of the most popular computer languages in data science, specifically dedicated to statistical analysis with a number of extensions, such as RStudio addins and other R packages, for data processing and machine learning tasks. Moreover, it enables data scientists to easily visualize their data set. By using SparkR in Apache Spark™, R...
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
This is a joint engineering effort between the Databricks Apache Spark engineering team — Wenchen Fan, Herman van Hovell and MaryAnn Xue — and the Intel engineering team — Ke Jia, Haifeng Chen and Carson Wang. See the AQE notebook to demo the solution covered below. Over the years, there’s been an extensive and continuous...
Schema Evolution in Merge Operations and Operational Metrics in Delta Lake
Try this notebook to reproduce the steps outlined below. We recently announced the release of Delta Lake 0.6.0, which introduces schema evolution and performance improvements in merge and operational metrics in table history. The key features in this release are: Support for schema evolution in merge operations (#170) - You can now automatically evolve the...
Shrink Training Time and Cost Using NVIDIA GPU-Accelerated XGBoost and Apache Spark™ on Databricks
Guest Blog by Niranjan Nataraja and Karthikeyan Rajendran of NVIDIA. Niranjan Nataraja is a lead data scientist at NVIDIA and specializes in building big data pipelines for data science tasks and creating mathematical models for data center operations and cloud gaming services. Karthikeyan Rajendran is the lead product manager for NVIDIA’s Spark team. This blog...