Engineering blog

Apache Spark ❤️ Apache DataSketches: New Sketch-Based Approximate Distinct Counting

In this blog post, we'll explore a set of advanced SQL functions available within Apache Spark that leverage the HyperLogLog algorithm, enabling...

Multiple Stateful Operators in Structured Streaming

August 7, 2023 by Angela Chu and Jungtaek Lim in Engineering Blog
In the world of data engineering, there are operations that have been used since the birth of ETL. You filter. You join. You...

Seamlessly Migrate Your Apache Parquet Data Lake to Delta Lake

Apache Parquet is one of the most popular open source file formats in the big data world today. Being column-oriented, Apache Parquet allows...

Unifying Your Data Ecosystem with Delta Lake Integration

As organizations mature their data infrastructure and accumulate more data than ever before in their data lakes, open and reliable table formats...

Announcing Terraform Databricks modules

The Databricks Terraform provider reached more than 10 million installations, significantly increasing adoption since it became generally available less than one year ago...

Processing data simultaneously from multiple streaming platforms using Delta Live Tables

One of the major imperatives of organizations today is to enable decision making at the speed of business. Business teams and autonomous decisioning...

Introducing Apache Spark™ 3.4 for Databricks Runtime 13.0

Today, we are happy to announce the availability of Apache Spark™ 3.4 on Databricks as part of Databricks Runtime 13.0. We extend...

Building the Lakehouse for Healthcare and Life Sciences - Processing DICOM images at scale with ease

One of the biggest challenges in understanding patient health status and disease progression is unlocking insights from the vast amounts of semi-structured and...

Build Reliable and Cost Effective Streaming Data Pipelines With Delta Live Tables’ Enhanced Autoscaling

This year we announced the general availability of Delta Live Tables (DLT), the first ETL framework to use a simple, declarative approach...

Memory Profiling in PySpark

There are many factors in a PySpark program's performance. PySpark supports various profiling tools to expose tight loops of your program and allow...