How Apache Spark 3.0 and Delta Lake Enhance Data Lake Reliability

On-demand webinar

Apache Spark has become the de facto open source standard for big data processing thanks to its ease of use and performance. The open source Delta Lake project improves Spark’s data reliability with new capabilities such as ACID transactions, schema enforcement, and time travel.

These capabilities help ensure that data lakes and data pipelines deliver high-quality, reliable data to downstream data teams for successful data analytics and machine learning projects.

Join us in this webinar to learn how Apache Spark 3.0 and Delta Lake enhance data lake reliability. We will also walk through updates in the Apache Spark 3.0 release as part of our new Databricks Runtime 7.0 Beta.

Topics to be covered include:

  • Apache Spark’s usage for big data processing
  • The evolution and technical challenges around data lake architectures
  • Delta Lake’s capabilities for ensuring reliable data for Spark processing
  • Simplifying architectures with unified batch and streaming
  • The new Adaptive Query Execution (AQE) framework within Spark 3.0 can yield query performance gains. Based on a 3TB TPC-DS benchmark, two queries had more than a 1.5x speedup, and another 37 queries had more than a 1.1x speedup.
  • With Dynamic Partition Pruning (DPP), we can significantly speed up performance by pruning partitions based on the joins between the fact and dimension tables common in star schema design.
  • Accelerator-aware Scheduling helps Spark take advantage of GPUs and other hardware accelerators for certain workloads (e.g., deep learning). This release enhances the scheduler and makes the cluster manager accelerator-aware.
  • Spark 3.0 also introduces new Pandas UDF types and new Pandas function APIs for improved performance and usability.
  • Enhanced monitoring capabilities, including the new UI for Structured Streaming, an enhanced EXPLAIN command, and observable metrics.
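
To give a feel for the idea behind Dynamic Partition Pruning, here is a minimal sketch in plain Python. It is a toy model, not Spark's implementation: the tables, the `date_id` partition key, and the `join_with_pruning` helper are all hypothetical and exist only for illustration. The filter on the small dimension table is evaluated first, and only the fact-table partitions whose keys survive that filter are scanned.

```python
# Toy model of dynamic partition pruning (DPP) in plain Python.
# All names and rows below are made up for illustration; this is NOT Spark code.

# Hypothetical fact table, stored as partitions keyed by date_id.
fact_partitions = {
    1: [{"date_id": 1, "amount": 10}, {"date_id": 1, "amount": 5}],
    2: [{"date_id": 2, "amount": 7}],
    3: [{"date_id": 3, "amount": 20}],
    4: [{"date_id": 4, "amount": 1}],
}

# Hypothetical dimension table (small compared to the fact table).
dim_date = [
    {"date_id": 1, "quarter": "Q1"},
    {"date_id": 2, "quarter": "Q1"},
    {"date_id": 3, "quarter": "Q2"},
    {"date_id": 4, "quarter": "Q2"},
]


def join_with_pruning(fact_partitions, dim_rows, dim_predicate):
    """Join fact and dimension rows, scanning only the partitions that can match."""
    # Step 1: apply the filter to the dimension table and index the survivors by key.
    matching_dims = {row["date_id"]: row for row in dim_rows if dim_predicate(row)}
    # Step 2: keep only the fact partitions whose key appears in the filtered dimension.
    scanned = [key for key in fact_partitions if key in matching_dims]
    # Step 3: join the rows of the surviving partitions with their dimension row.
    joined = [
        {**fact_row, **matching_dims[key]}
        for key in scanned
        for fact_row in fact_partitions[key]
    ]
    return joined, scanned


joined, scanned = join_with_pruning(
    fact_partitions, dim_date, lambda row: row["quarter"] == "Q2"
)
total = sum(row["amount"] for row in joined)
print(f"scanned partitions: {scanned}, total amount: {total}")
```

In this toy data set, only two of the four fact partitions are read for the Q2 query. In a real star schema the fact table is usually far larger than the dimension table, which is why skipping partitions in this way can pay off.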

Register now to learn more about the latest contributions from the Spark community for fast and scalable data processing, as well as how you can try them out today on Databricks for free.

Save your spot today!

Speaker

  • Denny Lee, Staff Developer Advocate