Building Robust Production Data Pipelines with Databricks Delta

Most data practitioners grapple with data quality issues and data pipeline complexity; it’s the bane of their existence. Data engineers, in particular, strive to design and deploy robust data pipelines that serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.

Databricks Delta, part of Databricks Runtime, is a next-generation unified analytics engine built on top of Apache Spark. Built on open standards, Delta employs co-designed compute and storage and is compatible with Spark APIs. It delivers high data reliability and query performance for big data use cases ranging from batch and streaming ingest and fast interactive queries to machine learning. In this tutorial we will discuss the requirements of modern data pipelines, the reliability and performance challenges data engineers face, and how Delta can help. Through presentations, code examples, and notebooks, we will explain these pipeline challenges and show how to use Delta to address them. You will walk away with an understanding of how to apply Delta to your data architecture and the benefits you can gain.

This tutorial will be both an instructor-led and a hands-on interactive session. Instructions on how to get the tutorial materials will be covered in class.

WHAT YOU’LL LEARN:
– Understand the key data reliability and performance challenges of data pipelines
– How Databricks Delta helps build robust pipelines at scale
– Understand how Delta fits within an Apache Spark™ environment
– How to use Delta to realize data reliability improvements
– How to deliver performance gains using Delta

PREREQUISITES:
– A fully charged laptop (8-16 GB memory) with Chrome or Firefox
– Pre-register for Databricks Community Edition

 

See More Spark + AI Summit in San Francisco 2019 Videos


About Joe Widen

Joe Widen is a Solutions Architect at Databricks. Joe leads the Performance and Delta SME horizontal initiatives and helps customers succeed with the Databricks Unified Analytics Platform. Joe has been working with Spark and, more generally, Hadoop for 5 years, with previous stops at Hortonworks and Capital One.

About Steven Yu

Steven is a Senior Solutions Architect who focuses on helping large enterprises modernize their data pipelines and execute their data lake strategies with Apache Spark using Databricks. He has led multiple Spark workshops and spoken at meetups in the Southwest region, and continues to help train the growing Field Engineering team at Databricks. He works with customers across many industries and verticals, including media/entertainment, financial services, automotive, and gaming. Steven's background is in Software Engineering, Data Engineering, and Data Warehousing, with a strong focus on performance tuning, and he has over 14 years of industry experience.

About Burak Yavuz

Burak Yavuz is a Software Engineer and Apache Spark committer at Databricks. He has been developing Structured Streaming and Delta Lake to simplify the lives of Data Engineers. Burak received his MS in Management Science & Engineering at Stanford and his BS in Mechanical Engineering at Bogazici University, Istanbul.