Last week, we hosted a virtual event highlighting Delta Lake, an open source storage layer that brings reliability, performance and security to your data lake. We had amazing engagement from the audience, with almost 200 thoughtful questions submitted! While we can’t answer them all in this blog, we wanted to share answers to some of the most popular ones. For those who weren’t able to attend, feel free to take a look at the on-demand version here.
For those who aren’t familiar, Delta Lake is an open format, transactional storage layer that forms the foundation of a lakehouse. Delta Lake delivers reliability, security and performance on your data lake — for both streaming and batch operations — and eliminates data silos by providing a single home for structured, semi-structured and unstructured data. Ultimately, by making analytics, data science and machine learning simple and accessible across the enterprise, Delta Lake is the foundation and enabler of a cost-effective, highly-scalable lakehouse architecture.
Before diving into your questions, let’s start by establishing the difference between a data warehouse, a data lake and a lakehouse:
Data warehouses are data management systems with a structured format, designed to support business intelligence. They work well for structured data, but the world's data continues to grow more complex, and data warehouses are not suited for many of today's use cases, which increasingly involve a wide variety of data types. On top of that, data warehouses are expensive and lock users into a proprietary format.
Data lakes were developed in response to the challenges of data warehouses and have the ability to collect large amounts of data from many different sources in a variety of formats. While suitable for storing data and keeping costs low, data lakes lack some critical reliability and performance features like transactions, data quality enforcement and consistency/isolation, ultimately leading to severe limitations in their usability.
A lakehouse brings the best of data warehouses and data lakes together, all through an open and standardized system design. By adding a transaction layer on top of your data lake, you can enable critical capabilities like ACID transactions, schema enforcement/evolution and data versioning that provide reliability, performance and security to your data lake. A lakehouse is a scalable, low-cost option that unifies data, analytics and AI.
How does Delta Lake compare to other transactional storage layers?
While Delta Lake and other transactional storage layers aim to solve many of the same challenges, Delta Lake has broader use case coverage across the data ecosystem. In addition to bringing reliability, performance and security to data lakes, Delta Lake provides a unified framework for batch and streaming workloads, improving efficiency not only in data transformation pipelines, but also in downstream activities like BI, data science and ML. Using Delta Lake on Databricks provides, among other benefits, better performance with Delta Engine, better security and governance with fine-grained access controls, and broader ecosystem support with faster native connectors to the most popular BI tools. Finally, Delta Lake on Databricks has been battle-tested and used in production for over 3 years by thousands of customers. Every day, Delta Lake ingests at least 3PB of data.
How do I ingest data into Delta Lake?
Ingesting data into Delta Lake is very easy. You can automatically ingest new data files into Delta Lake as they land in your data lake (e.g. on S3 or ADLS) using Databricks Auto Loader or the COPY INTO command with SQL. You can also use Apache Spark™ to batch read your data, perform any transformations and save the result in Delta Lake format. Learn more about ingesting data into Delta Lake here.
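As a rough illustration, here is a minimal PySpark sketch of the batch path: read raw files, apply a transformation and save the result in Delta Lake format. The bucket paths, column names and transformation below are hypothetical placeholders, and the sketch assumes a Spark environment where Delta Lake is already configured (as it is on Databricks).

```python
from pyspark.sql import SparkSession, functions as F

# Assumes Delta Lake is already set up for this Spark session (as on Databricks).
spark = SparkSession.builder.appName("delta-batch-ingest").getOrCreate()

# Hypothetical raw files landing in the data lake.
raw = spark.read.json("s3://my-bucket/raw/events/")

# Example transformation: parse timestamps and drop rows missing an ID.
cleaned = (
    raw.withColumn("event_time", F.to_timestamp("event_time"))
       .filter(F.col("event_id").isNotNull())
)

# Save the result in Delta Lake format.
cleaned.write.format("delta").mode("append").save("s3://my-bucket/delta/events/")
```

Auto Loader and COPY INTO cover the incremental case, discovering newly arrived files for you so your pipeline doesn't have to track which files have already been loaded.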
Is Delta Lake on Databricks suitable for BI and reporting use cases?
Yes, Delta Lake works great with BI and reporting workloads. To address this use case in particular, we recently announced the release of SQL Analytics, which is currently in public preview. SQL Analytics is designed specifically for BI use cases and enables customers to perform analytics directly on their data lake. So if you have a lot of users querying your Delta Lake tables, we suggest taking a look at SQL Analytics. You can either leverage the built-in query and dashboarding capabilities or connect your favorite BI tool with native, optimized connectors.
Apart from data engineering, does Delta Lake help with ML and training ML models?
Yes, Delta Lake provides the ability to version your datasets, which is a really important feature when it comes to reproducibility. Being able to pin a model to a specific version of your dataset is extremely valuable because it allows other members of your data team to step in, reproduce your model training and confirm they get the exact same results. It also ensures you are retraining on the exact same data, and the exact same version of that data, that the original model was trained on. Learn more about ML on Databricks.
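For example, Delta Lake's time travel lets a training job pin itself to an exact snapshot of a table. The sketch below is illustrative only; the path and version number are placeholders.

```python
from pyspark.sql import SparkSession

# Assumes Delta Lake is already configured (as on Databricks).
spark = SparkSession.builder.appName("reproducible-training").getOrCreate()

# Load the exact table version the original model was trained on.
training_df = (
    spark.read.format("delta")
         .option("versionAsOf", 12)  # placeholder version number
         .load("s3://my-bucket/delta/features/")
)

# The same snapshot can also be addressed by timestamp:
#   .option("timestampAsOf", "2021-01-15")
```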
How does Delta Lake help with compliance? And how does Delta Lake handle previous versions of data on delete for GDPR and CCPA support?
Delta Lake gives you the power to purge individual records from the underlying files in your data lake, which has tremendous implications for regulations like CCPA and GDPR.
When it comes to targeted deletion, in many cases businesses will actually want those deletions to propagate down to their cloud object store. By leveraging a managed table in Delta Lake, where the data is managed by Databricks, deletions are propagated down to your cloud object store.
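As a rough sketch of that workflow (the table and column names are made up): first delete the matching records, then run VACUUM so the underlying files that still contain the old row versions are physically removed once they fall outside the table's retention period (7 days by default).

```python
from pyspark.sql import SparkSession

# Assumes Delta Lake is already configured (as on Databricks).
spark = SparkSession.builder.appName("gdpr-delete").getOrCreate()

# Logically delete one user's records from a (hypothetical) Delta table.
spark.sql("DELETE FROM customers WHERE user_id = 'user-123'")

# Physically remove data files that are no longer referenced by the
# current table version and are older than the retention threshold.
spark.sql("VACUUM customers")
```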
Does Delta Lake provide access controls for security and governance?
Yes, with Delta Lake on Databricks, you can use access control lists (ACLs) to configure permissions to access workspace objects (folders, notebooks, experiments and models), clusters, pools, jobs, and data objects such as schemas, tables and views. Admins can manage access control lists, as can users who have been delegated permission to do so. Learn more about data governance best practices on Databricks.
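For instance, with table access control enabled on Databricks, data permissions are granted with standard SQL statements. This is a hedged sketch; the principal and object names below are illustrative only.

```python
from pyspark.sql import SparkSession

# Assumes a Databricks cluster with table access control enabled.
spark = SparkSession.builder.appName("table-acls").getOrCreate()

# Give one user read-only access to a single (hypothetical) table.
spark.sql("GRANT SELECT ON TABLE sales.orders TO `analyst@example.com`")

# Allow a group to reference objects in the database.
spark.sql("GRANT USAGE ON DATABASE sales TO `bi-team`")
```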
How does Delta Lake help with streaming vs. batch operations?
With Delta Lake, you can run both batch and streaming operations on one simplified architecture that avoids complex, redundant systems and operational challenges. A table on Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill and interactive queries all work out of the box and directly integrate with Spark Structured Streaming.
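A minimal Structured Streaming sketch of that pattern follows; the paths are placeholders, and it assumes Delta Lake is already configured (as on Databricks).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-streaming").getOrCreate()

# Use an existing Delta table as a streaming source: rows appended to the
# table are picked up continuously.
events = spark.readStream.format("delta").load("s3://my-bucket/delta/events/")

# Write the stream into another Delta table; the checkpoint location gives
# exactly-once processing across restarts.
query = (
    events.writeStream
          .format("delta")
          .outputMode("append")
          .option("checkpointLocation", "s3://my-bucket/checkpoints/events_copy/")
          .start("s3://my-bucket/delta/events_copy/")
)

# The same source table can still be queried as a batch table at any time.
batch_view = spark.read.format("delta").load("s3://my-bucket/delta/events/")
```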
This is just a small sample of the amazing engagement we received from attendees during the event. If you were able to join live, thank you for taking the time to learn about Delta Lake and how it forms the foundation of a scalable, cost-efficient lakehouse. If you haven’t had a chance to check out the event, you can view it here.