If you examine the agenda of any of the Spark Summits from the past five years, you will notice that there is no shortage of talks on how best to architect a data lake in the cloud, using Apache Spark™ as the ETL and query engine and Apache Parquet as the preferred file format. There are talks that give advice on how (and how not) to partition your data, how to calculate the ideal file size, how to handle evolving schemas, how to build compaction routines, how to recover from failed ETL jobs, how to stream raw data into the data lake, and more.
Databricks has been working with customers throughout this time to encapsulate all of the best practices of a data lake implementation into Delta Lake, which was open-sourced at Spark + AI Summit in 2019. There are many benefits to converting an Apache Parquet data lake to a Delta Lake, but this blog will focus on the top 5 reasons:
- Prevent Data Corruption
- Faster Queries
- Increase Data Freshness
- Reproduce ML Models
- Achieve Compliance
Fundamentally, Delta Lake maintains a transaction log alongside the data. This enables each Delta Lake table to have ACID-compliant reads and writes.
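As a minimal sketch of what that looks like in practice, the snippet below assumes a Delta-enabled Spark environment (e.g., a Databricks cluster) and uses a made-up path. Writing a Delta table creates the data files plus a `_delta_log/` directory where every commit is recorded, and that log can be inspected directly.

```python
# A minimal sketch, assuming Delta Lake is already configured on the Spark session
# (as on Databricks); the table path is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Writing a Delta table creates Parquet data files plus a _delta_log/ directory,
# where every commit is recorded as a JSON entry.
spark.range(0, 1000).write.format("delta").mode("overwrite").save("/tmp/events_delta")

# DESCRIBE HISTORY surfaces that transaction log as a queryable record of operations.
spark.sql("DESCRIBE HISTORY delta.`/tmp/events_delta`").show(truncate=False)
```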
Prevent Data Corruption
Data lakes were originally conceived as an on-premises big data complement to the data warehouse, built on top of HDFS rather than in the cloud. When the same design pattern was replicated onto blob storage, such as Amazon Web Services (AWS) S3, unique challenges ensued because of its eventual consistency properties. The original Hadoop commit protocol assumes an atomic RENAME operation for transactions, which exists on HDFS but not on S3. This forced engineers to choose between two Hadoop commit protocols: one safe but slow, the other fast but unsafe. Prior to releasing Delta Lake, Databricks developed its own commit protocol to address this. The 2017 Spark Summit presentation, Transactional Writes to Cloud Storage, explains the challenges and our solution at the time.
Delta Lake was designed from the beginning to accommodate blob storage, along with the eventual consistency behavior that comes with it, without sacrificing data quality. If an ETL job against a Delta Lake table fails before fully completing, it will not corrupt any queries: each SQL query always sees a consistent state of the table. This allows an enterprise data engineer to troubleshoot why an ETL job failed, fix it, and re-run it without needing to alert users, purge partially written files, or reconcile to a previous state.
Before Delta Lake, a common design pattern was to partition the first stage of data by a batch ID so that, if a failure occurred during ingestion, the partition could be dropped and a new one created on retry. Although this pattern helps with ETL recoverability, it usually results in many partitions containing only a few small Parquet files, which impedes downstream query performance. This is typically rectified by duplicating the data into other tables with broader partitions. Delta Lake still supports partitions, but you only need to match them to expected query patterns, and only if each partition contains a substantial amount of data. This eliminates many partitions and improves performance by scanning fewer files.
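The sketch below illustrates this partitioning approach; the source path, target path, and `event_date` column are made up for illustration, and a Delta-enabled Spark environment is assumed.

```python
# A minimal sketch, assuming a Delta-enabled Spark environment; paths and the
# event_date column are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
events = spark.read.format("parquet").load("/tmp/raw_events")  # hypothetical raw data

# Partition only by a column that downstream queries actually filter on,
# not by an ingestion batch id.
(events.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("event_date")
    .save("/tmp/events_delta"))

# Queries that filter on event_date scan only the matching partitions.
spark.read.format("delta").load("/tmp/events_delta") \
    .where("event_date = '2020-07-01'") \
    .count()
```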
Spark allows you to merge different Parquet schemas together with the mergeSchema option. With a regular Parquet data lake, the schema can differ across partitions, but not within a partition. A Delta Lake table does not have this same constraint: it gives the engineer a choice to either allow the schema of a table to evolve or to enforce a schema on write. If an incompatible schema change is detected, Delta Lake throws an exception and prevents the table from being corrupted with columns of incompatible types. Additionally, a Delta Lake table can include NOT NULL constraints on columns, which cannot be enforced on a regular Parquet table. This prevents records from being loaded with NULL values in columns that require data (values that could break downstream processes).
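As a hedged illustration of schema enforcement versus opt-in evolution, the sketch below uses made-up paths and a hypothetical `user_id` column; the NOT NULL constraint syntax requires a recent Delta Lake / Databricks Runtime version.

```python
# A minimal sketch, assuming a Delta-enabled Spark environment; paths and the
# user_id column are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
updates = spark.read.format("parquet").load("/tmp/updates")  # hypothetical incoming batch

try:
    # Default behavior: if `updates` has a new or incompatibly typed column,
    # Delta rejects the append with an exception instead of corrupting the table.
    updates.write.format("delta").mode("append").save("/tmp/events_delta")
except Exception as e:
    print(f"Schema enforcement blocked the write: {e}")

# Opt-in evolution: allow compatible new columns to be added to the table schema.
(updates.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/tmp/events_delta"))

# Column-level constraint: reject any future write that leaves user_id NULL
# (requires a recent Delta Lake / Databricks Runtime version).
spark.sql("ALTER TABLE delta.`/tmp/events_delta` ALTER COLUMN user_id SET NOT NULL")
```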
One final way that Delta Lake prevents data corruption is by supporting the MERGE statement. Many tables are structured to be append-only; however, it is not uncommon for duplicate records to enter a pipeline. With a MERGE statement, a pipeline can be configured to INSERT new records and ignore records that are already present in the Delta table.
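A minimal sketch of that pattern, assuming an illustrative `event_id` key column and made-up paths:

```python
# A minimal sketch: use MERGE to insert only records whose event_id is not
# already present, so a replayed batch does not create duplicates.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forPath(spark, "/tmp/events_delta")
new_batch = spark.read.format("parquet").load("/tmp/incoming_batch")  # hypothetical

(target.alias("t")
    .merge(new_batch.alias("s"), "t.event_id = s.event_id")
    .whenNotMatchedInsertAll()   # rows already present in the table are ignored
    .execute())
```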
Faster Queries
Delta Lake has several properties that can make the same query much faster than it would be against regular Parquet. Rather than performing an expensive LIST operation on the blob storage for every query, as a regular Parquet reader must, Delta Lake uses its transaction log as the manifest of files in the table.
The transaction log not only tracks the Parquet filenames but also centralizes their statistics: the minimum and maximum values of each column, taken from the Parquet file footers. This allows Delta Lake to skip reading files entirely when it can determine that they cannot match the query predicate.
Another technique for skipping unnecessary data is to physically organize the data so that query predicates map to only a small number of files. This is the idea behind the ZORDER reorganization of data, and it enables fast queries on columns that are not part of the partition key. The combination of these data-skipping techniques is explained in the 2018 blog:
Processing Petabytes of Data in Seconds with Databricks Delta.
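As a hedged illustration, the sketch below assumes an environment where OPTIMIZE ... ZORDER BY is available (Databricks, or a recent Delta Lake release); the table path and `user_id` column are made up.

```python
# A minimal sketch: co-locate data on a frequently filtered, non-partition column
# so that per-file min/max statistics in the transaction log let Delta skip most files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reorganize the data files so that values of user_id are clustered together.
spark.sql("OPTIMIZE delta.`/tmp/events_delta` ZORDER BY (user_id)")

# A predicate on user_id now touches only the files whose min/max range could
# contain that value.
spark.read.format("delta").load("/tmp/events_delta") \
    .where("user_id = 42") \
    .count()
```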
At Spark + AI Summit 2020, Databricks announced Delta Engine, which adds even more performance enhancements. It has an intelligent caching layer that caches data on a cluster's SSD/NVMe drives as it is ingested, making subsequent queries on the same data faster, and an enhanced query optimizer that speeds up common query patterns. The biggest innovation, however, is Photon, a native vectorized execution engine written in C++. Together, these components give Delta Engine a significant performance gain over Apache Spark while keeping the same open APIs.
Because Delta Lake is an open-source project, a community has been forming around it and other query engines have built support for it. If you’re already using one of these query engines, you can start to leverage Delta Lake and achieve some of the benefits immediately.
- Apache Hive
- Azure Synapse Analytics
- Presto and AWS Athena
- AWS Redshift Spectrum
- Snowflake
- Starburst Enterprise Presto
Increase Data Freshness
Many Parquet data lakes are refreshed every day, sometimes every hour, but rarely every minute. Sometimes this is tied to the grain of an aggregation, but more often it reflects the technical difficulty of streaming real-time data into a data lake. Delta Lake was designed from the beginning to accommodate both batch and streaming ingestion. By using Structured Streaming with Delta Lake, you get built-in checkpointing when transforming data from one Delta table to another, and changing a single setting, the Trigger, switches the ingestion between batch and streaming.
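A minimal sketch of a Delta-to-Delta stream, with made-up paths; only the trigger line changes between a continuously running micro-batch stream and a run-once, batch-style job.

```python
# A minimal sketch, assuming a Delta-enabled Spark environment; paths are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

bronze = spark.readStream.format("delta").load("/tmp/bronze_events")

query = (bronze
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/silver_events")  # built-in checkpointing
    .trigger(processingTime="1 minute")   # or .trigger(once=True) for a batch-style run
    .start("/tmp/silver_events"))
```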
One challenge in accommodating streaming ingestion is that more frequent writes generate many very small files, which negatively impacts downstream query performance. As a general rule, it is faster to query a small number of large files than a large number of small files. Each table should aim for a uniform Parquet file size, typically somewhere between 128 MB and 512 MB. Over the years, engineers have written their own compaction jobs to merge these small files into larger ones; however, because blob storage lacks transactions, those routines typically run in the middle of the night to avoid breaking downstream queries. Delta Lake can compact these files with a single OPTIMIZE command, and because of ACID compliance, it can run while users are querying the table. Likewise, Delta Lake can leverage auto-optimize to continuously write files at the optimal size.
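As a hedged illustration, the sketch below compacts a table once and then opts it into continuous optimization; the path is made up, and the auto-optimize table properties shown are Databricks-specific.

```python
# A minimal sketch; the table path is illustrative and the auto-optimize
# properties assume Databricks.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One-off compaction of many small files into larger, more uniform Parquet files.
# The operation is transactional, so it can run while users query the table.
spark.sql("OPTIMIZE delta.`/tmp/silver_events`")

# Opt the table into continuously writing right-sized files (Databricks auto-optimize).
spark.sql("""
    ALTER TABLE delta.`/tmp/silver_events` SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```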
Reproduce ML Models
To improve a machine learning model, a data scientist must first reproduce its results, which requires the same logic, parameters, libraries, and environment. This can be particularly daunting if the data scientist who trained the model has since left the company. Databricks developed MLflow in 2018 to solve this problem.
The other element that needs to be tracked for reproducibility is the training and test data sets. Delta Lake's Time Travel feature uses data versioning to let you query the data as it was at a specific point in time, so you can reproduce the results of a machine learning model by retraining it with exactly the same data, without needing to copy that data.
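A minimal sketch of Time Travel reads, with an illustrative path, version number, and timestamp:

```python
# A minimal sketch: read the table exactly as it was when the model was trained,
# either by version number or by timestamp. Path, version, and date are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

training_v12 = (spark.read
    .format("delta")
    .option("versionAsOf", 12)
    .load("/tmp/features_delta"))

training_jan1 = (spark.read
    .format("delta")
    .option("timestampAsOf", "2020-01-01")
    .load("/tmp/features_delta"))
```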
Achieve Compliance
New laws such as GDPR and CCPA require that companies be able to purge a customer's data should the individual request it. Deleting or updating data in a regular Parquet data lake is compute-intensive: all of the files containing the relevant personal data must be identified, ingested, filtered, and written out as new files, and the originals deleted, all without disrupting or corrupting queries on the table.
Delta Lake includes DELETE and UPDATE operations for easy manipulation of the data in a table. For more information, please refer to the article Best practices: GDPR and CCPA compliance using Delta Lake and the tech talk Addressing GDPR and CCPA Scenarios with Delta Lake and Apache Spark™.
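A minimal sketch of those operations, with a made-up table path, column names, and customer ID:

```python
# A minimal sketch: remove or correct a single customer's records without
# hand-rolling a rewrite of the underlying Parquet files. Names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Purge all records for a customer who filed a deletion request.
spark.sql("DELETE FROM delta.`/tmp/customers_delta` WHERE customer_id = 'c-123'")

# Or correct a field in place.
spark.sql("""
    UPDATE delta.`/tmp/customers_delta`
    SET email = 'redacted@example.com'
    WHERE customer_id = 'c-123'
""")
```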
Summary
In summary, there are many benefits to switching your Parquet data lake to a Delta Lake, but the top 5 are:
- Prevent Data Corruption
- Faster Queries
- Increase Data Freshness
- Reproduce ML Models
- Achieve Compliance
One final reason to consider switching your Parquet data lake to a Delta Lake is that the conversion itself is simple and quick with the CONVERT command, and it is equally simple to undo. Give Delta Lake a try today!
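A minimal sketch of the conversion, with an illustrative path and partition column; the existing Parquet files are reused in place and a transaction log is added alongside them.

```python
# A minimal sketch: convert an existing Parquet directory to Delta in place.
# The path and partition column are illustrative; omit PARTITIONED BY for an
# unpartitioned table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CONVERT TO DELTA parquet.`/tmp/events_parquet`
    PARTITIONED BY (event_date DATE)
""")
```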