by Dipankar Kushari and Uday Satapathy
Apache Parquet is one of the most popular open source file formats in the big data world today. Because it is column-oriented, Apache Parquet enables efficient data storage and retrieval, which has led many organizations over the past decade to adopt it as a foundational way to store data in data lakes. Some of these companies went a step further and used Apache Parquet files as 'database tables', performing CRUD operations on them. However, Apache Parquet files are just data files: without transaction logging, statistics collection, and indexing capabilities, they are not good candidates for ACID-compliant database operations. Building such tooling in-house is a monumental task that requires a large development team to develop and maintain it. The result was the Apache Parquet data lake: a makeshift solution at best, suffering from issues such as accidental table corruption arising from brittle ACID compliance.
The solution came in the form of the Delta Lake format, which was designed to solve exactly the problems Apache Parquet data lakes were riddled with. Delta Lake adopted Apache Parquet as its base data storage format and built in the missing transaction logging, statistics collection, and indexing capabilities, providing the much-needed ACID compliance and guarantees. Open source Delta Lake, under the Linux Foundation, has been going from strength to strength, finding wide usage in the industry.
Over time, organizations have realized significant benefits by moving their Apache Parquet data lake to Delta Lake, but doing so takes planning and selection of the right approach to migrate the data. There are also scenarios where a business needs the Apache Parquet data lake to co-exist even after the data has been migrated to Delta Lake. For example, you might have an ETL pipeline that writes data to tables stored in an Apache Parquet data lake, and you need to perform a detailed impact analysis before gradually migrating that data to Delta Lake. Until then, you need to keep your Apache Parquet data lake and Delta Lake in sync. In this blog we will discuss a few such use cases and show you how to tackle them.
You can refer to this Databricks blog series to understand Delta Lake's internal functionality.
The methodology you adopt for migrating an Apache Parquet data lake to Delta Lake depends on one or more migration requirements, which are documented in the matrix below.
| Methods ⇩ \ Requirements ⇨ | Complete overwrite at source | Incremental with append at source | Duplicates data | Maintains data structure | Backfill data | Ease of use |
|---|---|---|---|---|---|---|
| Deep CLONE Apache Parquet | Yes | Yes | Yes | Yes | Yes | Easy |
| Shallow CLONE Apache Parquet | Yes | Yes | No | Yes | Yes | Easy |
| CONVERT TO DELTA | Yes | No | No | Yes | No | Easy |
| Auto Loader | Yes | Yes | Yes | No | Optional | Some configuration |
| Batch Apache Spark job | Custom logic | Custom logic | Yes | No | Custom logic | Custom logic |
| COPY INTO | Yes | Yes | Yes | No | Optional | Some configuration |
Table 1 - Migration methods and the requirements each one satisfies
Now let's discuss the migration requirements and how they impact the choice of migration methodology.
You can use the Databricks deep clone functionality to incrementally convert data from an Apache Parquet data lake to Delta Lake. Use this approach when your requirements match the Deep CLONE row of Table 1: the source may be completely overwritten or incrementally appended to, duplicating the data into a separate copy is acceptable, you want to maintain the existing data structure, and you need to backfill data.
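As a minimal sketch (the source path and target table name below are illustrative placeholders), a deep clone creates a Delta table from a Parquet directory, and re-running the same statement syncs changes from the source incrementally:

```python
# Create a Delta table as a deep clone of a Parquet directory.
# Re-running the statement incrementally copies files added at the source
# since the previous clone operation.
spark.sql("""
  CREATE OR REPLACE TABLE my_catalog.my_schema.events_delta
  DEEP CLONE parquet.`/mnt/datalake/events/`
""")
```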
You can use the Databricks shallow clone functionality to incrementally convert data from an Apache Parquet data lake to Delta Lake when your requirements match the Shallow CLONE row of Table 1: the source may be completely overwritten or incrementally appended to, you want to avoid duplicating the data files, you want to maintain the existing data structure, and you need to backfill data.
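A sketch of a shallow clone, using the same hypothetical path and a placeholder table name, looks almost identical; the difference is that the Delta table references the existing Parquet files rather than copying them:

```python
# Create a Delta table that references the existing Parquet files in place.
# Re-running the statement picks up files added to the source since the last clone.
spark.sql("""
  CREATE OR REPLACE TABLE my_catalog.my_schema.events_delta_shallow
  SHALLOW CLONE parquet.`/mnt/datalake/events/`
""")
```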
You can use the CONVERT TO DELTA feature when your requirements match the CONVERT TO DELTA row of Table 1: you want an in-place conversion that does not duplicate data and maintains the existing data structure. Per the matrix, this method does not accommodate incremental appends at the source or backfills.
Since the source is transformed into a target Delta Lake table in-place, all future CRUD operations on the target table need to happen through Delta Lake ACID transactions.
Note - Please refer to the Caveats before using the CONVERT TO DELTA option. You should avoid updating or appending data files during the conversion process. After the table is converted, make sure all writes go through Delta Lake.
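As a sketch (the path and partition column are illustrative assumptions), an in-place conversion of a partitioned Parquet directory might look like this:

```python
# Convert a partitioned Parquet directory to Delta Lake in place.
# For partitioned data, the partition columns and their types must be declared.
spark.sql("""
  CONVERT TO DELTA parquet.`/mnt/datalake/events/`
  PARTITIONED BY (event_date DATE)
""")
```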
You can use Auto Loader to incrementally copy all data from a given cloud storage directory to a target Delta table. Per Table 1, this approach handles both complete overwrites and incremental appends at the source and can optionally backfill data, but it duplicates the data, does not preserve the existing data structure, and requires some configuration. A sketch follows.
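Assuming a recent Databricks runtime and placeholder paths and table name, a minimal Auto Loader pipeline that ingests Parquet files into a Delta table might look like this:

```python
# Incrementally ingest Parquet files from a cloud storage path into a Delta table
# using Auto Loader. Paths and the table name are illustrative placeholders.
(spark.readStream
   .format("cloudFiles")
   .option("cloudFiles.format", "parquet")
   .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events/_schema")
   .load("/mnt/datalake/events/")
 .writeStream
   .option("checkpointLocation", "/mnt/checkpoints/events/")
   .trigger(availableNow=True)   # process all available files, then stop
   .toTable("my_catalog.my_schema.events_delta"))
```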
You can use the COPY INTO SQL command to incrementally copy all data from a given cloud storage directory to a target Delta table. Per Table 1, its characteristics match those of Auto Loader: it handles both complete overwrites and incremental appends at the source, can optionally backfill data, duplicates the data, does not preserve the existing data structure, and requires some configuration. A sketch follows.
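A minimal sketch, again with a placeholder target table and source path; files that have already been loaded are skipped on subsequent runs:

```python
# Idempotently load Parquet files from cloud storage into an existing Delta table.
spark.sql("""
  COPY INTO my_catalog.my_schema.events_delta
  FROM '/mnt/datalake/events/'
  FILEFORMAT = PARQUET
  COPY_OPTIONS ('mergeSchema' = 'true')
""")
```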
Both Auto Loader and COPY INTO give users plenty of options to configure the data movement process. Refer to this link when you need to decide between COPY INTO and Auto Loader.
Finally, you can use custom Apache Spark logic to migrate to Delta Lake. It provides great flexibility in controlling how and when different data from your source system is migrated, but might require extensive configuration and customization to provide capabilities already built into the other methodologies discussed here.
To perform backfills or incremental migration, you might be able to rely on the partitioning structure of your data source, but might also need to write custom logic to track which files have been added since you last loaded data from the source. While you can use Delta Lake merge capabilities to avoid writing duplicate records, comparing all records from a large Parquet source table to the contents of a large Delta table is a complex and computationally expensive task.
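As an illustration of the custom approach (the table names, `event_id` key column, and source path are hypothetical), a batch Spark job could read newly arrived Parquet files and merge them into an existing Delta table to avoid writing duplicate records:

```python
from delta.tables import DeltaTable

# Read a batch of source Parquet data; in practice you would restrict this to
# new partitions or files tracked by your own bookkeeping.
updates_df = spark.read.parquet("/mnt/datalake/events/event_date=2023-06-01/")

# Upsert into the target Delta table, keyed on a hypothetical unique id column.
target = DeltaTable.forName(spark, "my_catalog.my_schema.events_delta")
(target.alias("t")
   .merge(updates_df.alias("s"), "t.event_id = s.event_id")
   .whenMatchedUpdateAll()
   .whenNotMatchedInsertAll()
   .execute())
```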
Refer to this link for more information on the methodologies for migrating your Apache Parquet data lake to Delta Lake.
In this blog, we have described various options for migrating your Apache Parquet data lake to Delta Lake and discussed how to determine the right methodology based on your requirements. To learn more about the Apache Parquet to Delta Lake migration and how to get started, please visit the guides (AWS, Azure, GCP). In these notebooks we have provided a few examples for you to get started and try different migration options. We also recommend following optimization best practices on Databricks after you migrate to Delta Lake.