What are the challenges with data lakes?

Data lakes provide a complete and authoritative data store that can power data analytics, business intelligence and machine learning.

Challenge #1: Data reliability

Without the proper tools in place, data lakes can suffer from reliability issues that make it difficult for data scientists and analysts to reason about the data. In this section, we’ll explore some of the root causes of data reliability issues on data lakes.

Reprocessing data due to broken pipelines

With traditional data lakes, the need to continuously reprocess missing or corrupted data can become a major problem. This often happens when a job is writing data into the data lake and a hardware or software failure prevents the write from completing. In this scenario, data engineers must spend time and energy deleting any corrupted data, checking the remainder of the data for correctness, and setting up a new write job to fill any holes in the data.

Delta Lake solves the issue of reprocessing by making your data lake transactional, which means that every operation performed on it is atomic: it will either succeed completely or fail completely. Because a failed write leaves no partial results behind, the state of your data lake stays clean. As a result, data scientists don’t have to spend time tediously reprocessing the data due to partially failed writes. Instead, they can devote that time to finding insights in the data and building machine learning models to drive better business outcomes.
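
To make the all-or-nothing property concrete, here is a minimal sketch in plain Python. It is not Delta Lake's mechanism; it only illustrates atomicity on a single local file by staging the output and swapping it into place in one step. The function name `atomic_write_jsonl` and the file name `events.jsonl` are made up for the illustration.

```python
import json
import os
import tempfile


def atomic_write_jsonl(path, records):
    """Replace path with the given records; readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Stage the output in a temporary file on the same filesystem as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            for record in records:
                tmp.write(json.dumps(record) + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the staged file in as a single atomic step.
        os.replace(tmp_path, path)
    except BaseException:
        # The write failed part-way: discard the staged file, leave the target untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Usage: the first write succeeds, the second fails half-way through.
workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "events.jsonl")
atomic_write_jsonl(target, [{"id": 1}])
try:
    # object() is not JSON-serializable, so this job fails on its second record.
    atomic_write_jsonl(target, [{"id": 2}, {"id": object()}])
except TypeError:
    pass  # the failed job left no partial output behind

with open(target) as f:
    contents = f.read()
```

After the failed second job, `target` still holds exactly what the first job wrote, so there is nothing to clean up or reprocess.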

Data validation and quality enforcement

When thinking about data applications, as opposed to software applications, data validation is vital: without it, there is no way to gauge whether something in your data is broken or inaccurate, which ultimately leads to poor reliability. With traditional software applications, it’s easy to know when something is wrong; you can see that a button on your website isn’t in the right place, for example. With data applications, however, data quality problems can easily go undetected. Edge cases, corrupted data, or improper data types can surface at critical times and break your data pipeline. Worse yet, errors that never break anything can silently skew your data, causing you to make poor business decisions.

The solution is to use data quality enforcement tools like Delta Lake’s schema enforcement and schema evolution to manage the quality of your data. These tools, alongside Delta Lake’s ACID transactions, help ensure data reliability and make it possible to have complete confidence in your data, even as it evolves and changes throughout its lifecycle.
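
The idea behind schema enforcement can be sketched in a few lines of plain Python: check every incoming record against an expected schema and reject the whole batch if any record does not match. This is an illustration of the concept only, not Delta Lake's API; `EXPECTED_SCHEMA`, `SchemaMismatchError` and `append_batch` are hypothetical names.

```python
# Hypothetical schema for the illustration: column name -> required Python type.
EXPECTED_SCHEMA = {"user_id": int, "event": str, "amount": float}


class SchemaMismatchError(ValueError):
    """Raised when a record does not match the expected schema."""


def validate_record(record, schema):
    missing = schema.keys() - record.keys()
    extra = record.keys() - schema.keys()
    if missing or extra:
        raise SchemaMismatchError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for column, expected_type in schema.items():
        # Exact type match: an int is not accepted for a float column,
        # and a bool is not accepted for an int column.
        if type(record[column]) is not expected_type:
            raise SchemaMismatchError(
                f"column {column!r}: expected {expected_type.__name__}, "
                f"got {type(record[column]).__name__}"
            )


def append_batch(table, batch, schema):
    """Validate the whole batch first, then append; one bad record rejects the batch."""
    for record in batch:
        validate_record(record, schema)
    table.extend(batch)


# Usage: a clean batch is accepted, a batch with one bad record is rejected whole.
table = []
append_batch(table, [{"user_id": 1, "event": "click", "amount": 9.5}], EXPECTED_SCHEMA)

rejected = False
try:
    append_batch(
        table,
        [
            {"user_id": 2, "event": "buy", "amount": 1.0},
            {"user_id": "3", "event": "buy", "amount": 2.0},  # user_id has the wrong type
        ],
        EXPECTED_SCHEMA,
    )
except SchemaMismatchError:
    rejected = True
```

Validating before appending means the bad batch fails loudly at write time instead of surfacing later as skewed results.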

Combining batch and streaming data

With the increasing amount of data that is collected in real time, data lakes need the ability to easily capture and combine streaming data with historical, batch data so that they can remain updated at all times. Traditionally, many systems architects have turned to a lambda architecture to solve this problem, but lambda architectures require two separate code bases (one for batch and one for streaming), and are difficult to build and maintain.

With Delta Lake, every table can easily integrate these types of data, serving as both a batch and streaming source and sink. Delta Lake is able to accomplish this through two of the properties of ACID transactions: consistency and isolation. These properties ensure that every viewer sees a consistent view of the data, even when multiple users are modifying the table and new data is streaming into it at the same time.
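
One way to picture how readers can keep a consistent view while writers commit is a table made of immutable, numbered versions. The toy in-memory model below is an assumption made for illustration, not a description of Delta Lake's internals; `VersionedTable` and its methods are hypothetical.

```python
import threading


class VersionedTable:
    """Toy table in which every commit creates a new immutable snapshot."""

    def __init__(self):
        self._versions = [()]          # version 0 is an empty snapshot
        self._lock = threading.Lock()  # serializes writers

    def append(self, rows):
        """Commit rows (from a batch job or a streaming micro-batch) as a new version."""
        with self._lock:
            self._versions.append(self._versions[-1] + tuple(rows))
            return len(self._versions) - 1

    def latest_version(self):
        with self._lock:
            return len(self._versions) - 1

    def read(self, version):
        """Return the rows of a committed version; later commits never alter it."""
        return self._versions[version]


# Usage: a reader pins a version, then a streaming write arrives.
events = VersionedTable()
events.append(["batch-1", "batch-2"])   # historical batch load
pinned = events.latest_version()        # a long-running query starts here
events.append(["stream-1"])             # new data streams in meanwhile

pinned_rows = events.read(pinned)
latest_rows = events.read(events.latest_version())
```

The long-running reader keeps seeing the snapshot it started with, while a new reader picks up the streamed rows from the latest version.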

Bulk updates, merges and deletes

Data lakes can hold a tremendous amount of data, and companies need ways to reliably perform update, merge and delete operations on that data so that it can remain up to date at all times. With traditional data lakes, it can be incredibly difficult to perform simple operations like these, and to confirm that they occurred successfully, because there is no mechanism to ensure data consistency. Without such a mechanism, it becomes difficult for data scientists to reason about their data.

One common way that updates, merges and deletes on data lakes become a pain point for companies is in relation to data regulations like the CCPA and GDPR. Under these regulations, companies are obligated to delete all of a customer’s information upon their request. With a traditional data lake, there are two challenges with fulfilling this request. Companies need to be able to:

Query all the data in the data lake using SQL
Delete any data relevant to that customer on a row-by-row basis, something that traditional analytics engines are not equipped to do

Delta Lake solves this issue by enabling data analysts to easily query all the data in their data lake using SQL. Then, analysts can perform updates, merges or deletes on the data with a single command, owing to Delta Lake’s ACID transactions.

Challenge #2: Query performance

Query performance is a key driver of user satisfaction for data lake analytics tools. For users who perform interactive, exploratory data analysis using SQL, quick responses to common queries are essential.

Data lakes can hold millions of files and tables, so it’s important that your data lake query engine is optimized for performance at scale. Some of the major performance bottlenecks that can occur with data lakes are discussed below.

Small files

Having a large number of small files in a data lake (rather than larger files optimized for analytics) can slow down performance considerably due to limitations with I/O throughput. Delta Lake uses small file compaction to consolidate small files into larger ones that are optimized for read access.
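
The compaction idea can be sketched with plain files. The example below assumes line-oriented files that can simply be concatenated, which is not true of columnar formats such as Parquet, so treat it as an illustration of the concept and not of how Delta Lake compacts data. `compact_files` and the `part-*.dat` names are hypothetical.

```python
import os
import tempfile
import uuid


def _write_chunk(directory, chunk):
    # A unique name avoids colliding with files left by an earlier compaction.
    out_path = os.path.join(directory, f"compacted-{uuid.uuid4().hex}.dat")
    with open(out_path, "wb") as out:
        out.write(b"".join(chunk))
    return out_path


def compact_files(directory, target_bytes):
    """Merge files smaller than target_bytes into files of roughly that size."""
    paths = sorted(os.path.join(directory, name) for name in os.listdir(directory))
    small = [p for p in paths if os.path.getsize(p) < target_bytes]
    if len(small) < 2:
        return []  # nothing worth merging
    outputs, chunk, chunk_size = [], [], 0
    for path in small:
        with open(path, "rb") as f:
            data = f.read()
        chunk.append(data)
        chunk_size += len(data)
        if chunk_size >= target_bytes:
            outputs.append(_write_chunk(directory, chunk))
            chunk, chunk_size = [], 0
    if chunk:
        outputs.append(_write_chunk(directory, chunk))
    # Remove the small originals only after all merged files are written.
    for path in small:
        os.remove(path)
    return outputs


# Usage: ten tiny files become one larger file.
lake_dir = tempfile.mkdtemp()
for i in range(10):
    with open(os.path.join(lake_dir, f"part-{i}.dat"), "wb") as f:
        f.write(f"row-{i}\n".encode())

compacted = compact_files(lake_dir, target_bytes=1024)
with open(compacted[0], "rb") as f:
    merged = f.read()
```

A reader now opens one file instead of ten, which is where the I/O saving comes from.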

Unnecessary reads from disk

Repeatedly accessing data from storage can slow query performance significantly. Delta Lake uses caching to selectively hold important tables in memory, so that they can be recalled more quickly. It also uses data skipping, which avoids processing data that is not relevant to a given query, to increase read throughput by up to 15x.
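
A common way to implement data skipping is to keep minimum and maximum values per file and to read only the files whose range can overlap the query. The sketch below assumes that approach and uses in-memory stand-ins for files; `FileStats`, `files_to_scan` and `query_range` are hypothetical names, not Delta Lake APIs.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    min_value: int   # smallest value of the filter column in this file
    max_value: int   # largest value of the filter column in this file
    rows: list       # stand-in for the file's contents


def files_to_scan(files, lower, upper):
    """Keep only files whose [min, max] range overlaps the query range [lower, upper]."""
    return [f for f in files if f.max_value >= lower and f.min_value <= upper]


def query_range(files, lower, upper):
    """Return matching rows plus the number of files that actually had to be read."""
    scanned = files_to_scan(files, lower, upper)
    rows = [r for f in scanned for r in f.rows if lower <= r <= upper]
    return rows, len(scanned)


# Usage: three files of 100 values each; the query touches only one of them.
files = [
    FileStats("part-0", 1, 100, list(range(1, 101))),
    FileStats("part-1", 101, 200, list(range(101, 201))),
    FileStats("part-2", 201, 300, list(range(201, 301))),
]
rows, files_read = query_range(files, 150, 160)
```

Here two of the three files are skipped on the strength of their statistics alone, without being opened.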

Deleted files

On modern data lakes that use cloud storage, files that are “deleted” can actually remain in the data lake for up to 30 days, creating unnecessary overhead that slows query performance. Delta Lake offers the VACUUM command to permanently delete files that are no longer needed.

Data indexing and partitioning

For good query performance, the data lake should be indexed and partitioned along the dimensions by which it is most likely to be grouped. Delta Lake can create and maintain indices and partitions that are optimized for analytics.

Metadata management

Data lakes that grow to become multiple petabytes or more can become bottlenecked not by the data itself, but by the metadata that accompanies it. Delta Lake uses Spark to offer scalable metadata management that distributes its processing just like the data itself.

Challenge #3: Governance

Data lakes have traditionally been very hard to secure properly and to bring in line with governance requirements. Laws such as GDPR and CCPA require that companies be able to delete all data related to a customer if that customer requests it. Deleting or updating data in a regular Parquet data lake is compute-intensive and sometimes nearly impossible. All the files that pertain to the personal data being requested must be identified, ingested, filtered and written out as new files, and the original files must then be deleted. This must be done in a way that does not disrupt or corrupt queries on the table. Without easy ways to delete data, organizations are highly constrained by these regulations (and often fined by regulatory bodies).
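
The manual rewrite procedure described above can be sketched as follows, using JSON-lines files as a stand-in for Parquet. The function `delete_customer`, the `customer_id` field and the file names are assumptions made for the illustration. Note that nothing here is atomic: a query running during the rewrite could see both an original file and its replacement, which is exactly the difficulty the paragraph describes.

```python
import json
import os
import tempfile
import uuid


def delete_customer(directory, customer_id):
    """Rewrite every file holding the customer's rows, then delete the originals."""
    rewritten = 0
    for name in sorted(os.listdir(directory)):  # listing is taken once, up front
        path = os.path.join(directory, name)
        with open(path) as f:
            rows = [json.loads(line) for line in f if line.strip()]
        kept = [r for r in rows if r["customer_id"] != customer_id]
        if len(kept) == len(rows):
            continue  # file holds no data for this customer; leave it alone
        if kept:
            new_path = os.path.join(directory, f"part-{uuid.uuid4().hex}.jsonl")
            with open(new_path, "w") as out:
                for row in kept:
                    out.write(json.dumps(row) + "\n")
        os.remove(path)
        rewritten += 1
    return rewritten


# Usage: customer 1 appears in two of three files.
lake = tempfile.mkdtemp()
sample = {
    "part-a.jsonl": [{"customer_id": 1, "item": "book"}, {"customer_id": 2, "item": "pen"}],
    "part-b.jsonl": [{"customer_id": 2, "item": "ink"}],
    "part-c.jsonl": [{"customer_id": 1, "item": "lamp"}],
}
for name, rows in sample.items():
    with open(os.path.join(lake, name), "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

files_rewritten = delete_customer(lake, customer_id=1)

remaining = []
for name in os.listdir(lake):
    with open(os.path.join(lake, name)) as f:
        remaining.extend(json.loads(line) for line in f)
remaining.sort(key=lambda row: row["item"])
```

Even in this small example, removing one customer means reading every file and rewriting two of them, which is why the operation becomes expensive at data lake scale.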

Data lakes also make it challenging to keep historical versions of data at a reasonable cost, because they require manual snapshots to be put in place and all those snapshots to be stored.