In this first of two blogs, we want to talk about WHY an organization might want to look at a lakehouse architecture (based on Delta Lake) for its data analytics pipelines instead of the standard pattern of lifting and shifting its Enterprise Data Warehouse (EDW) from on-premises into the cloud. We will shortly follow this post with a second, more detailed blog on HOW to make just such a transition.
Enterprise Data Warehouses have stood the test of time and deliver massive business value. Organizations need to be data driven and want to glean insights from their data, and EDWs are proven workhorses for doing just that.
However, over time, several issues have been identified with the EDW architecture. Broadly speaking, these issues can be attributed to the four characteristics of big data, commonly known as the “4 Vs”: volume, velocity, variety, and veracity, all of which are problematic for legacy architectures. The following reasons further illustrate the limitations of an EDW-based architecture:
Figure 1: Typical flow in the EDW world and its limitations
As a result of these challenges, the requirements became clear for the EDW community:
Simply put, the architecture must support any velocity, variety, and volume of data, and enable business intelligence and production-grade data science at optimal cost.
Now, if we start talking about a cloud data lake architecture, the one major thing it brings to the table is extremely cheap storage. With Azure Blob Storage, Azure Data Lake Storage Gen2, or AWS S3, you can store terabyte-scale data for a few dollars. This frees the organization from being beholden to analytics apparatus where disk storage costs are many multiples of that. But this only happens if the organization takes advantage of separating compute from storage, meaning the data must persist separately from your compute infrastructure. In concrete terms, on AWS your data would reside on S3 (or ADLS Gen2/Blob on Azure) while your compute would spin up as and when required.
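To make the pattern concrete, here is a minimal PySpark sketch of an ephemeral job that reads from object storage, computes a result, and writes it back before the cluster is shut down. The bucket names and paths are hypothetical placeholders, and the sketch assumes the cluster is already configured with S3 credentials (as it would be on Databricks).

```python
# Minimal sketch: data lives on cheap object storage; compute is ephemeral.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ephemeral-etl").getOrCreate()

# Data persists on S3, independent of any cluster (hypothetical path).
raw = spark.read.json("s3://my-company-landing/events/2021/05/")

# Compute runs only for the duration of this job ...
daily_counts = (
    raw.filter(F.col("event_type").isNotNull())
       .groupBy("event_date", "event_type")
       .count()
)

# ... and writes results back to object storage before the cluster goes away.
daily_counts.write.mode("overwrite").parquet("s3://my-company-curated/daily_counts/")

spark.stop()  # the cluster can now be terminated; the data remains on S3
```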
With that in mind, let us take a look at a modern cloud data lake architecture.
For this curated data lake, we want to focus on what an organization has to think about when building this layer to avoid the pitfalls of the data lakes of yesteryear, where there was a strong notion of “garbage in, garbage out.” One of the key reasons for that was the reliability of data: data could land with the wrong schema, could be corrupted, and so on, and it would simply get ingested into the data lake. Only later, when the data was queried, did the problems really surface. So reliability is a major requirement to think about.
Another one that matters, of course, is performance. We could play a lot of tricks to make the data reliable, but it is no good if a simple query takes forever to return.
Yet another consideration is that, as an organization, you might start to think about data in levels of curation. You might have a raw tier, a refined tier, and a BI tier. Generally, the raw tier holds your incoming data as-is, the refined tier imposes schema enforcement and reliability checks, and the BI tier holds clean, aggregated data ready for executive dashboards. We also need a simple process for moving data between these tiers, as sketched below.
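Here is a minimal sketch of what those tiers might look like as paths on object storage, with one batch hop from raw to refined and another from refined to BI. The dataset, paths, and schema are hypothetical, and the file format used here is deliberately generic; a format better suited to this layer is discussed later in this post.

```python
# Minimal sketch of a three-tier (raw / refined / BI) layout on object storage.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("tiered-curation").getOrCreate()

RAW_PATH     = "s3://my-lake/raw/orders/"        # incoming data, as landed
REFINED_PATH = "s3://my-lake/refined/orders/"    # schema-enforced, validated
BI_PATH      = "s3://my-lake/bi/orders_daily/"   # aggregated, dashboard-ready

# The refined tier imposes an explicit schema instead of trusting whatever landed.
order_schema = StructType([
    StructField("order_id",   StringType(),    False),
    StructField("customer",   StringType(),    True),
    StructField("amount",     DoubleType(),    True),
    StructField("ordered_at", TimestampType(), True),
])

raw_orders = spark.read.schema(order_schema).json(RAW_PATH)

# Basic reliability checks before promoting data to the refined tier.
refined_orders = raw_orders.filter("order_id IS NOT NULL AND amount >= 0")
refined_orders.write.mode("append").parquet(REFINED_PATH)

# BI tier: clean, aggregated data ready for dashboards.
daily_revenue = (
    refined_orders.groupBy(F.to_date("ordered_at").alias("order_date"))
                  .agg(F.sum("amount").alias("daily_revenue"))
)
daily_revenue.write.mode("overwrite").parquet(BI_PATH)
```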
We also want to keep compute and storage separate, because in the cloud, compute costs can weigh heavily on the organization. You want to store data on the object store, giving you a cheap, persistent layer, and bring your compute to the data only for as long as you need it before turning it off. For example, bring up a very large cluster to perform ETL against your data for a few minutes and shut it down when the process is done. On the query side, you can keep ALL of your data going back decades on S3 and bring up a small cluster when you only need to query the last few years. This flexibility is of paramount importance. What it really implies is that the reliability and performance we are talking about have to be inherent properties of how the data is stored.
Figure 3: A Cloud Curated Data Lake architecture
So, say we have a data format for this curated data lake layer that gives us inherent reliability and performance properties, with the data staying completely under the organization’s control; you then need a query engine that can access this format. We think the choice here, at least for now, is Apache Spark. Apache Spark is battle tested and supports ETL, streaming, SQL, and ML workloads.
From a Databricks perspective, this data format is Delta Lake. Delta Lake is an open source format maintained by the Linux Foundation. There are others you will hear about as well, such as Apache Hudi and Apache Iceberg, which are also trying to solve for the reliability property required on the data lake. The big difference, however, is that at this point Delta Lake processes 2.5 exabytes per month. It is a battle-tested data format for the cloud data lake among Fortune 500 companies and is being leveraged across verticals from financial services to ad tech, automotive, and the public sector.
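For readers who have not seen Delta Lake in code, here is a minimal sketch of writing and reading a Delta table with Spark. It assumes a Databricks runtime, or an open source Spark session configured with the delta-spark package; the dataset and paths are hypothetical placeholders.

```python
# Minimal sketch: Delta Lake as the storage format for the curated lake.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-example").getOrCreate()

clicks = spark.read.json("s3://my-lake/raw/clicks/")

# Writing in Delta format gives ACID transactions and schema enforcement:
# an append with a mismatched schema fails instead of silently corrupting the table.
clicks.write.format("delta").mode("append").save("s3://my-lake/refined/clicks/")

# Reads see a consistent snapshot, even while other jobs are appending.
refined = spark.read.format("delta").load("s3://my-lake/refined/clicks/")
refined.createOrReplaceTempView("clicks")
spark.sql("SELECT page, COUNT(*) AS views FROM clicks GROUP BY page").show()
```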
Delta Lake coupled with Spark gives you the capability to move easily between the data lake curation stages. In fact, you could incrementally ingest incoming data into the raw tier and be assured it moves through the transformation stages all the way to the BI tier with ACID guarantees, as sketched below.
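The following sketch shows one way to express that incremental, multi-hop movement with Spark Structured Streaming and Delta: new files landing in the raw tier are continuously validated into a refined Delta table, which in turn feeds a continuously maintained BI aggregate. The dataset (IoT-style readings), schema, and paths are hypothetical.

```python
# Minimal sketch: incremental movement through raw -> refined -> BI tiers.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("incremental-tiers").getOrCreate()

reading_schema = StructType([
    StructField("device_id",   StringType(),    False),
    StructField("temperature", DoubleType(),    True),
    StructField("read_at",     TimestampType(), True),
])

# Raw tier: incrementally pick up new files as they land.
raw_stream = spark.readStream.schema(reading_schema).json("s3://my-lake/raw/readings/")

# Refined tier: enforce basic quality rules; each micro-batch commits atomically.
refined_stream = raw_stream.filter("device_id IS NOT NULL AND temperature IS NOT NULL")

(refined_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://my-lake/_checkpoints/refined_readings/")
    .start("s3://my-lake/refined/readings/"))

# BI tier: continuously maintained daily averages, ready for dashboards.
bi_stream = (
    spark.readStream.format("delta")
         .load("s3://my-lake/refined/readings/")
         .groupBy(F.to_date("read_at").alias("read_date"), "device_id")
         .agg(F.avg("temperature").alias("avg_temperature"))
)

(bi_stream.writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "s3://my-lake/_checkpoints/readings_daily/")
    .start("s3://my-lake/bi/readings_daily/"))

spark.streams.awaitAnyTermination()
```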
We at Databricks realize that this is the vision a lot of organizations are looking to implement. So, when you look at Databricks as a Unified Data Analytics Platform, what you see is:
Figure 5: A Databricks-centric curated cloud data lake solution
We will follow this blog on WHY you should consider a data lake as you modernize in the cloud with a blog on HOW to do it, focusing on the specific aspects to think about and know as you orient yourself from a traditional data warehouse to a data lake.