With the fast-moving evolution of the data lake, Billy Bosworth and Ali Ghodsi share their thoughts on the top five questions they are asked about data warehouses, data lakes and lakehouses. Coming from different backgrounds, they each provide unique and valuable insights into this market. Ali has spent more than 10 years at the forefront of research into distributed data management systems; is an adjunct professor at UC Berkeley; and is the co-founder and now CEO of Databricks. Billy has spent 30 years in the world of data as a developer, database administrator and author; has served as CEO and senior executive at software companies specializing in databases; has served on public company boards; and is currently the CEO of Dremio.
What went wrong with Data Lakes?
Let’s start with one good thing before we get to the problems. They enabled enterprises to capture all their data – video/audio/logs – not just the relational data, and they did so in a cheap and open way. Today, thanks to this, the vast majority of the data, especially in the cloud, is in data lakes. Because they’re based on open formats and standards (e.g. Parquet and ORC), there is also a vast ecosystem of tools, often open source (e.g. TensorFlow, PyTorch), which can directly operate on these data lakes. But at some point, collecting data for the sake of collecting it is not useful. Nobody cares how many petabytes you’ve collected; they care what you have done for the business. What business value did you provide?
It turned out it was hard to provide business value because the data lakes often became data swamps. This was primarily due to three factors. First, it was hard to guarantee that the quality of the data was good because data was just dumped into the lake. Second, it was hard to govern because it’s a file store, and reasoning about data security is hard if the only thing you see is files. Third, it was hard to get performance because the data layout might not be organized for performance, e.g. millions of tiny comma-separated values (CSV) files.
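To make the third factor concrete, here is a minimal sketch of compacting many tiny CSV files into one larger file, using only the Python standard library. The function names (compact_csv_files, demo), the file names and the sample columns are assumptions made for illustration; they do not come from the interview and do not describe how any particular lakehouse product handles compaction.

```python
import csv
import tempfile
from pathlib import Path


def compact_csv_files(src_dir: Path, dest_path: Path) -> int:
    """Merge every *.csv file in src_dir into dest_path; return data rows written.

    Hypothetical helper: assumes all files share one header row and that
    dest_path lives outside src_dir.
    """
    rows_written = 0
    expected_header = None
    with dest_path.open("w", newline="") as out:
        writer = csv.writer(out)
        for path in sorted(src_dir.glob("*.csv")):
            with path.open(newline="") as src:
                reader = csv.reader(src)
                header = next(reader, None)
                if header is None:
                    continue  # skip completely empty files
                if expected_header is None:
                    expected_header = header
                    writer.writerow(header)  # write the header exactly once
                elif header != expected_header:
                    raise ValueError(f"schema mismatch in {path.name}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


def demo() -> int:
    """Create five one-row files in a temp directory and compact them."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "raw"
        src.mkdir()
        for i in range(5):
            (src / f"part-{i}.csv").write_text(f"id,value\n{i},{i * 10}\n")
        return compact_csv_files(src, Path(tmp) / "compacted.csv")


if __name__ == "__main__":
    print(demo())
```

The header check is a small nod to the first factor as well: a file whose columns differ from the rest is rejected instead of being silently merged.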
All technologies evolve, so rather than think about “what went wrong” I think it’s more useful to understand what the first iterations were like. First, there was a high correlation between the words “data lake” and “Hadoop.” This was an understandable association, but the capabilities now available in data lake architectures are much more advanced and easier to use than anything we saw in the on-prem Hadoop ecosystem. Second, data lakes became more like swamps where data just sat and accumulated without delivering real insight to the business. I think this happened because of overly complex on-premises ecosystems that lacked the right technology to let data consumers quickly and seamlessly get the insights they needed directly from the data in the lake. Finally, like any new technology, the data lake lacked some of the mature aspects of databases, such as robust governance and security. A lot has changed, especially in the past couple of years, but those seem to be some of the common early issues.
What do you see as the biggest changes in the last several years to overcome some of those challenges?
A de facto upstream architecture decision is what really got the ball rolling. In the past few years, application developers simply took the easiest path to storing their large datasets, which was to dump them in cloud storage. Cheap, infinitely scalable and extremely easy to use, cloud storage became the default choice for people to land their cloud-scale data coming out of web and IoT applications. That massive accumulation of data pushed the innovation that was necessary to access the data directly where it lived versus trying to keep up with copies to traditional databases. Today, we have a rich set of capabilities that deliver things previously only possible in relational data warehouses.
The big technological breakthrough came around 2017 when three projects simultaneously enabled building warehousing-like capabilities directly on the data lake: Delta Lake, Hudi and Iceberg. They brought structure, reliability, and performance to these massive datasets sitting in data lakes. It started with enabling ACID transactions, but soon went beyond that with performance, indexing, security, etc. This breakthrough was so profound that it was published in the top academic conferences (VLDB, CIDR, etc.).
Why use another new term, “Lakehouse” to describe data lakes?
Because they’re so radically different from data lakes that it warrants a different term. Data lakes tend to become data swamps for the three reasons I mentioned earlier, so we don’t want to encourage more of that, as it’s not good for enterprises. The new term also gives us the opportunity to guide these enterprises to land a data strategy that can provide much more business value rather than repeating the mistakes of the past.
If you look at something like Werner Vogels’ blog post from Jan 2020 highlighting the tremendous advantages and capabilities of an open data lake architecture, you see a giant evolution from how data lakes were perceived even just a few years ago. This is mostly true for data analytics use cases that were thought to be possible only in a data warehouse. Therefore, the term “Lakehouse” brings a new connotation to the current world of open data architectures, allowing for fresh association with rich data analytics capabilities. When underlying technologies evolve dramatically, new names are often created to represent new capabilities. That is what I think we see happening with the term “Lakehouse.”
Why consider Lakehouses at all? Why not just continue to use data warehouses?
The data problems of today are not just a little different from those of the past; they are radically, categorically different. Among the many issues with data warehouses is time. Not the time it takes them to run a query, but the time it takes data teams to get the data into and out of the data warehouse using a labyrinth of ETL jobs. This highly complex chain of data movement and copying introduces onerous change management (a “simple” change to a dashboard is anything but simple), adds data governance risks and ultimately decreases the scope of data available for analytics because subsets tend to get created with each copy.
Often I hear people talk about the “simplicity” of a data warehouse. Zoom out just a tiny bit and you will always find a dizzying web of interconnected data copy and movement jobs. That is not simple. So the question is, why go through all that copying and moving if you don’t have to? In a Lakehouse, the design principle is that once the data hits data lake storage, that’s where it stays. And the data is already hitting data lake storage, even before the analytics team has anything to say about it. Why? Because as I said earlier, developers now use it as the de facto destination for their data exhaust. So once it’s there, why move it anywhere else? With a Lakehouse, you don’t have to.
The most important reason has to do with machine learning and AI, which is very strategic for most enterprises. Data warehouses don’t have support for sparse data sets that ML/AI uses, such as video, audio and arbitrary text. Furthermore, the only way to communicate with them is through SQL, which is amazing for many purposes, but not so much for ML/AI. Today, a vast open ecosystem of software is built on Python, for which SQL is not adequate. Finally, the vast majority of the data today is stored in data lakes, so migrating all of that into a data warehouse is nearly impossible and cost-prohibitive.
Other than eliminating data copies, what do you personally consider to be the biggest advantages of a Lakehouse?
The direct support for ML/AI. This is where the puck is going. Google would not be around today if it weren’t for AI or ML. The same is true for Facebook, Twitter, Uber, etc. Software is eating the world, but AI will eat all software. Lakehouses can support these workloads natively. If I can mention more than one advantage, I would say that there are already massive datasets in data lakes, and the Lakehouse paradigm enables making use of that data. In short, it lets you clean up your data swamp.
I’ve spent my entire career working with databases, and almost all of it on the operational side. As I recently moved more into the world of data analytics, frankly, I felt like I was in a time machine when I saw the data warehouse model still being used. On the operational side of the world, architectures have long since moved from big and monolithic to services-based. The adoption of these services-based architectures is so complete that it hardly bears mentioning. And yet, when you look at a data warehouse-centric architecture, it’s like looking at an application architecture from 2000. All the advantages of services-based architectures apply to the analytics world just as much as they do to the operational world. A Lakehouse is designed to make your data accessible to any number of services you wish, all in open formats. That is really key for today and the future. Modular, best-of-breed, services-based architectures have proven to be superior for operational workloads. Lakehouse architectures allow the analytics world to quickly catch up.
Does implementing a Lakehouse mean “ripping and replacing” the data warehouse?
Perhaps the best thing about implementing a Lakehouse architecture is that your application teams have likely already started the journey. Companies have datasets already available that make it easy to get started implementing a Lakehouse architecture. Unwinding things from the data warehouse is not necessary. The most successful customer implementations we see are ones that start with a single use case, successfully implement it, then ask “what other use cases should we implement directly on the Lakehouse instead of copying data into the data warehouse?”
No, it does not. We haven’t seen anyone do it that way. Rather, the data warehouse becomes a downstream application of the Lakehouse, just like many other things. Your raw data lands in the data lake. The Lakehouse enables you to curate it into refined datasets with schema and governance. Subsets of that can then be moved into data warehouses. This is how everyone starts, but as the use cases on the Lakehouse get more successful, almost all enterprises we have worked with end up moving more and more workloads directly to the Lakehouse.
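The flow described above (raw data lands, is curated into refined datasets with a schema, and a subset moves downstream) can be sketched in a few lines of plain Python. Everything here is an illustrative assumption: the SCHEMA fields, the curate and warehouse_subset functions and the sample records are hypothetical and are not taken from Databricks, Dremio or any specific Lakehouse implementation.

```python
from typing import Any, Callable

# Hypothetical schema for the refined dataset: field name -> casting function.
SCHEMA: dict[str, Callable[[Any], Any]] = {
    "user_id": int,
    "country": str,
    "amount": float,
}


def curate(raw_records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split raw records into schema-conforming rows and rejected rows."""
    refined, rejected = [], []
    for record in raw_records:
        try:
            # Keep only schema fields, cast to the declared types.
            refined.append({name: cast(record[name]) for name, cast in SCHEMA.items()})
        except (KeyError, TypeError, ValueError):
            rejected.append(record)  # missing field or value that cannot be cast
    return refined, rejected


def warehouse_subset(refined: list[dict], country: str) -> list[dict]:
    """Select the slice of refined data that a downstream warehouse would load."""
    return [row for row in refined if row["country"] == country]


# Raw data as it might land in the lake: strings everywhere, some rows broken.
RAW = [
    {"user_id": "1", "country": "US", "amount": "9.5"},
    {"user_id": "x", "country": "US", "amount": "1"},   # bad user_id
    {"user_id": "2", "country": "DE", "amount": "3"},
    {"country": "US"},                                  # missing fields
]

if __name__ == "__main__":
    refined, rejected = curate(RAW)
    print(len(refined), len(rejected))
    print(warehouse_subset(refined, "US"))
```

The point of the sketch is the direction of flow: the refined dataset is the source of truth, and the warehouse receives only a derived subset of it.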