Why Lakehouse Architecture Now?

May 26, 2021 03:15 PM (PT)

This talk will explore failures of the enterprise data warehouse paradigm which have created the need for the lakehouse paradigm. Starburst will also showcase how we power SQL-based interactive analytics on Delta Lake, the foundation for lakehouse architecture.

In this session watch:
Justin Borgman, Co-Founder & CEO, Starburst
Kamil Bajda-Pawlikowski, Co-Founder & CTO, Starburst
Daniel Abadi, PhD, Darnell-Kanall Professor of Computer Science and Chief Scientist, Starburst

 

Transcript

Justin Borgman: Thank you all for joining us. We're going to talk today about lakehouse architecture and why it's becoming increasingly prevalent in the market today. As far as the agenda, we're going to first discuss the need for a better approach to data management and analytics. We're going to talk about the benefits of a lakehouse approach. We're going to talk about why data sharing is becoming increasingly critical to enterprises and how that's evolved over time. And then we're going to talk about how Starburst specifically supports a lakehouse architecture, providing really fast SQL query results against data in your data lake. To introduce myself, my name is Justin Borgman, I'm co-founder and CEO of Starburst. I first got my start in the data analytics space a little over a decade ago with the founding of my first company, called Hadapt. I'm lucky to be joined here today by two of my co-founders from that company. We pioneered this notion of doing "SQL on Hadoop".
Again, we were very early to think about data lakes and doing data warehousing analytics within a data lake context. From there, Hadapt was acquired by Teradata in 2014, and I became a VP and GM at Teradata focused on emerging technologies, including both our Hadoop portfolio as well as exploring new avenues for doing data warehousing analytics going forward. And that's ultimately where we came up with the idea for creating Starburst, which we founded in 2017. A little bit about Starburst. Starburst is a distributed query engine that allows you to access data anywhere. It's built on an open source project called Trino, formerly known as Presto SQL. And what we provide is two platforms, one called Starburst Enterprise, and one called Galaxy. Galaxy is a cloud offering. Starburst Enterprise runs wherever you'd like it to run. And ultimately these platforms provide enterprise-grade, lightning-fast SQL analytics on data anywhere. Again, allowing you to query the data where it lives, which is very core to our value proposition.
As a company, we have a little over 150 customers today. We've raised about $164 million in venture capital, most recently from Andreessen Horowitz, joined by Index Ventures and Coatue, and we're absolutely obsessed with our customers. And fortunately they seem to like us as well. Now, the need for a better approach to data management and analytics. Let's talk about the landscape. Today's approach requires too much copying and moving of data. The business has a question, and a lot of preparation is necessary to ultimately get to the answer. We call that the "time to insight", the time from having a question to getting an answer. And ultimately, that involves multiple copies, ETL, moving data into data warehouses, cloud data warehouses, cloud data lakes, and on-prem data lakes. And at the end of the day, the business doesn't really care where the data lives. They simply want the answer to their question.
One of the approaches that’s existed now for decades is this notion of moving all of your data, creating copies, perhaps on a nightly or weekly basis, migrating it all into one central location, typically known as an enterprise data warehouse. This approach is attractive on its surface because it allows you to consolidate the data in a place where you can now analyze and understand your business holistically, but it has a lot of drawbacks. First and foremost, there’s the time required to actually extract data from all of those different data sources, transform it, and load it into this new technology. In addition, data warehousing systems are closed. They’re proprietary, and the data is stored in proprietary data formats. This creates vendor lock-in, as you become more and more beholden to your data warehousing provider and your data is effectively stuck in that environment. It also makes it very expensive and time consuming to move data around, to try to get data in and out of this data warehousing system.
The reality is most of your data is probably in a data lake already, because data lakes are very low-cost systems for storing massive amounts of data. They also allow you to store structured and unstructured data. So there are advantages to working with the data lake if you can. It's also particularly challenging to share data across data warehousing providers, because the data itself is generally proprietary, it requires you to extract data and move it around, and even popular new data sharing technologies are essentially proprietary networks where each of you has to be using the same type of data warehouse to be able to share data. This brings to bear the concept of a lakehouse architecture, allowing you to use BI reporting, data science, and machine learning, all working off of the same data in open data formats living within a data lake. The lakehouse architecture also provides one of my favorite concepts, which is this idea of optionality, allowing you as the customer to ultimately have control and ownership over your data.
You can do this by paying attention to really three core concepts. First of all, leave your data in the lowest-cost storage option that you can. That's going to be a data lake. It may be S3, it may be Azure Data Lake Storage, it might be Google Cloud Storage, or it might be S3-compatible object storage on-prem, like Ceph or MinIO or other technologies that allow you to store massive amounts of data in one place. It doesn't require data movement if the data is already there to begin with, and you can now bring data governance to this data lake to allow you to do the types of things that you would traditionally do in a data warehouse. The second thing is to use open data formats wherever you can. These are the long-lasting legacy of Hadoop: ORC files, Parquet files, and now Delta Lake allow you to work with multiple tools that access the same data.
You can do that very efficiently. These are columnar formats which give you very fast read performance. So there aren't really any trade-offs in using a data lake and querying that versus a traditional data warehouse. These file formats ultimately allow you to avoid vendor lock-in. And the third piece is creating a data consumption layer, raising this up a level of abstraction and allowing you to access data regardless of where it lives. This is a big part, of course, of what Starburst brings to the table, allowing you to combine data that lives in a data lake with data that lives in other data sources, be that a transactional system like Oracle or MongoDB, or even a streaming engine like Kafka. By creating an abstraction layer, using a query engine that allows you to query the data where it lives, you're ultimately going to be future-proofing your data architecture, because regardless of where that data lives down the road, you still have access to it. With that, I'm going to turn it over to my friend and colleague Daniel Abadi.
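
To make that data consumption layer concrete, here is a minimal sketch using the open-source Trino Python client. The hostname, the catalog names (a "lake" catalog over object storage and a "crm" catalog pointing at an operational database), and the table names are all hypothetical; they depend entirely on how a particular Starburst or Trino cluster is configured.

    import trino

    # Connect to an assumed Starburst/Trino coordinator.
    conn = trino.dbapi.connect(
        host="starburst.example.com",   # hypothetical coordinator
        port=8080,
        user="analyst",
        catalog="lake",                 # e.g. a Hive/Delta catalog over S3
        schema="sales",
    )
    cur = conn.cursor()

    # One query joins data in the lake with a table in an operational
    # database exposed as a second catalog -- no copy or ETL step needed.
    cur.execute("""
        SELECT c.region, sum(o.amount) AS revenue
        FROM lake.sales.orders o
        JOIN crm.public.customers c ON o.customer_id = c.id
        GROUP BY c.region
        ORDER BY revenue DESC
    """)
    for region, revenue in cur.fetchall():
        print(region, revenue)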

Daniel Abadi: Thanks, Justin. So Justin talked a little bit about the advantages of the data lakehouse architecture. And what I'm going to do now is talk in a little more detail about one of those advantages, which is its ability to facilitate data sharing. So I've been doing research on analytical database systems for just around two decades now. And for pretty much my entire career, the importance of data sharing has been known and has been highlighted. Even two decades ago, people were saying that enterprises need to have a full view of their business, they need to be able to understand everything, they need to incorporate different datasets into a particular analysis. So in some ways nothing has changed, but in some ways everything has changed. In the past few years, we've seen the machine learning revolution, the way that many businesses now are really trying to incorporate more machine learning into their data analysis.
And this is going to impact data sharing, because the truth of the matter is that machines are really, really bad at machine learning. Despite all the hype and everything that people talk about today, they're nowhere near what a human can do; our machines are well, well behind humans. The only advantage a machine has relative to a human is that machines can go through more data. They are able to access and read much more data than a human is able to go through. So the only way you can get a decent result from a machine learning algorithm is if you throw a bunch of data at the algorithm. And that really has increased the need for good data sharing solutions. As a result, we're seeing reports from analyst firms about this. On this slide, we're showing a recent report from Gartner arguing that companies have had to flip the way they think about data sharing.
It used to be that companies would say they worry about the risks of sharing data, both within the organization and across organizations. Now it's the opposite. Now companies have to worry about the risk of not sharing data. It's just so important that an analysis job has access to all the data it needs to return the correct result that there's now more risk in not sharing data than there is in sharing it. So that's why we at Starburst are really excited about the announcement at this conference of Delta Sharing. This is a beautiful new open platform for sharing data, both within the organization and across organizations. And it does it in a way that allows us to avoid copying or moving data. The data can be shared directly without needing to move it somewhere else, for example into a data warehouse.
Delta Sharing ensures security, auditing, governance, all the things that you need to make sharing successful. So let me go, just for a few minutes, a little more deeply into how Delta Sharing works. On this slide we're showing that there could be several different consumers performing data analysis tasks. Of course, one of them is Starburst. The way we interact with the Delta Sharing platform is that we communicate via the Delta Sharing protocol with a Delta Sharing server to navigate access permissions to the different datasets which are accessible by Starburst. And then after this, we make a request to the Delta Sharing server to read a particular dataset. For example, here we're showing that we're trying to access sales, and the Delta Sharing server responds back with a series of short-lived URLs which point to objects in S3 in the Delta Lake that can be accessed directly by the data consumer.
And that allows Starburst to incorporate these external datasets, both from within an organization and from different organizations, into a given analysis task, whether it's joining them with existing data or unioning them with existing data. Either way, the data can be incorporated into that analysis task, and the task is enriched with these additional datasets. So at the end of the day, what's great about Delta Sharing is that it's a very open sharing environment. The underlying format is Parquet, which is an open format, the whole thing is open source, and it allows the data consumer to communicate with the sharing platform using a standards-based approach that ensures governance and compliance. With that, I'll hand over to Kamil, who'll talk a little bit more about how Starburst supports the data lakehouse.
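
As an illustration of that flow, here is a minimal consumer-side sketch using the open-source delta-sharing Python client. The profile file is issued by the data provider, and the share, schema, and table names used here are hypothetical.

    import delta_sharing

    # Profile file from the data provider; it contains the Delta Sharing
    # server endpoint and a bearer token.
    profile = "provider_profile.share"   # hypothetical file name

    # Discover which shared tables this recipient is allowed to read.
    client = delta_sharing.SharingClient(profile)
    for table in client.list_all_tables():
        print(table.share, table.schema, table.name)

    # Read a shared table; under the hood the client asks the sharing server
    # for short-lived URLs to the underlying Parquet files and fetches them.
    df = delta_sharing.load_as_pandas(f"{profile}#retail_share.sales.orders")
    print(df.head())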

Kamil Bajda-Pawlikowski: Thank you, Dan. Just a quick note about me: I obviously share a lot of history with Dan and Justin. I started my journey with big data at Hadapt, which was a commercialization of a research project from Yale, and the business was later acquired by Teradata. Since then, I've been a co-founder and CTO at Starburst. In my part, I would like to discuss how Starburst supports the data lakehouse paradigm and customers leveraging this architecture. So first, why Delta Lake? We are excited about this new technology that was introduced by Databricks last year, because it really bridges the gap between the classic data lake and the modern data warehouse. It brings ACID properties for data manipulation, which was a huge gap in the original architecture of the data lake.
It is an open source table format, so unlike with data warehouses, you are not locked into a proprietary storage mechanism. Under the cover it's a bunch of Parquet files, so think columnar organization, compression, all the benefits you're used to from data warehouse technologies and a data lake together. You can deploy it anywhere you want, whether you have HDFS or object storage APIs, which is pretty much where people store big data today, especially in the cloud. Under the cover it also has the ability to support features like schema evolution and time travel, and it brings a bunch of performance benefits because it integrates metadata and statistics in this layer between the storage and the compute. That allows you to do additional data skipping, Z-ordering, and a bunch of other optimizations that come in handy when you want to run your data warehouse style workloads over this new table format.
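
Here is a minimal sketch of two of those features, time travel and schema evolution, using PySpark with the Delta Lake libraries on the classpath. The storage path and the columns are made up for illustration.

    from pyspark.sql import SparkSession

    # Assumes the delta-spark package and its jars are available.
    spark = (SparkSession.builder
             .appName("delta-features-sketch")
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    path = "s3://my-bucket/lake/events"   # hypothetical table location

    # Schema evolution: append a batch that adds a new column and let the
    # Delta transaction log merge the schema instead of rejecting the write.
    batch = spark.createDataFrame([(42, "click", "web")],
                                  ["user_id", "event_type", "channel"])
    (batch.write.format("delta")
          .mode("append")
          .option("mergeSchema", "true")
          .save(path))

    # Time travel: read an earlier snapshot recorded in the transaction log.
    current = spark.read.format("delta").load(path)
    as_of_v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
    print(current.count(), as_of_v0.count())
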
So why Delta and the data lakehouse? Again, it's an open architecture, it's meant for high performance, and it minimizes the complexity of data movement and of managing an architecture where you have to have multiple sources and it's hard to reconcile between them. It supports AI and ML workloads, in addition to serving the classic SQL-based data consumer analytics, and it reduces cost because object storage is one of the most cost-efficient mechanisms for storing a bunch of data. With Delta Lake at the bottom, where again you can store all your data, structured, semi-structured, unstructured, you now insert a metadata and caching layer that brings all those benefits of transactional capabilities and high performance. And through the APIs, you're now able to present this data to both SQL-based analytical tools, such as Starburst and Trino, as well as machine-learning and data science tools, such as DataFrames, which were introduced within Spark and are now broadly adopted in other tools.
So I think Justin introduced Trino very briefly. I would like to spend just one more minute explaining why it's the highest-performance SQL engine in the open source community. At the very least, we have huge adoption from companies like Netflix, LinkedIn, Lyft, and others, who are basically embracing this form of architecture in the open source world. And now we are making it possible for a broader enterprise user base. By high performance, what we really mean is that we are focusing on interactive ANSI SQL queries at high concurrency and at a scale that is not achievable cost-efficiently by proprietary solutions. And through the separation of compute and storage, which plays nicely into the cloud paradigm and the lakehouse architecture specifically, you can manage both cost and optionality in terms of picking where you want to put your data.
So we can run SQL on anything, plus federated analytics as well. And last but not least, the great benefit of this architecture is that you can deploy it anywhere, be it on-premises, in the cloud, or on Kubernetes. It's very flexible that way and allows you to maintain this extra level of optionality in the infrastructure you're operating on. Now, to make efficient use of the capabilities that Delta Lake brought into the data lakehouse architecture, we at Starburst invested in writing a native connector for Delta which taps into the performance advantages and capabilities that I discussed earlier. So we support the Delta transaction log, we're able to find which files are part of the current snapshot of the Delta table, and we can leverage the statistics it offers to support both data skipping and dynamically filtering down to the data that's actually needed for your query, to get you the most benefit on the IO side of things, as well as use some of those statistics to make smart decisions within our optimizer and decide on the join order and the join strategy overall. Those are the benefits of using the native integration with Delta.
Now obviously, after introducing this capability, we invested in performance analysis to understand the benefits it brings. On the simplest benchmark we ran internally, based on TPC-H, we observed at least a 2X speedup across all the queries in the benchmark, with the highest performing query, which was mostly scanning a single large table, seeing up to a six times speedup over just reading the same data as a collection of Parquet files in a full-table-scan manner. But where we were very positively surprised was when we deployed this particular integration to our early customers: they got speedups of over 10 times. That basically brought query responses down from minutes to seconds, created excellent demand for this capability among the analyst community, and basically made it more approachable for everyone to run more queries and reveal additional insights.
Now if you look holistically at how the data flows in your data architecture, you'll see that we are embracing the Databricks-suggested architecture of ingesting data into a bronze layer, a set of tables holding the data as it arrived into the system, whether through batch or streaming ingestion pipelines. The next stage is to refine some of this raw data into the silver layer, which can already be used by our customers to run discovery analytics. However, in some cases you want to refine it further, maybe aggregate more, and create the so-called gold layer, where this data is ready to serve use cases such as predefined reports or dashboards. And Starburst can actually read from all those Delta tables directly, and depending on the use case you pick the right set of tables to power your SQL analytics.
At the same time, the same data is obviously accessible to your other computational frameworks, machine learning and AI done with Spark and Databricks. And because it's all an open format, you can pretty much bring your own tool to run additional analytics for both your data scientists and data analysts across the organization. Going just a little bit more into the details of data ingestion and transformation, we imagine that some of this data, especially if it's machine-generated events, is ingested into Delta in real time, and that happens through Spark streaming and other ingestion technologies that efficiently populate the data at the back-end. You might have some other data sources, like ERP systems, where it's probably okay to load the data every hour, since it doesn't change that often and the volume is much lower. And with all the data landing there, you can now combine this information together, correlate the necessary information, and create those additional layers of refinement so that the analytical workloads you serve at the back-end are more efficient. And that's all achieved with Spark and related technologies.
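
Here is a minimal sketch of that ingestion flow with Spark Structured Streaming, landing raw Kafka events in a bronze Delta table and refining them into a silver table. The broker address, topic, storage paths, and event schema are all assumptions for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("bronze-silver-sketch").getOrCreate()

    # Bronze: land the raw events exactly as they arrive from Kafka.
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical
           .option("subscribe", "events")
           .load())

    (raw.writeStream
        .format("delta")
        .option("checkpointLocation", "s3://my-bucket/checkpoints/bronze_events")
        .start("s3://my-bucket/bronze/events"))

    # Silver: parse the payload and drop obviously bad records so analysts
    # can run discovery queries against a cleaner table.
    event_schema = StructType([
        StructField("device_id", StringType()),
        StructField("event_type", StringType()),
        StructField("ts", TimestampType()),
    ])
    silver = (spark.readStream
              .format("delta")
              .load("s3://my-bucket/bronze/events")
              .select(from_json(col("value").cast("string"), event_schema).alias("e"))
              .select("e.*")
              .where(col("device_id").isNotNull()))

    (silver.writeStream
        .format("delta")
        .option("checkpointLocation", "s3://my-bucket/checkpoints/silver_events")
        .start("s3://my-bucket/silver/events"))
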
Now at the data consumption layer, this is where we'll be running analytics using SQL, so we'll be going through a set of BI tools or SQL editors, such as Looker, Tableau, Power BI, or DBeaver, as an example. And you'll be running your classic SQL workloads now, not against your data warehouse, but against a combination of Starburst, Trino, and Delta Lake tables. And you can obviously run your custom analytics through any of the drivers, JDBC, ODBC, and language-specific libraries. And what you can do is not only run basic SQL against your Delta tables; you can actually bring in additional sources from outside your data lake at query time. If you have some legacy systems and you need to quickly query that information, or if you have some NoSQL engines like Elastic, for example, you can bring in this data at query time without loading all of this information upfront, if that's the query pattern you're experiencing.
Now, beyond SELECTs and read-only analytics, Starburst offers additional capabilities such as global security and access control via Apache Ranger and Privacera integrations, which work against both Delta Lake and the data lake as well as other data sources. And for Delta Lake specifically, it's also about data manipulation. In our latest release we're excited to bring additional capabilities such as CREATE TABLE AS, INSERT, UPDATE, and DELETE, so you can run all those operations from Starburst Trino directly. And with that, I would like to invite Richard Jarvis from EMIS Health to talk about their vision and adoption of the data lakehouse architecture, with Delta specifically powering their analytics.
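
Here is a minimal sketch of those write operations issued through the Trino Python client against an assumed Delta catalog; the host, catalog, schema, and table names are hypothetical.

    import trino

    conn = trino.dbapi.connect(host="starburst.example.com", port=8080,
                               user="etl", catalog="delta", schema="sales")
    cur = conn.cursor()

    # CREATE TABLE AS: materialize a refined Delta table from a raw one.
    cur.execute("""
        CREATE TABLE orders_clean AS
        SELECT order_id, customer_id, amount
        FROM orders
        WHERE amount IS NOT NULL
    """)
    cur.fetchall()   # consume the result so the statement finishes

    # UPDATE and DELETE run directly against the Delta table.
    cur.execute("UPDATE orders_clean SET amount = 0 WHERE amount < 0")
    cur.fetchall()

    cur.execute("DELETE FROM orders_clean WHERE customer_id IS NULL")
    cur.fetchall()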

Richard Jarvis: We started out about two years ago when we built our cloud analytics platform with a lakehouse architecture. And actually the pandemic has proven in my mind why that architecture has worked well for us, because when we started we weren’t designing to battle a global pandemic, because in 2018, we didn’t know one was coming any more than anybody else. But what we have been able to do by collecting data from tens of millions of patients and from pharmacists and researchers, is to understand the spread of COVID and the impact on other diseases and people maybe not having the treatment that they would normally get and be able to empower the NHS and other researchers to basically help patients get better quicker, improve the vaccines, understand the vaccine rollout strategies… And the number of different access patterns to the data that those different users have means that we need something very flexible, incredibly secure, and scalable when people need it to scale.
Just over the weekend we vaccinated in the UK something like 600,000 people in one day. That's a lot of data to record. It's all very sensitive, but it has to be real time. We have to be able to put it in the right people's hands so that they know that information is securely ready for analysis. It's a very large amount of data, but it has to be curated. We need to put this data, from its raw form, from all these different sources, into one place so that it can be safely analyzed. And really, that means using tools, and data virtualization really does help us here. So we can take the hundreds of billions of records that have come in, we can process them in a data lake-style approach where we might be doing something over hours, and we can insert it into something more relational-like in performance terms, or into something like Databricks' Delta Lake format, and then provide high-speed query performance for our data scientists and researchers.
So in this case, where we've got structured data but at very large scale, the bringing together of data lake and data warehouse type capabilities, which data virtualization allows us, has really been very valuable.

Kamil Bajda-Pawlikowski: Thank you, everyone, for joining us today. Thanks.

Justin Borgman

Justin Borgman is a subject matter expert on all things big data & analytics. Prior to founding Starburst, he was Vice President & GM at Teradata (NYSE: TDC), where he was responsible for the company...

Kamil Bajda-Pawlikowski

Kamil is a technology leader in the large-scale data warehousing and analytics space. He is CTO of Starburst, the enterprise Trino company. Prior to co-founding Starburst, Kamil was the Chief Architec...

Daniel Abadi

Prof. Abadi performs research on database system architecture and implementation, especially at the intersection with scalable and distributed systems. He is best known for the development of the stor...