Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch and More!)


Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, Presto has in the last few years experienced unprecedented growth in popularity in both on-premises and cloud deployments over object stores, HDFS, NoSQL and RDBMS data stores.

Delta Lake, a storage layer originally invented by Databricks and recently open sourced, brings ACID capabilities to big datasets held in Object Storage. While initially designed for Spark, Delta Lake now supports multiple query compute engines including Presto.

In this talk we discuss how Presto enables query-time correlations between Delta Lake, Snowflake, and Elasticsearch to drive interactive BI analytics across disparate datasets.

Speaker: Kamil Bajda-Pawlikowski

Transcript

– So hello everyone, this is a presentation about Presto, the fast SQL-on-anything engine. It can query Delta Lake, Snowflake, Elasticsearch and much more, and I'm happy to take you through this presentation today. I'll touch on what Presto is, and then Starburst, my company, which sponsors the project. I will then discuss our recent Delta Lake integration; if you're a Databricks user, you are more than likely using Delta Lake storage, and we will discuss how we natively integrated with that table format. Then I would like to zoom out for a bit and tell you more about the data platform architecture as we see it, and how Spark, Presto and other technologies fit together to deliver the most value for your data processing. And then I'll cover use cases that we often see among Presto users and Starburst customers.

Okay, so, on to Presto and Starburst. Presto is obviously an open source project driven by a wide community. At its core, it's basically a high-performance, parallel analytical engine. It speaks ANSI SQL, so it's very much compliant with all the traditional SQL statements you might be using. It focuses on interactive, very fast SQL processing, unlike engines such as Hive or Spark, which are more geared towards batch processing. Presto is proven at scale: it runs on hundreds and thousands of machines at major companies around the world, and it's been around for over eight years by now, so it's already proven in some of the most demanding use cases you may have. The other focus, in addition to interactive query speed, is high concurrency. You can fire tens and hundreds of Presto queries at the same time and see the system handle that very well, which is also in contrast to the Hadoop-born SQL processing you may be familiar with from the past.

One really important design point for Presto is the separation of compute and storage. It was initially done to allow Presto to connect to many different storage engines, but the benefit goes beyond just this federation capability: it actually makes Presto very well suited for running in the cloud and cloud-like environments. Maybe that's something people were not much concerned about eight years ago, but right now this is a super important feature for everyone. Because of this separation, with SQL compute being separate from storage, Presto can run beautifully across many different storage engines; we'll discuss some of the federated query use cases later in this presentation, and a small configuration sketch follows below. Another aspect of Presto is that it's actually very easy to deploy anywhere. Many customers and Presto users run Presto on traditional on-premises clusters, on bare metal, and increasingly on on-premises Kubernetes deployments; and obviously the cloud is a major environment for deploying Presto as well. So it can be deployed pretty much anywhere, including hybrid environments, which is really cool in the modern world where many customers have data spread across many environments: sometimes cloud, sometimes on-prem, very often a mix. So that's really powerful.

So who are some of the biggest Presto users? It all started at Facebook in the early days, and it was very quickly adopted by companies like Airbnb, Dropbox and Netflix.
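To make the connector model described above concrete, here is a minimal sketch of how data sources are typically exposed to Presto as catalogs, via small properties files on the cluster. The file names, connection details and table names here are placeholders, not from the talk:

    # etc/catalog/hive.properties -- a data lake catalog
    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://metastore.example.com:9083

    # etc/catalog/postgresql.properties -- a relational catalog
    connector.name=postgresql
    connection-url=jdbc:postgresql://db.example.com:5432/sales
    connection-user=presto
    connection-password=secret

    -- After restarting the cluster, both sources are queryable side by side:
    SHOW CATALOGS;
    SELECT * FROM postgresql.public.customers LIMIT 10;

Each catalog then appears as the first part of a fully qualified catalog.schema.table name, which is what makes cross-source joins possible later on.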
And I think spreading this really powerful engine initially in Silicon Valley, then expanding to the rest of the US and then globally, made Presto really powerful. We now have big open source users such as Facebook, LinkedIn and Lyft, and many, many others, as you can see from the names on this slide. And increasingly over the years, as Presto matured and was able to handle more complex use cases, we saw the benefits of Presto being leveraged by larger enterprise companies that were increasingly faced with the same challenges the internet-scale companies faced in the early days: data measured in petabytes, diverse, spread across many sources, and you really want to run powerful SQL analytics over that data. You can see some examples of how big Presto clusters get and what kind of concurrency people may have, often into hundreds of thousands of queries a day. It's really proven at that scale already.

So, here at Starburst, we are the biggest sponsor of the project. We currently employ all the major contributors and committers to the project, and I think we're responsible for over 80% of all the code written to date for Presto. And we are still investing heavily in the open source community. It's super important for us to drive the project forward and make sure it realizes the vision of running interactive, high-concurrency analytics across your big data in this open source manner, with compute fully separated from storage. As a company we were born three years ago, and in that short period of time we've experienced tremendous growth and serve over a hundred enterprise customers by now. The focus, as I mentioned, is the open core Presto with all the value of ANSI SQL, flexible deployment and high concurrency: offering rapid time to insight, minimizing your need to do expensive ETL just to do initial investigative analysis, lowering the cost of running queries at scale, and adding enterprise-grade security and the expertise to support you and make you more successful with this technology.

The major thing we've focused on at Starburst is investing in performance. Performance is a really important aspect of every single SQL engine, because performance really means lower cost, especially if you're running in the cloud; and even on premises, being able to run more queries means more value from your analytics. So investing in performance is a bet you always win. As for additional features: Presto comes with a number of connectors to different data sources out of the box in open source; however, we have some enterprise additions like connectivity to IBM Db2, Teradata, Oracle, Snowflake, et cetera. And then security: there's built-in security in Presto already, but if you want fine-grained access control, data encryption, masking and auditing, this is what Starburst can help you with. Then there's tooling around management, monitoring, configuring the clusters and simplifying the deployment. And finally, we are obviously the leading experts on Presto itself. We employ both the major contributors and the original creators of the project, who came to us from Facebook, and we stand behind the project, support it, and continue to harden it to serve you better in your enterprise environments.
And some of the customers we serve are spread across many different industries: companies in tech, as you can imagine, Slack, Tesla, Grubhub; lots of leading global finance companies and banks; in retail, one of the largest retailers in Switzerland and Germany is among our biggest customers; and also media companies, healthcare, and really every other industry. So it's widely deployed at large scale at leading organizations around the world, and we strive to make it even more powerful for more customers.

Okay, that was the background on Presto itself and Starburst. Now I'd like to discuss a little more how we view the storage piece, which is not something Presto handles natively; rather, it integrates with a variety of options in this market. Delta Lake is obviously an open source project created by Databricks, and I think it was a really powerful addition to the way of storing data in data lakes. The feature it brought was ACID properties over object storage. Being an open source project, and leveraging the Parquet file format plus the scalability and cost efficiency of object storage, it really brought fundamental capabilities that enterprises needed but couldn't get from plain object storage. The particular things in Delta that we really like are schema evolution, time travel, and what impacts performance a lot: on top of Parquet being fast and object storage being efficient, having really detailed metadata and statistics at the metadata level allows us to run the same queries against the same Parquet files in the end, but with much better performance, thanks to advanced data skipping and being able to leverage partitioning, Z-ordering and other features. It's really super cool. So, thank you Databricks; and we worked a lot with engineers from the Delta Lake project to build our integration for it in Presto.

The effort really started from our own interest. Well, it feels like it's really just a bunch of Parquet files, and Presto can read those no problem today, so what's natively available in Delta that we could leverage to make those queries faster? After talking to the community and to the Databricks engineers behind the project, we realized we should create a native connector for Presto, written from scratch, that leverages the features that go beyond just being able to read Parquet. So we support reading the Delta transaction log directly; we process that information, and it informs which files we should read and how. Leveraging the metadata available there lets us skip data: some files may not even contain information needed for a given query, so all those min/max indices and so on help us a lot. In addition, we leverage dynamic filtering in Presto: if there are predicates we can derive at runtime, we push them down to support further data skipping based on the Delta metadata. And then the statistics themselves are super useful for optimizing query plans: we are able to reorder joins in a Presto query based on the sizes of the tables, and all the available statistics help us decide whether a join should be a broadcast join or a distributed join. That way we get even faster performance.
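To make this concrete, here's a rough sketch of what configuring and querying the Delta integration can look like. The exact connector name and supported properties depend on the Presto/Starburst version, and the catalog, schema, table and column names here are hypothetical:

    # etc/catalog/delta.properties -- assumed connector name; check your distribution's docs
    connector.name=delta-lake
    hive.metastore.uri=thrift://metastore.example.com:9083

    -- A Delta table is then queried like any other table; the connector reads the
    -- transaction log and uses its file-level min/max statistics to skip files
    -- that cannot match the predicate:
    SELECT device_id, max(temperature) AS max_temp
    FROM delta.iot.sensor_readings
    WHERE event_date = DATE '2020-06-24'
    GROUP BY device_id;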
So when we developed this native Delta reader and connector for Presto, we ran some standard benchmarks like TPC-H internally, to really see: it's great that we can speed things up at the table scan level, but what about the overall workload? And what we saw were really encouraging numbers. Some of the queries were as much as six times faster versus just reading the same Parquet files without the Delta Lake metadata, and overall, across the 22 queries that define this benchmark, we saw an average speedup of 2x; the queries also compute joins and do lots of complex operations, and you obviously cannot save on that part, but the table scan was vastly improved. What we were even more impressed with was that when we shipped this release and this connector to some of our customers who use Delta Lake today, they were actually able to see performance boosts of 10x or more on some of their workloads, thanks to data skipping and really more efficient processing. So that was really powerful: benchmarks that reflect the workloads customers are actually running, and in practice we can get even better performance than that.

So let's zoom out for a bit: when someone has Databricks and someone has Presto, how do you really mix all those technologies together? And in this talk I promised to discuss how we also interact with Snowflake, Elasticsearch and other tools. Our vision for the data platform that Starburst builds, with Presto at the core, is shown in this diagram. Your end users, whether they are data analysts or data scientists, and from whatever department, whether that's finance or marketing, are really not changing their interaction pattern, the way they interact with the data: they're still in front of their BI tools and SQL editors; examples include Tableau, Looker, Power BI, Qlik, et cetera. They can always just type SQL directly if they need to, and leverage Python and other tools to issue those queries. In the end, the SQL statements generated by those tools, or typed in, are sent to Presto, which is the core of our data consumption layer. It takes care of all the computation as needed and applies all the security rules, like data masking, encryption, auditing, and access control down to the column and row level, and really allows you to leverage the power of all the data you have in your ecosystem.

The vast majority of the data may live in your data lakes, in Hadoop, whether that's on-prem or cloud; we support the Hortonworks, Cloudera and MapR distributions and all the leading cloud deployments over Azure Blob Storage, ADLS and obviously Google Cloud Storage. Object storage on-prem is growing as well, so we can tap into Red Hat Ceph or IBM Cloud Object Storage or Dell EMC ECS and a bunch of S3-compatible object stores you can deploy yourself. And Delta is one of those recent additions: really a transaction-log-based table format on top of that storage, coming to Presto to help you get the most out of the ability to store large amounts of data inexpensively and then expose it to query engines like Presto.
But not all the data lives there. In many enterprises you also have relational databases, whether that's Postgres, MySQL or SQL Server, or maybe a data warehouse from Teradata or IBM Db2, or Snowflake, Oracle, Redshift, BigQuery, and, I guess I don't see it on this slide, Azure Synapse; all of those are supported by Presto as well. So you may keep some of the data there and still be able to access it through Presto. And then, interestingly, you can also query data that wasn't originally meant to be queried with SQL, like MongoDB or Elasticsearch, Cassandra, Redis and others; Kafka and event logs are also supported. So you really may tap into all the data you have in your organization, run SQL statements across all of it, mix and match, correlate information, and get the most value out of it; a small sketch of such a cross-source query follows at the end of this section.

Given there are so many different tools and technologies for running SQL these days, you may wonder: why not leverage the power of all of them? Our view is that in most cases they actually complement each other very nicely. Certainly Spark, on the Databricks platform or EMR or from other vendors, is really powerful for handling streaming ingestion, machine learning and data transformations, larger batch jobs; for initial data investigation I think it's a perfect solution, and it can do many other things as well. Presto really excels at fast federated queries, where the majority of the data will be in your data lake but you may have data beyond that; at high-concurrency SQL queries; and at rapid investigation and optionality, because it's open source, you can deploy it anywhere you want, and you can move where the data lives while the end users remain unaffected, because they are really working with semantic views on top of the real storage. And then the data warehouse technologies like Snowflake or Teradata or Redshift, et cetera, are also important, because they are really, really fast; they run queries probably faster than anything in open source and can be a very powerful addition for some of the use cases, at the expense of having to load the data into them, invest in that ETL process, and have the data locked into that environment. We feel it's basically a great addition to the ecosystem, and with Presto you can actually query them as just another data source; so we really see them as a complementary play in the market.

And then, broadly speaking, if you're in a cloud data ecosystem, and this is an example from Azure, obviously, but similar pictures can be drawn for Google or AWS, you have a lot of different technologies and tools that play different roles. Where we see Presto fitting in is essentially as the central point where you can access the data wherever it's stored. Delta Lake is a very important mechanism for storing vast amounts of data, you can leverage ADLS, S3 or Google Cloud Storage, and then there are all the other sources, whether they are deployed in the cloud or still on-prem, and sometimes in hybrid environments as well. So we can go beyond a single cloud to grab the data from wherever it's available, expose it to your BI and SQL users, and make this a really powerful single point of access, rather than forcing you to load the data into one place before you start querying.
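As promised, here is a small sketch of such a cross-source query. All catalog, table and column names are hypothetical and only illustrate the catalog.schema.table addressing that makes federation work:

    -- Join a data lake fact table with a dimension kept in PostgreSQL and
    -- session data from Elasticsearch, all in one statement:
    SELECT r.region_name,
           count(DISTINCT w.session_id) AS sessions,
           sum(s.amount)                AS total_sales
    FROM hive.sales.transactions s
    JOIN postgresql.public.regions r ON s.region_id = r.region_id
    JOIN elasticsearch.logs.weblogs w ON w.customer_id = s.customer_id
    GROUP BY r.region_name
    ORDER BY total_sales DESC;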
And then the way we advocate deploying Presto, in most cases, is on Kubernetes. We think this deployment platform yields the best benefits in terms of auto-scaling, fault tolerance and simplifying the operations overall, and it's portable across clouds and on-prem, so that's very powerful as well; but more about that separately.

Now, on to the use cases. I think it's really important to understand how you would apply those technologies in a real environment. As I mentioned, the data is going to flow through different layers: your IoT information may be ingested and streamed into your Databricks environment and stored in the ingestion layer of Delta Lake. Then it gets refined through Spark jobs into a silver layer, where it's already good for analytics but not necessarily optimized for BI; that's why you have aggregates stored in the gold layer of Delta Lake. And when you then want to consume this data at fast, interactive speeds, Presto is really a great addition, and because not all the data is stored in Delta, we can tap into Elasticsearch, Snowflake and other sources to bring in additional auxiliary data, to join and correlate information across them and serve the BI users. So this is a perfect example of how Databricks and Starburst can be deployed together in one environment. As I mentioned, you'd operate the Databricks environment for the ingestion, the transformations, modifying and merging the data together and creating those tables of Delta data; they'll live there and then be leveraged by additional compute layers. And Presto would handle the query-time data federation, right, the single point of access to the other sources. We believe that bringing together your relational data, as well as NoSQL, weblogs, and Delta Lake and other formats coming from object storage, is very powerful. And we apply security and all the additional rules so that BI users are really only accessing what they are allowed to, and can run queries without worrying about where the data is originally stored. Because while you can spell out in your Presto query exactly where each piece of data is, you can also create views on top and hide that complexity from users.

And this is really an example here of how the data consumption and analytics layer is enabled by Presto and the BI tools; because everything is delivered over JDBC and ODBC drivers, we can power pretty much every tool on the market. The sample query I have here actually shows you in detail how you can construct queries and point to exact tables stored in a specific data store, like this IoT data from Delta, customer data from Snowflake, and weblogs stored in Elasticsearch. One query arrives at analytics insights and correlations: the query in this example tells you about all the customer IoT events that were gathered today, but only for customers that were active on your website earlier this year, aggregates this information, and shows you, for this segment of your customers, how active their IoT devices are; a rough reconstruction of that query follows below. It just shows you the power: you don't need to load all this data into one place and wait days and weeks for ETL; you can run this query today and find interesting insights that you should act on in your business. So with that, I thank you all for attending my presentation. I encourage you to start leveraging Presto in your architecture.
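The exact query lives on the slide; as a hedged reconstruction of the pattern just described, with all names made up for illustration, it could look roughly like this:

    -- IoT events from Delta Lake, customer attributes from Snowflake, restricted
    -- to customers seen in the Elasticsearch weblogs earlier this year:
    SELECT c.segment,
           count(*) AS iot_events_today
    FROM delta.iot.device_events e
    JOIN snowflake.crm.customers c ON e.customer_id = c.customer_id
    WHERE e.event_date = current_date
      AND c.customer_id IN (
          SELECT DISTINCT w.customer_id
          FROM elasticsearch.logs.weblogs w
          WHERE w.visit_date >= DATE '2020-01-01')
    GROUP BY c.segment
    ORDER BY iot_events_today DESC;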
We have a promotion right now where you can start using the Delta Lake reader right away: download the software and try leveraging the power of Presto and Delta together. Thank you very much.


 