As organizations embrace data-driven decision-making, it has become imperative for them to invest in a platform that can quickly analyze massive amounts and types of data. With their data lakes, these organizations can store massive amounts of data in cheap cloud object storage. And with Delta Lake, we can provide ACID transactions to make your data lakes reliable. But a key component to making your data lake into a lakehouse – providing the simplicity and features of a data warehouse to your data lakes – is the ability to perform ad hoc or OLAP queries against your data lakes.
With the Photon powered Delta Engine your lakehouse can now – easily – become a reality. The Photon engine is written from the ground up in C++ to take advantage of modern hardware and capitalize on data-level and CPU instruction-level parallelism. It optimizes text processing and regular expressions to enable fast performance on real world data and applications. It is fully compatible with Apache Spark™ APIs to ensure workloads run seamlessly without code changes. For example, a benchmark test running the same Spark SQL query on Photon was 20x faster than Apache Spark™ 2.4. Come join us to learn more about how you radically speed up your Spark SQL queries with the Photon powered Delta Engine.
Speaker: Todd Greenstein
– Hi, I’m Todd Greenstone part of the product management team at databricks. And today we’re gonna talk about photon. Which is our new purpose built engine for accelerating workloads in databricks. And this is our journey to more speed. So, SQL analytics is a completely new product for us. And what this is, is, this is a product that is born out of customer need and a lot of discussion with customers on what they’re looking for to get the best optimal experience out of SQL on top of their data lake. We kept hearing the same thing from our customers over and over again that in order to actually have the best experience for their users, why do they need to keep that data in motion? Whether they need to move it to expensive downstream caching systems? Just in general, why can’t the data be actionable and useful to analysts when it sits on top of the data lake and where it lives. And this is kinda been our guiding North star. And the way that we’re going after this is in three primary focus areas. The first of these is the SQL user experience. A lot of us have to wear the SQL analyst hat, more often than not. You know, I do a lot of data science and data engineering work within databricks. And quite often I need to run quick ad hoc queries or to build simple dashboards. And, you know, being able to do this within the SQL analytics product is, you know, an actual it’s something that I find extremely useful. And this is an area that we’ve been focusing on heavily. So anyone that needs to, you know, actually wear that hat, from time to time, which, a lot of us don’t have, you know, just one hat anymore we basically have these workloads and things that we need to do across the platform that requires to, you know, basically assume many different personas. And to be able to do this natively with the BI tool on top of databricks, is something that we wanted to make completely easy and democratize data for everyone to be able to do this and that’s where this product comes in. And then we wanted to make this really easy to administer. So from an administration perspective it’s, you know, a very stripped down easy to use experience from when you’re in SQL analytics, you can basically build clustering resources that will scale in a very meaningful and easy way behind a single endpoint. And on top of that, you know, you don’t have to worry about, hey, this particular group needs to use this cluster, this group needs to use this cluster, you can do all of this securely and define these resources and make them exceptionally easy to scale all from a very simple interface. And then of course, what we’re gonna talk about today is how we’re improving performance. And we’re doing this through the new Photon engine which is our new purpose built engine. That is proprietary to databricks. So SQL analytics, how this came about within databricks is, you know, when we started talking to customers, it’s been our kinda guiding North star that everyone wants to have, you know, be able to do super easy and simple BI and have it really fast and have this available when you know, infinite available compute resources on the platform, be able to do this at any time. And this has kinda been, you know, something that’s been our guiding North star. And this stack was born out of these discussions with customers in this need. You know, this stack is very much focused on performing Structured Query workloads within the platform. And it uses re dash at the front end, which is the BI tool that’s native within databricks, sits on top of the Delta engine, which is powered by Photon and then sits on top of delta at the core. You know, delta is the best open source proprietary format for building scalable data lakes. And this is something that we passionately believe in and we think it’s the best way and our customers have echoed this over and over again. So and then of course, one thing I’ll mention here is, you know, if you’re using the stack you certainly do not have to use re dash which is the front end tool that’s native within databricks but it’s certainly there doesn’t cost you anything just know how to use it and it’s a very easy way to administer the product and provide compute resources for your analysts. So, the thing about re dash is how this came to databricks is we started talking to customers a lot about what they were doing in terms of BI and how they wanted this to improve in databricks. We kept hearing over and over again, about this tool called re dash. And, we actually started looking at the product we actually made the acquisition of the company and brought this into databricks. And super easy to use, we’d like to say that, you know, anyone that spends 15 minutes in re dash is an expert. And from here, you know, this is the way that we administer the SQL analytics side of the product and it’s available clicks from an app switcher. And, you know, we’ll show that in a demo here in a minute. And sort of the next layer of you know this stack is the Delta engine. And, you know, this is our demarcation point from open source Spark. If you think about all the things that databricks provides on top of sparks this is where this is all located within the Delta engine. This is the databricks runtime it’s all the optimizations that we provide for being able to, you know, query Delta in the fastest way possible and to be able to perform, you know, a very efficient scalability reads and writes. And one of the things that we’ve done here to try and make this experience even better is we brought in a completely new engine, which is Photon. So Photon is a purpose built engine that we’ve written. It’s in C++. It doesn’t use any managed code. It’s a native vectorized engine from scratch. And Alex is doing a seminar during this conference that you should definitely attend where he talks about in, you know, very explicit detail on what vectorization is, and why it’s so important for us to build an engine that does this. And then the thing about this is, this engine is purpose built for speed. So if you think of you know how fast the databricks runtime is, you know, this engine runs you know, several times faster than that especially focused on Structured Query workloads. And the thing about this is, you know, it’s super important for our users that we were coming to market with a new engine, that we don’t build anything that has vendor lock in natively built into this. And Photon certainly does not. You know, being this proprietary engine that’s written from scratch and C++, and it’s purely focused on speed, and takes advantage of, you know, a lot of the new hardware optimizations that are out there this engine is transparent from the user perspective. It uses the open source spark API’s, it doesn’t create any vendor lock in, and you don’t have to do anything to your workloads to be able to take advantage of this. Your workloads should just seamlessly run here, natively and automatically. All you have to do is just when you’re building compute resources within, you know, the SQL analytics you know, front end, you basically say hey, look, I wanna use Photon as my engine to power this and you’ll be able to start testing and playing with this. So just wanted to, you know, put that out there and be explicitly clear that, you know, this engine does not create any vendor lock in to be able to test and take advantage of. And to get the most out of using this if you wanna start playing with Photon which we hope you do, I’ll say this, is that, you know, Photon is great at acquiring large data sets, you know, it’s really, really good at doing aggregations. And if you’re joining, you know, one set of data to another, this is what photon you know, exists for. It’s great for doing repeated slicing and dicing. You know, it’s super fast and great at computation and aggregations, joins, this is what it’s meant for. Photon, you won’t get the optimal performance if you start you know, trying to run this on really small datasets. You know, if you run this on a data set that fits on your smart watch, you know, you’re not gonna get the biggest advantage out of Photon. And then you know, anything that’s really I/O bound, and spends all of its time, you know, in I/O isn’t gonna benefit from using the Photon engine. And then, you know, really short, small queries the problem with this is they spend so much time within the scheduler that, you know, the engine isn’t going to provide any benefit here. So, just be aware that when you start testing this, you know, we really wanna see, you know, the best performance we can get, you wanna do this on large data sets and start testing that way. So, you know, what’s currently supported in Photon today, you know, operators wise, you know, we have Filter/Project, Expand/HashAggregate or HashJoins, you know, we can do Union/ScalarSubqueries, NestedLoopJoins/Null-Aware Anti join, we provide a huge, you know, performance benefit for Null-Aware Anti join. You know, this is something we just been just reminded me we’ve been testing this the other day and you get a massive performance when using this over what open source spark can do today. And then of course, the Shuffle. The data types that we have now are boolean, byte, short, integer, long, floats, doubles and string. And then we have limited support for date and timestamp. You know, it’s supported as a data type but all of the you know, function list that goes along with this has a very long tail. And this is Something that we’re, you know, implementing currently. We don’t have all of them supported. But you know, a lot of the most common usage for date and timestamp are supported. And then of course, we do support decimal. And there’s a huge performance benefit, when using decimal. As far as what we’re focusing on in the near term is we really wanna get binary and nested data type support. The first of these for nested data is we’re gonna build and instruct, and then after this, we’re gonna tackle arrays, map, and then potentially others down the road. And then of course, sorting and window functions are also not part of our not currently supported. The thing that I’ll mention about these is most of these are late stage processing operations. So they mean that while they’re still performance that we can wring out and you know, do better than what you can get out of Spark this isn’t gonna be the biggest bang for the buck right away, which is why we’re really targeting things like you know, better support for nested type. And, you know, as far as expressions we wanna have a really good, you know, list of most of the common expressions that folks are using today, and the ones that they wanna use in the future, we wanna make sure that we tackle those. So if you wanna test this today, you know, how can you do that? The best way to do this, is, within SQL analytics. You know, we do all the heavy lifting for you. So you can actually, you know, build a cluster, super easy to configure, it’s just a few mouse clicks and you know, behind us behind an endpoint and from here, you can actually start testing Photon right away. The thing about this is you don’t have to use, you know, re dash to actually do your testing. It’s certainly the easiest way in my opinion to do it because it’s there. You can , you know, your BI tool of choice. And you can start doing that, you know, if you wanna use Power BI or you’re you know, you’re using Tableau, you can certainly connect those as well it’s super easy to do as well on the side of the product. And then of course, we give you some really great debugging, that’s available within the history service. What I love about this is, we give you a lot of Photons specific information. So if you’re running this and comparing of what you might have gotten previously and what you were doing, you know, elsewhere within the platform you can start testing and you can basically see, okay, this random Photon this is how long each stage took. And you know, this is the part you know, this is the percentage of my query that actually ran in Photon. we provide all of that information to you in real time. So pretty easy to do this. And of course, we give it to you in spark UI. So I’m gonna go ahead and walk through kind of what a demo looks like. And I’ll show you how to do this. So what we’ll do is we’ll create an endpoint with Photon, and we’ll actually do some queries and kind of check the results. And I’ll kinda walk you through so you can see how debugging works and how easy this is. So let me go ahead and do that. Okay, so when you come into databricks, once you start using SQL analytics, you’ll see that there is an app switcher that shows up in the bottom left-hand side of your navigation pane. So when you click on this, you’ll say, Okay, I’m currently in the workspace so I don’t wanna that. But I’d like to try SQL analytics. So click on this. And from here, now you’re in our re dash front end. And again, purpose built, brought this into the platform for BI, super easy to use, it’s a tool that we all love. And the first thing you need to do is, actually have an endpoint to be able to start querying. So if I click on endpoints, we can go ahead and create one. So I’m gonna go ahead and click Create here. And here, you know, if I just name this todds test endpoint, if you see here, this is a cluster size. So automatically, what we do is we select nodes that are based on the best performance option that we can provide on top of the cloud vendor that this is running on. Like currently today, from AWS, we use part of the I32XL family. And then if you’re on , we use the LS or Lsv2 family. And here, you know, you can see for if you when you’re building a cluster, you say, Okay, these, you know, extra small all the way up to triple extra large. With these are this is one node, and then powers of two after that. So 2,4,8,16,32, all the way up to 128 node. I typically use a large when I’m testing just cause it’s kind of the perfect blend, you know, 16 node cluster I’ll go ahead and I click on this, and if you wanna use auto stop, you know, you turn this on to 120 minutes, which is pretty aggressive, or sorry, pretty lazy, I would usually use something a lot more aggressive like 15 minutes, but the purpose of this demo 15 minutes is fine. And then multi cluster load balancing. I’m not gonna use load balancing when I’m testing Photon, I can, but I don’t need it for the sake of just testing now. And here when I wanna use Photon, this is the easiest way to do it. All you do is you just click Photon. And then we say hey, this is still in preview please don’t run all your production workloads here and enable Photon. And now when I click on this, it will go ahead and create a cluster for me with all of these configurations. So, and then as soon as I click create, one of the things that’s neat here is, connection details. If I click on this, you can see these are the parameters that I would use to connect my BI tools. So if I’m gonna connect my BI tool and start playing with, you know, you know when to use Tableau to do this, this is where I would connect to that. And then of course, we give you great monitoring information. And then of course, back to the overview. In the interest of time, I’ve already got an endpoint that’s up and running, I don’t have to grab the VM resources from the cloud provider. So I’ll show you exactly how this works using that endpoint. So we’ll go back and we’ll see, see todd’s test endpoint is starting, but I already have this one that’s up and going this Photon demo, so we’ll go ahead and use that. So I’m gonna go ahead and click on queries. And here, I’ll go ahead and open one that I was working on earlier, which is Photon demo. And what I love about this is, if I edit the source here, I can basically now I’ve got my, you know, typical SQL I can again save this again. I can publish the dashboard from here. Very easy to use, super self explanatory. What I’ll do is, I’ll just go ahead and execute this query. But before I do that, I wanna show you on the left-hand nav, this gives you all the information of all of your metadata for a specific database, I can choose different databases here if I want, but I guess for the sake of this, I’ll just leave it on default. But I’ll go ahead and execute this. And it should take just a second to run, there’s my results. This was you know running across like 20 million rows, not a huge table. Basically I’m looking at flight data information I wanna look at canceled flights that matched a specific set of criteria that I said you know, within a certain month range with certain cancellation code, and that were actually canceled flights and I wanna group them by carrier. So once this is done, now I wanna say, hey, you know, how much did this actually ran in Photon? And I wanna see that. So one of the things that we love about, you know, this new SQL analytics product, is that we give you a super detailed history of everything that’s happened within this particular product. So you can see, I can group this by endpoint. if I want to, I can, you know, look at the last 14 days. I can look at, you know, whatever interval here that I want. I can look at specific status, you know, very, very easy to use super, you know. So if we click on this, one of the things here you see. This was the query that we just ran. And, one of the things that I’ll mention here is, it tells us okay the duration it took one second, but hey, I wanna look at kind of more information about this. One of the things that I can do is, I can look at the execution details. sure that this way. Here, you can see that it gives us how much of this ran everywhere and what I was doing, and actually how much you know, bytes it actually read. But what the piece that I love is, if you wanna see the Photon, if we run anything in photon, we actually explain right here, how much of this we actually ran in Photon. So you see that 95% of this query random Photon, is super fast, super easy to use, you know, this is you know super easy. And then from here, also that I’ll mention is, if I wanna get more information, I can actually open the spark UI natively, right there. So if I click on this, we’ve actually done the spark UI, we color code these. So you can actually see really quickly, what stages ran in Photon and which ones didn’t. And you can see here that like the scan piece, you know, we’re still using Spark, and then we give you, you know, really detailed scan information on this. And you can see, we give you timing so we tell you exactly how much timing we spent in each stage. And then of course, you can actually look at the plan. We provide that natively down here as well. So you can expand that and see exactly everything that’s ran, specifically in Photon. So I wanted to make sure that we provided really good debugging experience out of the box. And of course, this is all available to you. You know, if you’re using SQL analytics, you don’t pay anything extra for this. It’s just available and you can start testing and playing with this. So this is everything that I wanted to kinda show today. And then I think one of the things I’d like to do is if there’s questions, you know, we’d like to go over any questions that users have.
Prior to joining Databricks, Todd was part of Microsoft on the SQL Server Team working on the Engine and Clustering components. Todd was part of the of the product team at Couchbase and helped build the worlds fastest operational database. Most recently Todd was at Mesosphere helping to drive the application SDK and application scalability products.