Low-Latency & High-Concurrency Analytics Over Data Lakes

May 27, 2021 03:50 PM (PT)

With data lakes increasingly enabling a broader collection of processed and unprocessed data sets, delivering low-latency & highly concurrent analytics is not for the faint-hearted.

Learn how Firebolt utilizes recent advancements in lake-scale optimized approaches to storage and indexing, to achieve new levels of efficiency for analytics over data volumes that were once impossible to handle.

The session will include real-world practical examples and a live demo.

In this session watch:
Boaz Farkash, Chief Product Officer, Firebolt


Transcript

Boaz Farkash: Hi everybody. How are you? Welcome to this session, I hope you’re enjoying the summit so far. My name is Boaz Farkash. I’m the CPO at Firebolt, super excited to talk to you today about low-latency and high-concurrency analytics over data lakes. We’re going to talk about exciting new technologies that can really take us to the next level when it comes to low-latency analytics. So let’s walk over our agenda real quick. We’re going to start with a really, really short intro, tiny, tiny intro, I promise, about who we are at Firebolt, who I am and what makes us positioned to talk about low-latency analytics to begin with. Then we’re going to talk about why low-latency analytics matters and why it actually matters more than it ever did. We’re going to talk about the challenges of achieving it today. Why is it so hard to achieve low-latency analytics at data lake scale?
And finally, we’ll get into it and talk about modern techniques to achieve it. We’re going to talk about sparse indexing in particular quite a bit: about hyper-pruning through sparse indexing and combined storage and compute optimizations, and also about an exciting topic called aggregating indexes. And I’m also going to show you all of this in action, so hold tight, we’re getting started. So let’s start with that super short intro about myself and Firebolt. I’m Boaz, the Chief Product Officer at Firebolt. I’ve been in the analytics space for over a decade and have always been drawn to high-performance databases. At Firebolt our core DNA really comes from that [inaudible] high-performance databases, and many of us have spent many years in that space. We are a new cloud data warehouse vendor. You often see us competing where you’d see Redshift, Snowflake, BigQuery, Athena, Presto, and so forth.
And we came into the market recently because even though the cloud has seen amazing advancements when it comes to data in the last five to seven years, we all know it’s still super painful to achieve high-performance analytics at these kinds of scales. And just being on one of the modern technologies doesn’t mean that our life is easy. So we’re here to fill the gap and really help the world get to that low-latency analytics. And we came in with technology that is all about handling more data much faster, while keeping things super easy to use and cost-effective. The differentiators that make us unique are, first of all, speed. We’re the fastest cloud data warehouse out there, and not by a small margin but by an order of magnitude, and we’ll talk in this session about the concepts behind the technologies that power that.
Secondly, scale. We’re built on a decoupled storage and compute architecture, which enables super flexible elasticity, the kind of elasticity we all love and expect from modern data platforms. Over one copy of the data, you can mix and match as many compute clusters as you want, each with a different size, shared with different users and so forth. And finally efficiency. For us, being efficient means not just being fast and scalable, but knowing how to do that without being super hungry for compute. In the cloud we typically pay per use, per compute resource, so being able to deliver fast, scalable analytics without needing huge clusters is where the cost savings end up coming from. And this is what oftentimes unlocks use cases that were really out of reach before, because they would require huge, huge resources. So that’s it. Let’s get to it now and start talking about low-latency analytics.
So what’s the big deal? Yes, all of us love faster queries, but why does low-latency analytics matter even more than it ever did? First, our data volumes keep growing, and actually grow at the fastest pace they ever have. And as long as they keep growing, our wait times for dashboards, reports, analytics and so forth just keep getting longer and longer. As long as we keep relying on the same paradigms and the same technology constraints that we’re used to in this day and age in the cloud, that essentially promises us that things will keep getting slower, because our data is getting bigger. Nobody can stop that, but the technology stays the same. We need that next leap, otherwise our wait times will keep getting longer and longer. Second, our demands from data have changed. The business requires us to do much more with data than we’ve delivered so far.
Traditional BI and analytics is still a must, but it has also become commoditized to some extent, and businesses want more and more to do bigger things with data: deliver data applications that are customer facing, oftentimes operational. More businesses than ever rely on data at the heart of their business model; for many businesses the data is the business. Being able to deliver data applications that give users interactive experiences that are fast and meaningful is becoming a bigger differentiator for more and more companies. Whereas an analyst running a dashboard on their desktop and waiting 10 seconds is something nobody loves but many people aren’t necessarily willing to spend money to solve, in a customer-facing or operational experience waiting 10 seconds or 40 seconds is really unacceptable. And this is why low latency really matters much more for these use cases. And finally, efficiency and cost.
So as our data volumes keep growing, as our queries get slower and slower, and as we pay per resources used in the cloud cost models, things are really getting out of hand when it comes to cost. Longer queries mean more compute time, which means more money. Longer queries also mean that concurrency typically becomes a bigger issue, because when the CPU is busy for longer it’s very hard to share resources. And that’s where you suddenly see the need for auto-scaling and more clusters to deliver concurrency, and it’s really expensive. So there’s real potential in the idea of reducing query times without needing more hardware, because that will support higher concurrency and shorter analysis times.
So I hope that by now we all agree that low-latency analytics is super interesting and important. Let’s talk about why it has been so difficult, especially in the cloud, in data lake environments, and at data lake scale. First, let’s talk about the lowest storage tier. The cloud has introduced amazingly simple, easy to use, scalable and robust ways to deal with data. In AWS terms, let’s talk about S3. It allows us to really [inaudible], not worry about it. It’s reliable, it’s scalable. It’s a wonder. So we can store those terabytes and petabytes of data without much effort, which is great. But on the other hand it’s also a very slow storage tier. So query engines in the cloud essentially rely a lot on caching and things like that. But at the end of the day, you will always run into queries that miss the cache and have to go all the way down to S3 to scan data.
And this is a huge bottleneck because S3 is very slow, so relying on it to deliver analytics is very tough when we expect low-latency analytics. The second issue is that even the faster storage tiers, when we’re talking about data lake scale, can’t contain all of the data that we need ready for fast analysis. In the cloud, most solutions rely on SSDs as the cache within the clusters of the query engine, to be the storage tier that is faster than the underlying slow storage tier, but it can never keep up, meaning it can only contain part of the data. And with that, it often requires us to move data back and forth from S3 to the SSD, and the network is slow as well. So that’s another bottleneck. And I’m not even talking about in-memory, which is even faster but completely incapable of storing the data volumes that we need at data lake scale.
So if we had that magic era where in-memory was [inaudible] and really enabled us to do fast analytics in an agile way, today, when we talk about terabyte scale, in-memory doesn’t cut it anymore. And even SSDs can’t contain hundreds of terabytes as easily as we’d want them to. And finally, let’s talk about SSDs. Even when we put data on the SSDs, and let’s say they contain as much data as we need for analysis, it turns out that even SSDs, and the way we scan data on them, aren’t fast enough for low-latency analytics. So there’s a bunch of things that we need to reconsider in how we work with storage and compute over data lakes to achieve low-latency analytics. So let’s talk about how. We’re going to start by talking about the concept of sparse indexing, but before that, a little bit about what it will help us achieve.
So given all the challenges that I mentioned, let’s talk about the objectives we want to achieve with a different approach to the underlying technology of a modern query engine. First of all, we really want to reduce data scans. Scanning data in S3 is horribly slow, and even scanning data in the SSD is not fast enough. Scanning data is at the heart of every query engine bottleneck, so if we find a way to dramatically reduce data scans, that’ll be amazing. Another bottleneck we want to overcome is the network. We want to move less data from S3 to SSD; instead of moving huge partitions or huge chunks of data just so that they can then be scanned in the SSD, maybe there are smarter ways to handle that. So the solution that we propose is based on sparse indexing: we want to store [inaudible] sorted and sparsely indexed, and we want to scan and move small data ranges and not big partitions or segments.
So let’s dive deeper into this subject. For starters, what is sparse indexing to begin with? Sparse indexes aren’t a new thing. They’re a well-known concept that we now have the opportunity to readapt for the data lake in the cloud. Sparse indexes are indexes that do not keep pointers to the entire data set like a traditional index, but rather only to ranges of values. So they are smaller in size compared to traditional indexes, because they only have sparse entries. Another thing that is important is that they rely on sorted data, data that is sorted by primary key. So the keys in the sparse index point to the beginning of a sorted data range. And sparse indexes are therefore not unique; tables can have multiple entries with the same sparse index key. But the benefits are, first of all, their size.
So because they’re sparse, they can point to huge data volumes, which we typically have in data lake environments, while remaining relatively small in size. So typically you can contain all of them in your RAM while still pointing to huge datasets, unlike indexes that contain every row, which are typically way, way, way too big to be used in RAM. Another benefit is that they’re very efficient in helping us prune data. When we want to access the data to do data scans, if we rely on sparse indexing, we can essentially find the ranges of data that we need in a very surgical and pinpointed way. Instead of scanning a lot of data to understand which portions of it we need, we can rely on the index to point us to a very accurate range which contains what we need, and this is what I call hyper-pruning.
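To make the idea concrete, here is a rough sketch of how a sparse index over sorted data prunes a scan. The table name (ltv), the columns, the key values and the range sizes are all made up for illustration; they are not taken from the demo.

  -- Illustrative only: assume the ltv table is physically sorted by
  -- (event_date, media_source). A sparse index keeps one entry per data
  -- range, not per row, for example:
  --
  --   ('2021-04-01', 'source_01')  ->  rows 0      .. 8,191
  --   ('2021-04-01', 'source_17')  ->  rows 8,192  .. 16,383
  --   ('2021-04-02', 'source_03')  ->  rows 16,384 .. 24,575
  --   ...
  --
  -- A filtered query like the one below only needs to read the few ranges
  -- whose index keys overlap the predicate, instead of scanning the table:
  SELECT count(*)
  FROM ltv
  WHERE event_date = '2021-04-02'
    AND media_source = 'source_03';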
So it helps us really, really reduce data scans by pruning with the indexes. Now that we understand what sparse indexes are, let’s see how that all comes together in the cloud and how we adapt sparse indexes for an experience where users don’t really need to worry about them too much, but get low-latency analytics in return. First, sparse indexes, like we said, rely on data being sorted and then sparsely indexed. We also compress the data after we sort it. Compression is [inaudible] a topic of discussion in its own right, but obviously it’s super important for cloud query engines; with these data volumes, being able to compress efficiently makes a huge difference. And the nice thing is, once data is sorted, compression becomes even more efficient. So the compression ratio we see is actually better than Parquet. But we start by abstracting the sorting, compressing and indexing of the data into a dedicated file format.
In Firebolt we call this file format Triple F, written as F3. So essentially when we ingest data into Firebolt, through that process we sort the data, compress the data, index the data and store it in that Triple F format.
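As a rough sketch of what that ingestion flow can look like in practice: an external table exposes the raw files in the lake, and inserting from it into a Firebolt fact table is what triggers the sorting, compression and indexing into F3. The bucket path, table and column names are hypothetical, and the exact external-table syntax may differ from the version shown in the session.

  -- Hypothetical sketch: an external table over raw Parquet files in S3.
  CREATE EXTERNAL TABLE ex_ltv (
    event_date   DATE,
    media_source TEXT,
    app_id       TEXT,
    clicks       BIGINT,
    impressions  BIGINT
  )
  URL = 's3://my-data-lake/ltv/'
  OBJECT_PATTERN = '*.parquet'
  TYPE = (PARQUET);

  -- Ingesting into a Firebolt fact table (see the CREATE TABLE sketch later
  -- in the demo) sorts, compresses and indexes the data into the F3 format.
  INSERT INTO ltv
  SELECT * FROM ex_ltv;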
So after the data is stored in Triple F format, it’s all about optimizing data access across the storage tiers, through the sparse indexing and the file format. So let’s talk about how that works. If we look at this diagram here, you can see at the bottom that that’s where data is persisted, to S3 on AWS, in Triple F format. Now, the data is stored in what we call segments. Segments are like files. Each segment contains a partial index and ranges of data. So a segment is a file that contains ranges of data, which the sparse index points to. Then the compute layer, that’s where your cluster is, that’s where the RAM and the SSDs are. And when a query needs to fetch data from S3, it uses the index not to fetch an entire file, an entire segment, but rather only the very accurate, very specific ranges of data that it needs from within the segments.
And this is achieved by using the sparse index. Another thing that is important here is that there’s storage tier awareness. Meaning when we create a physical query plan, the index knows which ranges to pull, but it also knows whether it has to go to S3 or to the SSD, where data at the end of the day is stored a little bit differently, because in S3 it’s in segments and in the SSD it’s only in ranges. Another interesting piece is that when we talk about the compute clusters, imagine that these ranges are split across multiple nodes in a cluster. We also have to take into account which ranges we put next to each other. So not all nodes contain the same ranges, but being able to co-locate ranges that are more often used together than others is another piece that’s very important in speeding up the queries.
And like I said, only the relevant ranges that are part of the query that we run are moved from S3 to SSD, avoiding big data movements and reducing network bottlenecks and [inaudible] bottlenecks. For comparison, most query engines out there copy from S3 to SSD either entire partitions or micro-partitions, which are also big. We, with this approach, can copy very [inaudible] ranges, which are dramatically smaller and dramatically reduce data scans. And finally, another very important piece is how we take care of those files behind the scenes. At the end of the day, if you’ve done vacuuming and merging of files before, you know we really don’t want you to deal with that. So there’s also a background process behind the scenes that continuously merges new data chunks that enter the file format into bigger files and so forth. So there are continuous optimizations of the file format behind the scenes, and you don’t have to deal with vacuuming and things of that sort.
So let’s see the performance of those sparse indexes in action. I’m going to show an example that runs in Firebolt. This is our user interface; you see the SQL editor. I’m using a database called [Ad Tech] DB V4, based on a real-world use case from an ad tech client of ours that allowed us to use their dataset, after it was [inaudible] of course. Let me do a select count from the main table that I’m going to use. Sorry about that.
Okay. Let me do a select count from the LTV table. So we’re looking at a data set that has 32 billion rows in it. I’m also going to show you how big it is. So, 32 billion rows. If you look at this LTV table here at the bottom, you can see that it’s 17 terabytes uncompressed, compressed down to one terabyte. And this is the query I’m going to show you, and how it’s affected by sparse indexing. So take a look at this query. We have a query that targets the LTV table, which is 32 billion rows. There’s a couple of left joins, it selects a variety of dimensions, does a variety of [inaudible] with case statements, calculates a variety of metrics, filters by date and media source, groups by, and so forth.
Okay, so this query is actually based on a [Looker] dashboard. This is a query that would typically be generated by Looker, and it shows the business how, in this ad tech use case, people interact with media. Let’s run this query for a second. This is a query that returns in 0.3 seconds, has data like country, region, platform, clicks, impressions, things like that. Now, typically different account managers change the media source that they want to analyze. I’ll show you that even when you change the values, the query will remain super fast. So let’s say I increase the date range to the 14th; it will still be 0.46 seconds. So nothing here is cached. Let’s do the 15th, still 0.51 seconds. Let’s do 30 days, and it will still be sub-second. And even if I change the media source, let’s put in a different media source here, I have values ready in advance. And these values are obfuscated, as you see, but really represent a real-world use case. Now let’s do this again.
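The exact query isn’t reproduced in the transcript. A simplified sketch of what such a Looker-generated dashboard query might look like is shown below; the joined dimension tables, column names, values and the ROAS-style metric are all invented for illustration.

  -- Hypothetical sketch of a Looker-style dashboard query over the LTV table.
  SELECT
    geo.country,
    apps.platform,
    CASE WHEN ltv.campaign_type = 'UA' THEN 'acquisition'
         ELSE 'retargeting' END                        AS campaign_bucket,
    SUM(ltv.clicks)                                    AS total_clicks,
    SUM(ltv.impressions)                               AS total_impressions,
    SUM(ltv.revenue) / NULLIF(SUM(ltv.cost), 0)        AS roas
  FROM ltv
  LEFT JOIN dim_apps apps ON apps.app_id = ltv.app_id
  LEFT JOIN dim_geo  geo  ON geo.country = ltv.country
  -- The filters on date and media source line up with the sparse index,
  -- which is what keeps the scan small:
  WHERE ltv.event_date BETWEEN '2021-04-01' AND '2021-04-14'
    AND ltv.media_source = 'source_42'
  GROUP BY 1, 2, 3;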
And as you can see, that’s super fast. So what we essentially see here is a query engine that relies on sparse indexing. Data is stored in F3, the Triple F format, and the query leverages that sparse index. Let’s see how we define the sparse index in the DDL. If I go to the LTV table, which we just used, okay, this is the create table of the LTV table. And after I list all the fields, you see that there’s a line here that says primary index. A primary index has been established on this table over the date field, the media source field and another app ID field. This means that the files in the LTV table have been physically sorted on disk using these three fields, and the sparse index leverages them when the query runs. When something is filtered on date and media source, this makes pruning on these fields dramatically faster.
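For readers who can’t see the screen, a sketch of what that DDL can look like follows. Only the idea of a three-field PRIMARY INDEX reflects what is described in the session; the column list is abridged and the names and types are illustrative.

  -- Hypothetical, abridged version of the LTV fact table DDL.
  CREATE FACT TABLE ltv (
    event_date   DATE,
    media_source TEXT,
    app_id       TEXT,
    country      TEXT,
    clicks       BIGINT,
    impressions  BIGINT,
    revenue      DOUBLE
    -- ...remaining columns omitted...
  ) PRIMARY INDEX event_date, media_source, app_id;
  -- The PRIMARY INDEX clause determines the physical sort order on disk,
  -- which is what the sparse index prunes on at query time.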
You can also see the amount of data that was scanned in the query when I ran it. So even though the data set is huge, we scan very little data, because this is what the sparse indexing allows us to do. Now, even though this is amazing and really speeds things up, these queries would have been at least 20, 30, even 40 seconds with solutions like [inaudible], Snowflake, Presto, and such. But even here, let’s make it a little bit more challenging, and this would be a good segue to talk about aggregating indexes. So I’m going to comment out the media source so we scan more data.
So now we’re scanning across these dates, but for every media source, and like we saw, it’s a big dataset. So now we’re already at seven seconds, and we ask ourselves, “Hmm, maybe there are other techniques to speed it up even further and also make this query sub-second.” Now, aggregating indexes are essentially a modern and cool new take on materialized views. So let’s remind ourselves what the story with materialized views is. It’s a very well-known, traditional method to accelerate analytics by pre-calculating and materializing common aggregations, but there are a bunch of challenges with them. First, typically they don’t support all of the aggregations that our queries need, only a subset of them. Second, they’re high maintenance and also high cost. You frequently need to rematerialize and recalculate them so that they stay in sync with the original data source, and rematerializing and recalculating takes compute and costs money.
Another annoyance is that users need to rewrite their queries to point to the materialized views instead of the original source tables, so users need to be materialized-view aware. And lastly, granularity. Once you create a materialized view, if you want to ask a question that relies on a different granularity than the one your materialized view was saved in, it can’t help you anymore. So you need to create materialized views with different granularities to support different business questions. Now, aggregating indexes are essentially, I like to call them, the talented cousin of the materialized view. You define them with the syntax example on the right, and we’ll see that in the product itself in a few seconds, but let’s first talk about the benefits. So first of all, they’re built for scale. Aggregating indexes in Firebolt are also stored using the Triple F format and sparse indexes, meaning the table itself behind the scenes is also sparsely indexed.
Meaning that even if you work with huge aggregating indexes, they leverage sparse indexing to be fast in how we pull data out of them and look up their entries. Second, there’s zero maintenance. If you look at the syntax on the right, you see that the index is created on the table. Once you define an aggregating index in Firebolt, it forever stays in sync with the data. So as you keep pumping data into your original table, the aggregating index will always stay in sync, with very little compute overhead. And it’s super flexible. Every aggregation is supported; essentially, everything you can put in a query you can put in an aggregating index, which is great. Even [inaudible] counts, which are often a problem for materialized views. And they’re super easy to use. Users don’t have to query the aggregating indexes; they keep querying the original tables.
The query engine behind the scenes can identify if a query can benefit from an aggregating index. And if it can, it puts it in the query plan. So for users, they don’t need to be aggregating index aware, they can keep working on their tables and behind the scenes, each table can have as many such indexes as you want, and they can kick in.
And they’re also much smarter than materialized views. Essentially, they don’t store just the exact [inaudible] and value, but actually a state of the aggregation. This enables us to also ask questions at a different granularity than the lowest granularity the aggregating index has, and still be able to leverage that index and get super fast results. And lastly, speed. Aggregating indexes are super fast. They typically turn any query into a sub-second query, even over huge scales and with very little hunger for CPU. So they’re typically even more efficient than what we started off with, thanks to the sparse indexing. So let’s see this in action for a second. Okay, let’s go to the DDL again. Remember the LTV table we talked about before, and the query we looked at. Now, I can take most things that are used in that query and create an aggregating index for it on the LTV table, with the dimensions that I need and the dimensions I expect to filter on. It can also have calculated fields, and I can have case statements and whatever is needed, just like my common dashboard queries, let’s say.
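As an illustration of what such an aggregating index definition might look like, here is a sketch mirroring the hypothetical dashboard query shown earlier. The grouping columns and aggregations are made up, and the exact Firebolt syntax may differ slightly from what appears in the demo.

  -- Hypothetical aggregating index on the LTV table.
  CREATE AGGREGATING INDEX ltv_agg_idx ON ltv (
    event_date,
    media_source,
    country,
    SUM(clicks),
    SUM(impressions),
    SUM(revenue),
    COUNT(DISTINCT app_id)
  );

Once the index exists, queries are written against ltv as before; as described in the session, the query engine decides on its own whether the index can serve a given query and puts it into the query plan.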
When I put such an index in place, I will go back to this query. Remember, we ran a query that took 7.4 seconds because it scanned 128 gigabytes of data after leveraging the sparse indexes, and we want to accelerate it even further. So this time I will actually enable aggregating indexes, which is the default; I had just disabled it beforehand so I could show you the performance with only the sparse indexing. So let’s run the set statement. Now aggregating indexes are enabled, and if I run the same query again, then you see, boom, sub-second, 0.95 seconds, super fast. And this is the beauty of aggregating indexes. Let me show you something else that’s cool.
Even if I change this to a cluster that is cheaper. Right now I’m using the $4-an-hour cluster, 36 cores, 72 gigabytes of RAM. Let’s switch to a cheaper cluster, $1.80 an hour. And I run the query: 1.62 seconds, not bad. So even though the cluster costs less than half as much, the query is less than twice as slow. So you see that even if you use very cheap hardware over huge datasets, using these indexes you can achieve amazing performance. Again, this would probably have taken a few dozen seconds with most of the query engines out there. Now, we talked about how modern data warehouses can leverage new techniques with indexing [inaudible] and optimizing for compute and storage together to really get to low-latency analytics. And what does that mean?
What does that mean for our work with data lakes? Data warehouses are essentially much more versatile than they used to be. Using a data warehouse that has storage and compute decoupled means that you don’t need to think about it in the traditional sense. It can oftentimes be a query engine for a particular feature in your product, for a new use case, and not just that monolithic, big, centralized traditional warehouse. So in regards to your lake, it really can be your lake’s best friend. Syncing data from the lake to the warehouse has become cheap and easy. And in-lake query engines like Presto and Athena, for example, cannot compete with the low latency delivered by a query engine that has optimized compute and storage together, through dedicated file formats, indexing coupled with them, et cetera. So if you want to get low-latency analytics over a data lake, remember: we love your data lake, and this enables dramatic acceleration opportunities that will help you deliver use cases that were really impossible to deliver before.
Thank you, I hope you enjoyed this. You can email me at boaz@firebolt.io if you have any questions, or just check out our website and contact us if you want to see this in action over your real data. Hope you enjoy the summit, there are a lot of interesting talks. Great meeting you. Thank you so much. May your indexes be sparse and aggregated.

Boaz Farkash

Boaz Farkash is a seasoned product leader with vast experience in data engineering and real-time analytics. Prior to Firebolt, Boaz was one of the first team members at Sisense, a software analytics u...