Growing the Delta Ecosystem to Rust and Python with Delta-RS

May 27, 2021 12:10 PM (PT)


In this session we will introduce the delta-rs project, which is helping bring the power of Delta Lake outside of the Spark ecosystem. By providing a foundational Delta Lake library in Rust, delta-rs can enable native bindings in Python, Ruby, Golang, and more. We will review what functionality delta-rs supports in its current Rust and Python APIs and the upcoming roadmap.

We will also give an overview of one of the first projects to use it in production: kafka-delta-ingest, which builds on delta-rs to provide a high throughput service to bring data from Kafka into Delta Lake.

In this session watch:
R Tyler Croy, Director of Platform Engineering, Scribd

 

Transcript

R Tyler Croy: Hi, I’m R Tyler Croy, and today I’m going to be talking about what we’ve been doing to grow the Delta Lake ecosystem with Rust, Python and some more cool things. Before I get into that, I want to sort of share a little bit about who I am, why I’m here. I’ve been an open source contributor for a long time. Open source is incredibly important to me, it’s been a huge part of my career, and I’ve been in the Jenkins community, the Ruby community, Python community, and most recently the Rust community. And I’ve had a lot of fun in the Rust community, which is part of why we’re here to talk about delta-rs.
My day job isn’t to work on open source, unfortunately. My day job is I’m the director of platform engineering at Scribd, which is an organization composed of the core platform team, data engineering, and data operations. And together, these three teams support a fairly large data platform infrastructure for Scribd to really accomplish our mission of changing the way the world reads.
Before I get too far into talking about how we’re growing the Delta Lake ecosystem, I want to make sure that we do a little bit of a rehash on what Delta Lake actually is, which is important to know. There’s a lot of great content at Data and AI Summit around different aspects of Delta Lake, getting started with it.
But at sort of the most basic level, what you really need to know is that Delta Lake is really parquet files on one hand, married to a transaction log on the other hand. And those two concepts together allow some quite interesting and sophisticated behaviors on top of data. The parquet files give you obviously high-performance reads on data, and then the transaction log allows you to perform ACID-like transactions, almost like you’re working with Postgres, for example.
That well-defined transaction log allows us to do some really interesting things in terms of sequencing transactions, writes, reads, updates, et cetera. And it’s very, very cool, but it’s also pretty simple. If you go to this link that I’ve got at the top of the deck, this PROTOCOL.md, it’s the open specification for what Delta Lake actually is. You could probably read through it and understand it in less than an hour, it’s pretty straightforward.
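To make that concrete, here is a minimal sketch, in Python with a made-up table path and simplified field names, of what a Delta table looks like on disk and how you could peek at the raw transaction log yourself:

# A Delta table is just a directory of parquet data files plus a _delta_log/
# directory of ordered, newline-delimited JSON transaction entries, e.g.:
#
#   my_table/
#     _delta_log/00000000000000000000.json
#     part-00000-xxxx.snappy.parquet
#
import json
from pathlib import Path

for entry in sorted(Path("my_table/_delta_log").glob("*.json")):
    for line in entry.read_text().splitlines():
        action = json.loads(line)
        # Each line is one action: "protocol", "metaData", "add", "remove", "commitInfo", ...
        print(entry.name, list(action.keys()))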
And Delta Lake really doesn’t have a lot of magic to trip you up or cause problems for you. That leads me to delta-rs; the rs, in case you’re not familiar, stands for Rust. The delta-rs library was started in mid-2020, and there’s a number of reasons why it was actually needed. At Scribd, we originally proposed the delta-rs project because not everything that we do needs a Spark cluster.
At the time, and this has since changed, you needed Spark in some form, whether that’s a single-node cluster or a 100-node cluster, to read Delta tables or to do any of these sorts of transactional semantics. And for a lot of the things that we do at Scribd, we didn’t have a need for an entire Spark cluster to read a bit of data from a Delta table.
I should also point out that for a lot of people, getting data into Delta Lake is an ongoing and important challenge. For us, we ingest a tremendous amount of data from Kafka, from MySQL databases, from external sources, where the full power of Spark might not really be bringing anything to the table. In some use cases we need to bring data as quickly as we can from wherever it’s being produced, and write that into a Delta table so that downstream streaming jobs or batch jobs can actually process that data.
And Spark really wasn’t helping us all that much on the ingestion side. And then there are these hybrid workloads. I have a little bit of difficulty describing these workloads, because we have some use cases where a background job, say in Sidekiq in the Ruby ecosystem, needs a little bit of data from the data warehouse stored in Delta Lake, and a little bit of online data, like live data from a MySQL database.
And because we’re primarily a Ruby shop, we’re not going to bring Spark into our Ruby applications, that wouldn’t make sense. And we don’t really have a way to blend these two models of working with data at those boundary points between background batch processing of data in the data warehouse and online, application-side data processing.
And that’s where we thought that delta-rs was really needed. The link that I’ve got here at the top is a blog post that I wrote recently, which I would encourage you to take a look at. It explains in a little more detail the backstory behind delta-rs and what led to it.
Let’s talk about delta-rs. The goal of delta-rs is really to extend Delta Lake outside of the JVM ecosystem. I personally spend a significant amount of time in non-JVM languages. Of course, we’re using Spark, of course, we’re using other Scala and Java based platforms where necessary, but we have a lot that’s not on the JVM. And what delta-rs provides right now is a Rust library, and on top of that, we’ve actually built a Python binding as well to interact with Delta from a scripting interpreter or another non-JVM environment.
It right now supports your local file system, it’ll support S3, it’ll also support Azure Data Lake Storage Gen 2, I always screw that up. And it’s production ready, there’s actually people using delta-rs, the Python bindings in production right now, which I think is so cool. Unfortunately, Scribd right now is not yet using it in production, but we will be soon.
Some other things that are still sort of in progress: we’ve got this multi-writer support on S3. If you’re running on S3 right now, or in the AWS ecosystem, there’s one key primitive, I believe it’s put-if-absent, that is missing from AWS S3, which you’d need for truly atomic writes to a Delta transaction log on AWS S3.
And what we’ve been adding into delta-rs are some semantics around DynamoDB to allow a Rust writer to very safely work with a Delta table on top of S3, which is fundamentally eventually consistent. Azure Data Lake Storage Gen 2 doesn’t have that problem, so you actually don’t need that support on Azure Data Lake.
The other thing that we’ve been sort of working on, and where I would certainly love to see more contributions, is the Ruby bindings. Being built in Rust allows us to really grow into a lot of different areas, and Ruby is one of them, and the one that I’m actually most interested in. But that’s not the whole story of why we chose Rust.
Rust, I personally like, I’m certainly biased, but the design of Rust allows for the implementation of very fast and safe libraries. And when I say safe, I don’t just mean type safe, I mean concurrency safe. It’s actually fairly easy to build a high-performance, multithreaded, multiprocessor daemon in Rust, because of some of the inherent design of the Rust language.
When we went to look at what we were going to build for interacting with Delta Lake, where we would need that sort of high-concurrency, high-throughput support, Rust was an obvious choice. The other thing I don’t think is appreciated well enough about Rust is that it’s very easy to embed.
Anywhere you can do the sort of C-to-my-scripting-language-of-choice interop, Rust can fit in there with relative ease. That’s because there’s not a runtime in Rust, there’s no garbage collector, there’s none of that stuff that you would find in Node, Python, Golang, or some of these other languages that have a runtime to help you manage your memory. Because it’s so easy to embed, that also means that we can provide bindings with relative ease in Python, which already exist, and the Python bindings are fantastic, I can’t recommend them enough.
Ruby, which is in its infancy, Golang, or Node, or anything else that has a C-level interop, the Rust bindings can really work well there. And all of these things put together, Rust is also just a lot of fun. There’s a reason that there’s a lot of hype around Rust, I think it’s a really fun language to work in. And I recommend checking it out.
If you want to learn, I learned Rust, I’d say, the hard way; it took me a while to really figure out Rust. But "the book," as it’s sometimes referred to, The Rust Programming Language book, which you see on your screen, has helped me learn Rust and has helped a couple of other individuals within Scribd also come up to speed on Rust, and I can’t recommend it enough.
Delta-rs, it’s a lot of things. There are a lot of pieces to it, but from a code standpoint, it’s actually not that big. At the base layer, we have some Rust-based dependencies, like our underlying async/await runtime, Tokio, which allows us to build high-concurrency applications pretty easily. And then on top of that, we’ve got the parquet and arrow crates, which allow us to do the sort of data management, or data interop, that we need to do when we’re loading data from Delta Lake.
Both the parquet and arrow crates actually weren’t, I would say, ready in mid-2020 when we started to work on this. And so Scribd has been working with one of the Arrow committers, a guy named Neville, who actually brought writer support into the parquet crate, and has done countless improvements to the arrow crate to make sure that it would support delta-rs’s use case.
And then of course, to support the object stores that we need to interact with, we’ve got rusoto for interacting with S3, and then the Azure SDK to make sure that we can interact with those data stores as well. And then you see right in the middle, there’s this thick line, that’s the deltalake crate. You can do a cargo install deltalake of that if you want to, we provide some simple tools out of the box, or you can just add that into your Cargo.toml.
But that layer is the layer upon which all of the native bindings into, say, Python, or Ruby, or Golang could be implemented. And there’s a couple more little bits of Rust: there’s this really cool library called Maturin, I’m not really sure how to say it, and then PyO3, that make it really easy, really, really easy. I can’t underscore that enough, I was shocked at how easy it was to create a Python library that was dependent on Rust code, thanks to those two tools.
And so you can also get Delta Lake for Python: if you just go do a pip install deltalake right now, you’ll get the Python bindings, and you can start reading Delta Lake tables from your Python interpreter right away, which is super cool. Let’s look at some of the code that’s actually using some of this delta-rs code, just a little snippet of Rust code; this is the delta-inspect binary that comes with that cargo install deltalake.
All it does is allow you to inspect some metadata around a Delta table. In this example, all we’re doing is just opening the table and then printing it. It’s fairly simple to work with Delta tables within delta-rs; where things get a little bit more interesting, and you will have to make some decisions around what library or tools you want to use, is when it comes to querying those tables.
Because Delta Lake is a transaction log plus parquet files, we do have to bring some other tools into the mix to make sure that once we have identified the parquet files that we want to load and query from, something is able to query those. Let’s go look at Python, because I think Python’s a little bit easier to understand, and it’s a little bit, I’d say, clearer for us to use as an example in slide format.
This is a very, very short program at the bottom, right? This program is just opening up a table, printing some metadata, and then generating a pandas data frame. Pandas is a data processing library in Python, it’s also a really, really wonderful library, I can’t recommend it highly enough. But in this case, my audit logs Delta table just has one record in it.
We’ve got one file, which you can see listed there, this blah, blah, blah .snappy.parquet. That’s one of the parquet files that was saved to the Delta table, and then when I load and query it, which is just the printing of the data frame you see there, it just dumps this one record, which is 11 columns, et cetera.
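A program along those lines looks roughly like this; it’s a minimal sketch using the deltalake Python package, and the ./audit_logs path is just a stand-in for wherever your table lives:

from deltalake import DeltaTable

dt = DeltaTable("./audit_logs")  # a local path here; object-store URIs work as well
print(dt.metadata())             # table metadata pulled from the transaction log
print(dt.files())                # parquet files that make up the current table version

# Load the parquet files into an Arrow table, then convert to a pandas DataFrame
df = dt.to_pyarrow_table().to_pandas()
print(df)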
This is a little bit of a contrived example, so let’s look at a real demo using Python and pandas. In this demo, I’m going to import the Delta Lake Python package in my Python interpreter. And once I’ve got my DeltaTable object, I’m going to go ahead and load a Delta table up and give myself a dt object.
And for this data set, I’m just using a test table basically. And one thing I should point out if you’re not familiar with using the Python interpreter, is that there’s a help function, which you can use to look at the Python documentation for any object, any variable anywhere. Now I’m going to look at the files, and there’s only two files in this table. It’s pretty straightforward.
And because I have pandas installed, I’m going to go ahead and create a pandas data frame. And this takes a second because it’s got to load those two files into memory. And then I’m going to look at the help just to make sure that this is what I think it is. This is a data frame, and I can start to query it, which is really cool. Let’s go ahead and query. In this query, this is a pandas query, I’m just looking for rows that have an even number in a field.
I’m going to query for a true field, and the benefit of doing this in your terminal is that you can catch these query errors quickly; in this case, Python’s True is a capital T. And then you can start using that. And all of this sort of comes together to be this really useful means of experimenting with a Delta table right there from your local machine.
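Continuing from the dt object in that demo, the pandas side looks something like the following; the column names here are hypothetical, the point is just that once you have the DataFrame, it’s ordinary pandas from there on:

df = dt.to_pyarrow_table().to_pandas()

# Rows where a numeric column holds an even number
even_rows = df[df["some_id"] % 2 == 0]

# Rows where a boolean column is true; note the capital T on Python's True
flagged = df[df["some_flag"] == True]
print(even_rows)
print(flagged)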
All right, that’s kind of, in a nutshell, using Delta Lake from Python, which to me is super cool. I really enjoy working with Delta tables right from my Python interpreter, and doing some sort of local data science, if you will, before shipping something into production. What you can do right now with delta-rs and the Delta Lake Python package, as I mentioned before, is access Delta tables, whether they’re stored in AWS S3, Azure Data Lake Storage Gen 2, or your local file system.
In order to access those, you will need to make sure that the right environment variables are set, AWS_ACCESS_KEY_ID, et cetera, et cetera. But once you have a Delta table loaded, you can actually read by the partitions, you can look at checkpoints, and with the Rust binding right now, you can read the streaming updates to the table, which is pretty cool.
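As a rough sketch of what that setup looks like for S3 (the environment variables below are the standard AWS ones, and the bucket and table path are placeholders):

import os
from deltalake import DeltaTable

# Credentials are picked up from the environment by the underlying Rust library
os.environ["AWS_ACCESS_KEY_ID"] = "..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

dt = DeltaTable("s3://my-bucket/path/to/table")
print(dt.version())  # current version of the table according to the transaction log
print(dt.files())    # the parquet files backing that version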
And we also have some preliminary support for actually writing to the transaction log. Now, writing to the transaction log is still fairly early, and so what that means is you can’t quite write parquet directly through the delta-rs library, or the Delta Lake Python binding, just yet. If you have a parquet file that’s been written somewhere else, by something else, or by another piece of code, you can then make the transaction log entry using delta-rs to update the Delta table itself.
Right now, delta-rs also can’t really create checkpoints. This doesn’t prevent the table from being read, and it doesn’t create invalid Delta Lake tables, it just has a little bit of a performance impact. And if you take a look at that PROTOCOL.md that I referenced earlier, you can read more about what checkpoints are, and how they improve performance in a Delta Lake table.
We also can’t optimize right now, and optimize typically means two things. There’s a command that you can use in the Delta Lake runtime to basically bin-pack a number of parquet files. What that means is: say over the course of five minutes I created, let’s say, 100 files, and they’re all fairly small.
An optimize command can come through and take those 100 files, and turn them into one or two files, depending on what size is appropriate, so that your query performance doesn’t suffer because whatever’s querying needs to access 100 different files; it can just access those two or three and load those into memory.
There’s another part of optimize that we’re not likely going to be supporting anytime soon. It is a fairly challenging thing to do and does require some compute resources, and that’s the Z-order optimization, because that requires actually processing a fair bit of data to appropriately optimize it.
All of that said, delta-rs is there, it’s ready for you to start using. If you’re in Python, you should just do a pip install deltalake and pandas, so that you can actually work with the data like I showed in my demo. And I also wanted to recommend this book, the Python for Data Analysis book, which should give you a good introduction and overview of using pandas to do data analysis in Python.
If you’re, like me, using Rust, you can just add the deltalake dependency to your Cargo.toml, and of course, I’d highly recommend the Rust book if you’re interested in getting into Rust. And I also wanted to mention that both of these titles are also available on scribd.com. That’s kind of what we do: make really interesting content like this available.
If you dig around on scribd.com, you’ll also find a lot of really interesting documents, or academic papers, around both Delta Lake, but also data analysis and data processing. With that said, let’s talk about what’s coming up next. There’s some work that’s in very active development right now in a project called Kafka Delta Ingest; the link is at the top of the screen.
This is something that Scribd is contributing quite a bit to, and of course, just like Delta Lake, it’s open source. And the goal for Kafka Delta Ingest is to make sure that we can bring data from a Kafka topic into Delta Lake tables as quickly and effectively as possible. It’s still a little bit early, we’ve gotten some good single-node support working right now, and this is heavily dependent on delta-rs.
And so to a certain extent, delta-rs and Kafka Delta Ingest have been developed sort of side-by-side. And if you ever find yourself developing a library, I can’t recommend this sort of setup enough, because what happens is we find new and interesting things, or use cases, in Kafka Delta Ingest that we then need to, or want to, bring into delta-rs, to make delta-rs a better library to underpin multiple different use cases, not just Kafka Delta Ingest.
Kafka Delta Ingest, I should also be very clear, is not doing any streaming transformations, nor are we planning to do so. Spark streaming is very good at stream transformation and processing heaps and heaps of data on the fly as it streams through in batches. What Kafka Delta Ingest is really aiming to do is to get data from one file descriptor, that’s got Kafka data on it, to another file descriptor, writing that data into AWS S3, for example.
There’s a number of different problems that we’re trying to solve with Kafka Delta Ingest, and if you go to the GitHub link above, you can actually look at the design document there as well, to get a better sense of this. But I wanted to highlight the design here: Spark is, I would say, challenging to run for variable-throughput topics, and at Scribd we have a lot of variable-throughput topics.
The data that’s coming in might be event data from scribd.com, where the traffic is going to ebb and flow throughout the day. And what that means, if we’re going to allocate Spark streaming infrastructure to handle that, is that we either have way over-provisioned infrastructure supporting it, or we run into cases where the infrastructure might be under-provisioned, the cluster might not be big enough to handle the load coming in, and we get alerts and pages from that.
Kafka Delta Ingest, in contrast, is designed upfront to be auto-scalable, which has posed some interesting challenges for making sure that we get the design correct to allow for multiple delta-rs based writers to be working in concert together, and making sure that we’re getting data flowing correctly and properly into the Delta table.
The Delta community is young, I would say, it’s been around less than a year. We’ve seen a tremendous number of just absolutely fantastic contributions. Of course, you can find us in the delta-rs, or the Kafka Delta Ingest Slack channels in the Delta community Slack. But I wanted to call out some of these really great contributions that have already been made since we originally open-sourced delta-rs.
Florian, for example, made the Python binding what it is. It wouldn’t be here without his work to make Delta Lake and Python really fantastic. And then another fellow named Ben added the Azure support. Scribd is an AWS customer, so I’m quite tickled that someone came in very early and added Azure support, and along the way actually improved the asynchronous nature of delta-rs underneath.
We also have Misha; I’m not going to try to pronounce his full name, I’m comfortable with Misha. Misha added really good, safe concurrent writer support into delta-rs, and there are some pull requests still going on around that to make sure that AWS-based clients can really trust the writes going into their Delta transactions.
And then finally, without Neville, we really wouldn’t actually be able to write parquet from Rust. And he’s made other improvements to arrow that have made it really, really a key component of what makes delta-rs actually work. I encourage you to check out delta-rs, the GitHub link is there. Pip install deltalake, or cargo install deltalake, either will work fine.
If you’re interested in contributing, I would love to see Golang bindings, or Node bindings so that we can grow Delta Lake outside of the JVM ecosystem, outside of the Python ecosystem, and really spread it as far as we can. But with that said, thank you for your time.
If you want to learn more about what Scribd is doing with Rust or with Delta Lake, or even some of the other things that we’ve got going on, check out tech.scribd.com, which is our blog. You can also find me on GitHub, or via email if you’ve got any follow-up questions. Thanks.

R Tyler Croy

R. Tyler Croy leads the Platform Engineering organization at Scribd and has been an open source developer for over 14 years. His open source work has been in the FreeBSD, Python, Ruby, Puppet, Jenkins...