DataSource V2 and Cassandra – A Whole New World

Download Slides

Data Source V2 has arrived for the Spark Cassandra Connector, but what does this mean for you? Speed, Flexibility and Usability improvements abound and we’ll walk you through some of the biggest highlights and how you can take advantage of them today. Learn about such highlights as: Spark’s ability to understand Cassandra’s internal clustering, previously only available through the RDD api; Manipulating the Cassandra catalogue directly from Spark; and much more! Have Cassandra and Spark? Then this talk is for you!


Try Databricks

Video Transcript

– Welcome everybody to this virtual Spark Summit. And I’m pleased to talk to you today about DataSource V2 and Cassandra. So you might be asking yourself why is there a DataSource V2? There’s already a DataSource for Spark and Cassandra, and it connects the two of them and allows you to move data back and forth. But the question that was in everyone’s mind, at least a lot of developers’, was can we make it better than it currently is? And there was actually a lot of room for improvement. So for example if you’re familiar with the current Spark Cassandra Connector and how DataSource work with it, there’s two major points that I think could be addressed and improved. One is the ability to actually talk with the metadata inside of Cassandra, and the other is the ability to actually understand more about the underlying data within Cassandra, understand the partitioning, understand the sorting, and having Spark being able to take advantage of that while it’s running. And that is what DataSource V2 has brought to us. So we’re really excited to bring this to you. And just to tell you a little bit about myself before we get into the details, I’m Russell Spitzer.

I’m an engineer who lives in New Orleans. I’ve been working on Spark for a very long time and Cassandra also for a very long time. I actually started out as a test engineer at DataStax, and after a while, I moved on to the analytics team and the development side and worked on the Spark Cassandra Connector for a significant amount of time. So I’ve been into the weeds with Spark and Cassandra integration for a long time. And let me tell you that this new DataSource V2 work is some of the most exciting things to come to the Spark Cassandra connection in a very long time. So let’s get right into it and talk about the catalog itself and the things that it’s gonna bring. So previously DataSource V2 would have required us to know, or previously before DataSource V2, a user would have to know a lot of system details to actually get the connections working correct. And when it was done, Spark doesn’t know enough about that information that it’s pulled up to actually make Spark an intelligent optimization decisions. Now the reason that this was the case is that DataSource V1 was based on RDDs basically. It was basically a black box to the Catalyst optimization engine which meant after Catalyst told the DataSource a few things about the request that was about to be made, it was unable to actually understand what the underlying source did or how it was presenting data. And significantly enough, it had no connection to the underlying catalog information. So knowing whether or not tables existed or things like that was something that would fail much later in the process instead of being something that was intrinsic to how Spark worked with the data.

So in DataSource V2, we’re able to separate out a lot of different ideas, the ideas of catalogs and tables and readers and writers. All of these things get separated into different concepts that are able to be extended or interact with Catalyst at their own level which means suddenly we’re able to make a lot of different optimizations and things that we previously were unable to do. So let’s start with talking about how those made-up catalogs are gonna work with the Cassandra catalog. So the Cassandra catalog is gonna provide you with a lot of different features in DataSource V2.

Catalog Features

Now one of the coolest things is that you can have as many catalogs as you want, each of them representing a configuration of a Cassandra cluster somewhere. Once you’ve made that catalog, you basically are able to pick up all of the information that’s in that Cassandra cluster immediately and accurately. It’s going to completely take all of that information at runtime and learn about how the Cassandra cluster has been set up. Now if you’re familiar with some of the old ways that the DataStax Enterprise edition Spark Cassandra Connector handled this, it only had a thin wrapper around the metadata that was inside of Cassandra which meant that it could get outdated and you would have to manually refresh it if the underlying Cassandra tables change for things like that. In this new integration, it is directly connected to the Cassandra metadata which means that there’s no need for the end user to actually know that things have changed because the Cassandra cluster is going to automatically tell Spark and Spark will automatically know all of that. In addition, this catalog, since it is directly connected, can start creating and modifying tables inside of Cassandra from Spark SQL.

So how do we set up a catalog like this?

Catalog Architecture

Well, we need to basically create a reference inside of the Spark session which connects to a Java driver reference which will connect to our Cassandra cluster. And because the architecture is set up like this, every catalog can have its own independent configuration as well which means that each catalog can have different sets of driver properties. Again these things are going to be always updated because the Java driver is constantly exchanging information with the Cassandra cluster. If the Spark session asks how many tables are in this particular catalog? It will immediately go to the driver and the driver will check its own metadata references. So again everything will be always up-to-date and we’re able to change things directly. If we wanna set up one of these catalogs, the main thing we need to do is add the Spark Cassandra Connector to our classpath.

Catalog Setup

Now there’s information about how to do this more completely inside of the Spark Cassandra Connector documentation so I urge you to go there and look at the details there. But the easiest way to do that is to add –packages and then give the Maven coordinates for the Spark Cassandra Connector. This will pull down the Spark Cassandra Connector and all of the related dependencies and other drivers that are required to make it work and put them on both the driver classpath as well as on the executor classpath. Once you’ve done that which is probably the hardest thing, you really only have to set a single configuration parameter to get the rest of everything running. And that if you see here on the bottom left-hand corner, I have spark.sql.catalog.mycluster equal to com.datastax.spark.connector.datasource.CassandraCatalog. So in short, you set this Spark SQL catalog parameter to a name. So in this case I chose mycluster. And when I do that, I’m basically naming a new catalog for Spark to realize. So I’ve created a new catalog in this statement and I’ve told Spark that the underlying catalog library that it should be using is CassandraCatalog. Now I can set this parameter in a variety of different ways. If I feel like setting it in the Spark configuration, I can do that. If I would like to set it in the session configuration, I can do that like you see in the upper right-hand corner. And in addition, I can actually set these via Spark SQL itself just using a set command which means in the middle of a JDB session or in the middle of a Spark SQL session, I can live as I’m doing other work create new catalogs and connect new Cassandra clusters to the rest of my Spark ecosystem without doing any restarts or things like that. Once you set up a new catalog using one of these commands, you can actually change the properties which it uses to connect Cassandra by setting different properties equal to different values using that second SQL line that I’ve written in the lower left-hand corner, equals key will set the property value to key for the catalog mycluster. And I can do this as many times as I want, set as many different parameters as I want for the catalog. Now in this case all of the parameters that used to be available for any Spark Cassandra connection are available here. So any of those configurations that you’ve previously wanted to use, you can actually just add in as different catalog properties and they will be set for all of the references and tables accessed through that catalog. Now one more little configuration that you may wanna do is the spark.sql.defaultCatalog. So you can set that to any catalog that you’ve previously established. And when you do that, you remove the need to identify a database and table by the catalog name. So without that, if I wanted to access something within the mycluster connection, I would have to call mycluster.keyspace.table. But if I set the default catalog to my cluster, I no longer have to specify the cluster because every time the cluster is missing, it will automatically use that as the cluster. So this sets up our catalog. All we’ve done is added something to the classpath and possibly set two different configuration parameters. Once we’ve done that, we have direct access to all of our data inside of Cassandra.

Setting up and Using a Catalog A Basic Example

So in this example, I set up again the mycluster catalog to be a Cassandra catalog. And then in the next line I’ve set the default catalog to be mycluster so I don’t actually have to specify the cluster every time I do another command. In the first command, I’m gonna call show databases like created. Now you’ll notice this isn’t a valid SQL because what I’m doing here is not actually translating each line into SQL. It’s running Spark SQL against the catalog information that the Cassandra catalog has extracted from Cassandra. So because of that, I can do a whole lot of different things that are not valid SQL but are valid Spark SQL. So in this case I can do a show databases like. Now you’ll notice that there are no databases in Cassandra. Cassandra uses a different term called keyspaces. And unfortunately Spark also has two different ideas on what to call this level of naming, dataspaces and namespaces, and they’re gonna be used interchangeably and are both valid. So whenever you hear me say keyspace or you see the word database or you see the word namespace, we’re referring to the same concept inside of Cassandra which is the Cassandra keyspace. So in this case I was able to run this show databases command and instantly look up, oh, there’s actually a keyspace already made inside of this cluster called created_in_cassandra. Now that’s just the start, because not only can I look at the keyspaces, I can also look at all the tables that are inside of that keyspace. So to do that, I can just say show tables in created_in_cassandra. Again this is not SQL. I can write valid Spark SQL and it will just work against my Cassandra metadata. And you can see on the right that it did. I showed that inside of this created_in_cassandra keyspace, there’s a single table called dogs. Okay, that’s really great. And in addition, I can now access that data directly. Without having to register temporary tables or registering view, I can actually just call select * from, that keyspace, that table, and the data is immediately inside of Spark. So this is just the beginning.

Inspecting Table Metadata A basic example

We also have access to all of the metadata that exists within Cassandra inside of Spark now. So for example, the first command I’m showing here, we’re describing the namespace, which remember is a database or a keyspace, and we’re describing extended so we’ve got the properties of created_in_cassandra. Again this is a keyspace that exists only inside of the Cassandra cluster. Spark is just able to see it directly using the new catalog information. And you’ll notice that there’s this cool line down there, properties, and inside of properties, you’re gonna see a lot of things that if you’re a Cassandra developer, you’re familiar with. You see durable_writes, you see SimpleStrategy, you see the replication_factor, and that’s because Spark is now 100% aware of all of the Cassandra-specific properties of the keyspace. So all of that metadata that is Cassandra-specific is now available to you in Spark SQL as well. A similar thing happens when I describe a table that has been referenced in the Cassandra catalog. If I say describe table created_in_cassandra.dogs, we see all the columns, the correct data types, and then we see something very interesting. Down there at the bottom, there’s a note on partition. And what that’s showing you is that Spark has automatically discovered and is aware of the partitioning of this Cassandra table. This is something that was not possible previously. Spark is automatically aware of how the table is partitioned and is going to be able to use that for optimizations in the future. And I’m gonna talk about that a little bit later but let’s keep going and talking about the different ways we can actually work with this catalog. Now I was able to show you all of that partitioning information, but you also had access to all of the Cassandra-specific table properties as well. You’ll see, again, if you’re familiar with a Cassandra table, all of the properties you’re familiar with from SQL. You’ll see gc_grace_seconds, bloom_filter_false positive percentage, all of those things are now visible inside of Spark SQL. And this is really to show you that there’s this strong direct connection between the catalog and the metadata inside of Cassandra.

Setting up Multiple Catalogs for one Cluster

So how can I use this to my advantage? Well, one really cool thing you can do is set up multiple catalogs for one cluster with different configuration parameters. So, say, you wanna do a slow write to a cluster. Now you can have a catalog that has very fast writes enabled and you can have a catalog that has very slow writes enabled. So if you make these two catalogs, here I’ve made a catalog fast up at the top, and I’ve set its output throughput to 10 megabytes per second, so it’s able to write quite quickly. And then I’ve made another catalog called slow which has an output throughput of only .01 megabytes per second, so it writes extremely slowly. Now let’s say I want to copy this data from one table to another or to itself but I wanna do it very slowly. All I have to do is write using the slow catalog and read using the fast catalog, although I could read using the slow catalog as well. But basically when I do this in this insert statement, so this is the Spark SQL statement, insert into slow keyspace table, select * from fast keyspace table, I will end up using two different Cassandra connections and configurations. Each of these catalogs has their own setup and will be used in conjunction with each other to do this copy. So I’m able to do a much more complicated operation without a lot of complicated configuration. I really just have to describe how each of my catalogs treats its underlying tables, and that will get inherited by any operations I use with those tables. Now this is kind of a silly example, but instead if I wanted to move between multiple clusters, suddenly you can see how powerful this is.

Setting up Multiple Catalogs for Multiple Clusters

Say, I’ve got two different clusters, one I’ve got operating in Tokyo, one I have operating in Kyoto, and I wanna move data from one cluster to the other. Before I actually have written a couple different blog posts on how to do this. It required a little bit of configuration and setting up a lot of variables. And it would be very difficult to do from inside of Spark SQL all by itself. Now it’s five lines. All I have to do is set up one catalog for my Tokyo cluster and one catalog for my Kyoto cluster and then I can do a insert into one from the other. So in this case, I do that by just changing the host parameter of each of my catalogs. So I make a catalog A called clustera which connects to my Tokyo cluster by setting a connection host to 001, and then I make a Kyoto catalog by making the Cassandra connection host So under the hood inside of my Spark session, this makes two different catalogs and makes two different connection. Then when I do a command that inserts into one from the other, it automatically looks up which catalog it’s using and where to get the data from. So it makes it very easy to combine data from multiple clusters in multiple locations with multiple connection parameters. So this is really exciting to me. And one thing I wanted to highlight here is that you’ll notice that I’ve named my catalogs something completely different from the clusters. I just want to make it completely clear that there’s no need to have the names of your catalogs be related to the names of the cluster that you’re connecting to. They’re whatever you feel is easiest to understand for your own work.

So again I’ve shown a lot of reading and a lot of writing and how you can make all these different connection parameters, but on top of that, you can also make new keyspaces, make new tables, and you can do this all from inside of Spark SQL.

Create Tables – Create Keyspaces

So I’m showing two basic create statements here at the top. The first, create database if not exists, you’ll see it looks very similar to how you would make a keyspace inside of Cassandra but there’s some slight differences. Of course we don’t say keyspace and we have this dbproperties command, but basically we’re fitting the same idea into a Spark SQL creation statement. You’ll notice that I’ve specified my replication strategy, a SimpleStrategy, and I’ve done my replication_factor as one. And by doing this, I’m gonna actually make a database, a keyspace, in my underlying Cassandra cluster. So unlike, again, the DataStax Enterprise implementation before, this is going to actually make a new keyspace. It’s making new data inside of Cassandra even though you’re using SQL to do it. Now making a table is very similar. And you’ll see in the next create table statement, we can create a table within that new keyspace that we just made. So we made a new keyspace called created_in_spark. If we wanted to make a table in it called ages, we just say create table created_in_spark.ages. After we do that, we just specify what columns we have and what their types are. Then there’s the very important using Cassandra. So you might be used to with the previous implementation having to type out the whole org.apache.spark.sql.cassandra or whatever. We’ve now added in Spark 3.0, you don’t have to do that, you can just write Cassandra. And then underneath that, we know that inside of Cassandra there’s two required things most of the time. One is you need a partitioning key and sometimes you need a clustering key. Now you can set the partitioning key using the partitioned by syntax. Now this is native Spark SQL syntax so you can just write partitioned by and then all of the column names that you wanna be in the partition group. And then you can use the clustering key table property to specify the names of all of your clustering keys. Now again I will say look at the documentation for more details on the different ways you can set this up, but this is just the basic example of how you would do something like that. Now once I’ve made this new keyspace which I made only in Spark and this new table that I made only in Spark, this is a really common thing. So basically I had a table before where the partition key was name and the value was, the partition key was name and age was just the normal column. Now I’ve made a reverse table where the partition key is age and name is a clustering key, so basically a reverse lookup table. So those of you who are familiar with Cassandra are probably used to every once in a while having to do something like this where you have the exact same data but represented with a different partitioning so it’s easier to look certain things up. If I wanna populate my new table, I can just say insert into my age’s table select * from my dogs table. Since I know that these two tables have the exact same column names, it’ll actually do all of the mapping automatically and I end up making a completely new Cassandra table, Cassandra keyspace, all of this from inside of Spark SQL, and populating it all without ever using the Java driver once. So I’ve done all of this, and at the bottom I just wanted to do a quick depiction that on the left I’m using Spark SQL to read the data out of my new table, and on the right I’m using cqlsh to read the data out of my table just to be absolutely clear that I made a new table inside of Cassandra and that I’m able to access it in any way I want. So I can access it through Spark SQL or I can access it through the Java driver, no problem, and I can also access it through cqlsh.

So we have a huge amount of power here but it doesn’t even end there because I can continue to do all kinds of cool things in Spark SQL.

All Cassandra Table Options are Available Maximum Cool A

I can actually now use Spark SQL to, in addition, alter the properties of my keyspaces and my tables in Cassandra from Spark SQL which means all of those things like gc_grace_second, bloom_filter_false positive percentage, all of those table properties I can now modify directly from inside of Spark SQL. Same thing goes for keyspaces. If you would like to change the replication of your keyspace from your JDBC connection to Spark, you can now do that. So SQL now provides you with almost all of the same capabilities that you would have had within SQL but you can do them without ever having to use SQL. So to me this is a really great opportunity for a lot of new combinations of working with tables and working with Cassandra and reduces the number of tools you have to learn or know about. Now I’ve shown a lot of examples here and I just wanna say this is just scratching the surface. Using any kind of ncsql, all the things that you’re familiar with like create table as select are now possible. So any sorts of SQL things that you once do that previously would make a table, a brand new table, you can now do, and they will make brand new tables inside of Cassandra.

So again this is the huge change in the UI. You basically as a user are able to do all kinds of brave and exciting new things that you just couldn’t do before. Now that’s all great but we also on top of all of that are making it a ton of great new performance enhancements which are coming into the DataSource V2 implementation. So I’m gonna just highlight how that partitioning information that we talked about before is gonna provide you with a lot of new benefits. Now before like I said, we had this DataSource providing a RDD this far which means it didn’t have a lot of information about how the data was actually laid out inside of Cassandra which meant that it was able to do some optimizations by basically doing some filter pushdowns and some other things like that but it doesn’t know enough details about the data underneath to actually make smart Catalyst decisions, catalyst being the optimizer for Spark if you are not familiar. In the new DataSource V2 implementation, it’s a lot of different ideas that are separated into really easily extensible components. So for example instead of having this RDD, I have this notion of a Cassandra table, and the Cassandra table has a way of talking about how you read from it and how you write from it and different ways of interacting with those readers and writers, and Catalyst has ways of asking for information and saying, oh, how is your data partitioned or how many partitions are you expecting to make or how much data is expected to be here? And basically there’s a whole lot of great new extensible ways of adding more information to this implementation in the future which means we can always make this smarter and smarter without requiring the end user to actually know that all of a sudden the underlying implementation has great new ways of optimizing this. Now in particular in the initial release of DataSource V2, we can tell Spark all about how our underlying data is partitioned. So what kind of queries do this make about? Now if you remember that we just made a new keyspace inside of Spark from the Cassandra catalog and we were also able to make a new table inside of Spark from the Cassandra catalog, and in our new table, the partition key is a column called ages. Now if you’re familiar with how Cassandra lays out data, within a partition, there’s only one partition key value which means that if I’m interested in all of the distinct partition key values, I basically just have to walk through my Cassandra table and at every partition just look at what the key value is and then go to the next one which means I don’t have to compare data from multiple partitions to know if there are multiple instances of the same partition key value. Now you’ll see on the left here, this is the advantage of that. When I call a distinct, if I say select distinct age from my Spark ages table, Spark is now able to see that the partitioning has provided it with a shortcut for actually looking up all the distinct values, and there is no shuffle in this plan. If you saw the word exchange here, you would see an exchange here, an exchange would mean that it has to do a shuffle during this operation to actually figure out the answer, but since you don’t see that, it’s actually telling you that it’s gonna do a single scan through all of the data and it’s gonna come up with the answer, so that’s a huge benefit because now I don’t have to shuffle on top of scanning all of the data out of Cassandra. Now if we look at where this doesn’t come into play is imagine that I was then trying to look up all the distinct names. Now the way Cassandra distributes name or the way that Cassandra distributes clustering columns is just ordered after each partition column. So if you remember this ages table is partitioned on age and then has names as a clustering key. So after each age, we’ll basically have a list of names that will go off. And if we wanna find the distinct ones, we’re gonna have to actually compare multiple partitions. Now the reason for this is that if you look for example, all of the bob values are in a whole bunch of different partition, and all of the sue values are in a bunch of different partition. So the only way to accurately count which are distinct is to merge all this data together and determine which ones are actually unique. We can’t just go one partition at a time and say, oh, all the values there are unique. So this means that when Spark does this request, it’s going to have to do a shuffle in the middle of it, and that’s where you see that exchange then. So the big difference here in this DataSource V2 implementation is that in DataSource V1, both of these requests would equally require a shuffle, and now in DataSource V2, it’s able to look at the request, look at the underlying Cassandra table and automatically make the optimization knowing that that’s how the data has already been distributed. So you can take more advantage of the way you already laid your data out in Cassandra when you’re doing Spark SQL. And the best part is in both of these SQL requests, you’ll notice I’ve written both of the text up here, select distinct age, select distinct name, I didn’t say anything about how the data was partitioned because I didn’t have to. Spark is gonna automatically make all of these optimizations for me because of the way DataSource V2 is set up. So again you’re getting all of these optimizations for free under the hood and you don’t even have to know about it. So this is just the beginning. Like I was saying, these are all very extensible interfaces which means in the future we’re gonna be able to add even more optimizations that are gonna make all kinds of great things possible. For example we’re gonna introduce information about the clustering within those partitions, we’re gonna be able to add in actual partitioning functions, so different kinds of other things will be allowed much more optimally. Anyway this is just the beginning of the kind of performance enhancements you’re gonna be getting as long as you’re moving to this new API.

Now this is all DataSource V2 stuff, but I have another great announcement. You may not have known this, but DataStax Enterprise previously had a lot of proprietary features and Spark Cassandra Connector things that made a lot of common Cassandra things more efficient or faster or just frankly much better. And then now all of those options are available inside of the open source connector. We’ve taken all of those proprietary enhancements and moved them back into open source. And I just wanna do a quick overview of all of those because they are gonna be available as well with the new DataSource V2 code and without it as well if you’re interested in not moving to Spark 3.0. So let’s just go over a quick overview of how you set that up. Like before, you really just set one option and that’s spark.sql.extensions, you set it to CassandraSparkExtensions. Once you’ve set that, it will automatically load all of the new Catalyst rules and new functions.

When you’ve done that, you will start getting these great optimizations like this join optimization. So DirectJoin is something that we’ve talked about a lot. And if you’re familiar with the RDD API for Spark Cassandra Connector, this is basically doing an automatic JoinWithCassandra table which for those of you who aren’t familiar is basically taking advantage of the fact that Cassandra is very fast at looking up partition keys. It’s kind of slow doing full scans but it’s very fast at looking up individual keys. So if you’re looking up a set of partition keys out of your Cassandra cluster, say, you’re looking up one million partition keys out of a Cassandra cluster with two billion keys, it’s much faster to actually individually request one million row or one million partitions than it is to scan through two billion partitions and abstract out the one million that you actually want. Now this is an automatic optimization that will happen whenever Spark see that you’re requesting an amount of partition keys from a Cassandra cluster and you aren’t, and the amount of partition keys is much less than the estimated number of partitions in the Cassandra cluster. So it’s gonna automatically make this optimization whenever it’s most efficient. You’ll see that this is gonna come in place a lot when you’re doing for example a streaming workload where you have a streaming set where every once in a while 40,000 partition keys need to be looked up really rapidly. So if you’re joining between an incoming data flow and the Cassandra cluster, and you wanna join based on that partition key, this will automatically optimize that and make it really efficient. So this is really good for a lot of use cases especially when the number of keys you’re requesting is more or less static and the size of the Cassandra cluster is going to grow linearly over time. So again this comes automatically for you. You can see that it’s happening in this particular example, if you look at the bottom, you’ll see inside of my Spark plan, I actually have this line, Cassandra Direct Join, just right in the middle and that’s showing you that the optimization actually took place. We have more documentation about this on the Spark Cassandra Connector so I encourage you to look there if you’re interested. In addition, we have provided through Spark SQL the ability to use functions to retrieve TTL and Writetime from your underlying Cassandra database.

ITLWritetime Support

So just like you were using CQL and you can call a TTL function or a Writetime function, you can now do the same thing inside of Spark SQL. So for example if I wanted to take the TTL Writetime data out of my dogs table, I can now do that by just calling select TTL age comma WRITETIME age, and it’s gonna automatically transform those functions into the metadata column references that are required to get that information about the columns. So I’m actually able to export all of the Cassandra information inside of Spark. And I know this is a really common issue for a lot of people who wanna do work based on Writetime or based on TTL and you wanna work on that inside of Spark SQL.

And then the last optimization I wanted to talk about is an in clause to join optimization that we’ve added.

InClause to Join

So this is something we saw happening with a few of our users. And basically it’s for users who have a composite partition key and are looking up for a combination of many values inside of that composite partition. So in this particular example, I’m showing that we have a three-part key, A, B, C, and we’re looking for the values between one in 100 of A one in 100 of B, and one in 100 of C. But I really just want the cross product of all of this, so I want one one one, one one two, one one three, and as you can see that explodes into a very large number of actual values. Now the old way the optimization would work is that it would take all of those values and pack them into a single CQL statement and send that off to Cassandra. And that was bad for a lot of different reasons, one of which being that sending that much information in a single CQL statement is really bad for Cassandra. So instead we thought what is the more Spark way of doing this? And what we changed is that now whenever we see that sort of operation happening, instead of sending out a single CQL reference, single CQL request, we’re gonna switch it into a list of partition keys and do that join optimization that we talked about previously. So now instead of having to make all of the combinations of A, B, and C, put them into a single CQL statement, we’re going to have all of our executors distributively take various permutations of these keys and run them asynchronously as single partition key lookup requests. So again I encourage you to check our documentation, more details on how to set this up and how to configure it. But basically any time you’re doing a big in clause now, we’re going to optimize that into a distributed join, so it’s gonna end up being much more efficient, easier on Cassandra, and much easier on Spark.

So not last but not least, one of the coolest things that we’ve added now is also that we’ve added full support for DataStax Astra and the Java Driver 4.0 which means if you’re using Java Driver 4.0 connection profiles and things like that, you can now use those directly inside of the Spark Cassandra Connector. You don’t have to use any of our configuration code if you don’t want to. If you’ve got a configuration file for your application, you can just set that as your catalog’s configuration and it will use that file and connect to your cluster. In addition, we fully support DataStax Astra which means if you get your connection bundle from DataStax Astra, you can now use that in the Spark Cassandra Connector natively as well. So this means that if you’re using Databricks and you’re using DataStax Astra, you can now use a cloud-to-cloud connection with almost no additional setup. So you can have your complete database in the cloud and you can have all of your Spark in the cloud and you never actually need any local resources at all. So this is a really exciting development. So this is all cool, but you’re asking how am I gonna get ahold of this? I wanna know when can I actually start using all of these new features. Well, the answer is if you’re interested in a lot of this, you can get it right now. So we have already released the Spark Cassandra Connector 2.5.0. And inside of this, there are all of those DSE features that I mentioned, everything, including the Astra support is now currently available and compatible with Spark 2.4 and all of the other Spark 2’s. We’re going to keep maintaining this branch for as long as Spark 2.X remains a popular platform in the community, but 2.4 does not really support the DataSource V2 code so it does not actually have those features. If you want all of the DSE features as well as DataSource V2 catalogs, support, and all of that great stuff, the Spark Cassandra Connector 3.0.0 is gonna be released probably some time between the recording of this talk and the presentation of this talk, and we are gonna release it as soon as Spark 3.0 has a valid release target. So this is all coming very, very soon, and we’re gonna add all these, and then we’re gonna continue adding more DataSource V2-related features like deletes and things like that in the future. So all that great new stuff is gonna go into the 3.0 branch of the Spark Cassandra Connector. So watch those two releases. And of course the important thing about all of this is that it’s all open source which means if you’re interested in the code for how any of this works, you’re welcome to go online and check it out.

OSS needs you!

Our GitHub is listed there. We also at DataStax provide a community forum called It is a Stack Overflow style question and answer site so you can ask all sorts of questions and there’s experts from DataStax there all the time answering all of your questions. And of course like any open source project, we live and breathe by our mailing list. So down there I’ve got a link for our Google Groups account. Feel free to join the Spark Connector user group and send us any feature request, any bug notices, any doc corrections. Anything you think we should know about, we would be glad to hear it up. And in addition if you wanna be someone who’s writing code for the Spark Cassandra Connector, this is also a great opportunity. Send us an email to the mailing list. Tell us about what you’re interested in and we’ll be glad to help you find a project, find a feature to work on for yourself. So thank you so much for listening to my talk. I hope that this has inspired you to try all of our new features, our new code, and get involved with the community. So please let us know how everything is doing. Be sure to let me know how this talk was. I hope I was able to convince you that the new DataSource V2 stuff is very exciting and is gonna help you in your work. And again thanks for listening.

Try Databricks
« back
About Russell Spitzer


After completing his PhD work at University of California, San Francisco, Russell joined DataStax to fulfill his deep longing to work with distributed systems. Since then, he has worked with Cassandra, Spark, Tinkerpop, Hadoop, as well as a myriad of other big data technologies. His favorite hobby is finding new ways of bringing these technologies together so that everyone can benefit from the new information age.