Geospatial Options in Apache Spark


Geospatial data appears to be simple right up until the part when it becomes intractable. There are many gotcha moments with geospatial data in Spark, and we will break those down in our talk. Users who are new to geospatial analysis in Spark will find this portion useful, as projections, geometry types, indices, and geometry storage can all cause issues. We will begin by discussing the basics of geospatial data and why it can be so challenging. This will be brief and will be in the context of how geospatial data can cause scaling problems in Spark. Critically, we will show how we have approached these issues to limit errors and reduce cost. There are many geospatial packages available for Spark. We have tried many of them and will discuss the pros and cons of each using common examples across libraries. New users will benefit from this discussion, as each library has advantages in specific scenarios. Lastly, we will discuss how we migrate geospatial data. This will include our best practices for ingesting geospatial data as well as how we store it for long-term use. Users may be specifically interested in our evaluation of spatial indexing for rapid retrieval of records.


Video Transcript

– Hello, everyone. Thanks for joining this virtual session.

Geospatial Analytics in Spark

I think it’s a little bit different for everyone, so we’re just gonna see how this goes. I’m really excited to share with you some of the things that we’ve learned as we’ve transitioned our geospatial analytics into Spark. For those of you that have been doing this for a while, the first part of the presentation may be a little repetitive, but I think it might help some of the managers that haven’t done a lot of geospatial work understand why this has been so hard. When we were doing this, there were a whole lot of unknown unknowns that really got us caught up. I’m hoping that this presentation will help other people that are new to this avoid those issues.

So the goal of this presentation, just to box this in, is really to provide practical examples of pre-processing and analyzing vector data at scale. There’s a huge amount of variety in geospatial data, so we’re really just focusing on polygon and point data; rasters, satellite imagery, and all that kind of stuff are outside the scope of this part of the presentation, although some of the things will be relevant in that space as well.

So in terms of a quick agenda, we’re gonna go through a little bit of housekeeping, then we’ll talk about the challenges with geospatial analytics. Why is this so hard? That’s a question that my manager is asking all the time. And then some case studies. So we’ll do some practical use cases and functional examples. I got a little ambitious with this when I put the slide deck together and I had three, we actually had to cut it back to two, I do have the third one available. It’s a 30 minute video all by itself, and I didn’t wanna take up the entire presentation with that.

So right off the bat a little bit about PNNL. PNNL is part of the Department of Energy’s National Lab Complex.

About PNNL

This originated from the Manhattan Project. So you might be familiar with some of the other labs like Los Alamos, Sandia, or Oak Ridge. Those are all part of the same lab complex, the same team, so we all work together to solve some of the nation’s challenging problems. There are 4,400 staff that work for the Pacific Northwest National Lab. We’re kind of all around the country, though most people are in the Pacific Northwest, not surprisingly. We do about $1 billion in business, and it’s interdisciplinary and highly matrixed, which makes it really interesting to solve some of these data science problems, because we might walk into a room and there’s a fish biologist, a nuclear researcher, and a rocket scientist that just came out of SpaceX. Makes life interesting.

So I must say thank you to a bunch of the people that are on our team. Again, we’re highly matrixed so drawing the line on the team can be a little weird, but these are just a listing of a few people that have been extremely instrumental in helping us kind of get going in Databricks to understand the geospatial realm and really deliver impact to our sponsors.

So I feel like it’s important to put a disclaimer out there: I’m not a geospatial oracle. I’m not sure if that even exists. I’ve met lots of really intelligent geospatial people, but we all seem to bring a little something different to the table. This presentation documents my current knowledge and experience, but the landscape is constantly changing. So if you watch this two, three months from now, who knows, it may be different. I hope not. I hope that this will still provide a lot of value, but I feel like it’s good to be honest right up front. Throughout the demos, I’m gonna assume you’re familiar with a couple of things like UDFs and window functions. If you’re new to Spark completely, fortunately this is a virtual conference, so hit pause and come back; the examples should have enough detail that you’ll be able to run through them.

Challenges with Geospatial Analytics

So now we’re getting into the challenges with geospatial analytics. I think there are a number of things with geospatial analytics that are kind of step zero in getting started. And I mean, in data science, wrangling data is already like 80% of the problem; there’s this whole part of geospatial that comes before we even start wrangling data. Some of those things are projections: lat, lon, elevation, how does that actually represent a point on a globe? Indexing: that’s usually pretty easy with strings and floats, but it becomes a little bit different when we talk about geospatial data. Finding and curating data: shockingly hard. Error trapping: there are lots of different ways geospatial data can go awry. And then lastly, we’ll talk a little bit about system libraries. I’ll put this out there at the end: the easiest way to do geospatial analysis is for it to not be geospatial. I said that kind of jokingly when I first started doing geospatial work; I feel like I say it more seriously now. Geospatial stuff can be a little bit of a rabbit hole. If you can convert it into a different space where there’s a little bit more provenance, sometimes things get easier.


So let’s start with projections. I think most of us can agree that the earth is not flat, but most analytics run in 2D space. When I first started doing geospatial analytics, I had this assumption that if somebody gave me a lat, lon, and an elevation, I knew where that point was. That’s not the case; there needs to be some sort of origin that grounds that lat, lon, and elevation, and that’s what a datum is. A datum gives an origin for that sphere. And that datum can be really important depending on which region of the earth you’re in, because those datums change. So if you’re pulling in data from China, and Europe, and North America, they may have different datums. It’s not challenging to switch between them, but it’s really important to know which one you’re receiving your data in so you don’t make some major errors. Going from, say, NAD83 to NAD27, if you’re doing some analysis in Northern California, you might get a noticeable offset. So even though your point on the actual physical globe is the same, your lat, lon point is completely different. That’s one of those things that was like step negative one for me right from the beginning; it took a while to fully understand, and it’s critical, especially as you’re doing global-scale analytics.

The next thing, which I think there was more familiarity around, is projections. This is how we represent the globe on a 2D plane, and that makes the math a lot simpler. So, all projections are wrong, just to get that out there. Don’t think that there’s a perfect one; some will be better for different scenarios, and you could spend an hour talking about projections alone. What we found is that all projections are wrong, so pick one that makes sense, record it, and move on. Be consistent about it. In terms of a pro tip: if you’ve projected a geometry in a database or something like that, adding the projection ID to the column name can alleviate a whole lot of errors in the future.
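To make that column-naming pro tip concrete, here is a small hypothetical helper (the function name and the `_epsg` suffix convention are our own illustration, not code from the talk) that tags geometry columns with the EPSG code they were projected to:

```python
def tag_projection(column_names, geometry_columns, epsg_code):
    """Append an EPSG-code suffix to geometry column names.

    Renaming 'geometry' to 'geometry_epsg4326' makes the projection
    explicit everywhere the column travels, so a later join or export
    can't silently mix datums or projections.
    """
    return [
        f"{name}_epsg{epsg_code}" if name in geometry_columns else name
        for name in column_names
    ]
```

For example, `tag_projection(["id", "geometry"], {"geometry"}, 4326)` yields `["id", "geometry_epsg4326"]`, so anyone reading the table later knows exactly which CRS the geometry is in.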

So, indexing. Generally we wanna do something with this geospatial data: we wanna join on it, or we wanna search and see who’s in what regions. Point-in-polygon searches are really expensive, especially when the search space is not limited, and that’s pretty much what you get when you don’t have an index. It may not be clear when data is not indexed. In general, in Spark it’s just not indexed, you just have the data there, but if you’re used to other platforms like Elasticsearch or SQL, sometimes those indices get generated for you without you realizing it, and that’s what can make your data really quick there and appear really slow inside of Spark. There are lots of options when it comes to indexing. You can do geohash, which is what Elasticsearch uses natively; that uses a rectangular grid. It’s really handy because you can move up and down the resolution scale just by changing the number of characters you’re looking at, but its grid is not constant across the globe, because of those projection issues we were talking about. Quadtree is pretty similar. Another one that we’ve used, and have started to like quite a bit, is the H3 indexing system, which uses hexagonal grids. While it doesn’t give you the ability to go up and down the index tree as easily (there are functions to do it, but it’s just not as simple as dropping a character), it does have near-constant grid cell size across the globe, which makes rolling up data, doing summations, and things like that a lot more meaningful. It’s also possible to index the base RDD within Spark; I generally reserve that for advanced cases, ’cause most of our colleagues aren’t really familiar with that one, and doing that indexing generally requires some knowledge of Scala or of the back end of Spark dataframes.
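The "drop a character for a coarser resolution" property of geohash is easy to see with a minimal pure-Python encoder. This is a sketch for illustration only; in practice you would use a library (or Elasticsearch's built-in support) rather than hand-rolling it:

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet

def geohash_encode(lat, lon, precision=9):
    """Encode a lat/lon into a geohash string of the given precision.

    Bits alternate between halving the longitude range and the latitude
    range; every 5 bits become one base-32 character. Because each extra
    character only refines the previous cell, truncating the string
    always yields the enclosing coarser cell.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    even = True  # even bits refine longitude, odd bits refine latitude
    while len(bits) < precision * 5:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1); lon_lo = mid
            else:
                bits.append(0); lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1); lat_lo = mid
            else:
                bits.append(0); lat_hi = mid
        even = not even
    chars = []
    for i in range(0, len(bits), 5):
        n = 0
        for b in bits[i:i + 5]:
            n = (n << 1) | b
        chars.append(_BASE32[n])
    return "".join(chars)
```

Encoding the same point at precision 11 and precision 5 gives `u4pruydqqvj` and its prefix `u4pru`: the coarser hash is literally the finer hash with characters dropped, which is exactly what makes prefix-based resolution changes so convenient.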

Finding and Curating Data

So, finding and curating data. This is always one of my favorite topics, because somebody says, “Oh, just go grab the census data, it can’t be that bad.” That usually leaves me laughing, because there are tons and tons of different formats this data may come in. It may have different projections and datums, like we talked about, and it may be stored in various ways. So you might have something in JSON format, a shapefile, a geodatabase, a raster, a CSV; you can have all these different formats, and ultimately we just wanna be able to bring that data together in a cohesive way. What we found in terms of our process is: we read the raw data in whatever format it’s in, then we validate the geometry right off the bat, and we try to validate geometry as many times as possible. Then we convert it over into well-known text. Well-known text is a string representation of a geometry object, and that just makes it easier for us to store it in some sort of flat file like a CSV or Parquet. I put both of those here because we usually do store both of them. The CSV is generally easier for us to get to initially from, say, a geodatabase, but Parquet offers better read/write performance, better compression, things like that. And in our demo, I’ll go over how we go from a JSON file to a CSV file.
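The validate-or-fallback pattern described above can be sketched in a few lines. Note the hedge: the real pipeline deserializes with an actual geometry library (Shapely on the Python side, JTS on the JVM side) and runs a buffer to repair bad rings; this pure-Python stand-in only does a light syntactic check, and the `clean_wkt` name is our own invention:

```python
# Geometry types we expect to see at the front of a WKT string.
_WKT_TYPES = ("POINT", "LINESTRING", "POLYGON",
              "MULTIPOINT", "MULTILINESTRING", "MULTIPOLYGON",
              "GEOMETRYCOLLECTION")

def clean_wkt(wkt):
    """Return wkt if it passes a light sanity check, else 'POINT (0 0)'.

    A real pipeline would deserialize with a geometry library and run
    buffer(0) to fix self-intersections and unclosed rings; this sketch
    only checks that the string names a known geometry type and has
    balanced parentheses, falling back to the most basic point possible.
    """
    if not isinstance(wkt, str):
        return "POINT (0 0)"
    s = wkt.strip().upper()
    if not s.startswith(_WKT_TYPES):
        return "POINT (0 0)"
    if s.count("(") == 0 or s.count("(") != s.count(")"):
        return "POINT (0 0)"
    return wkt
```

The key design choice is that a bad record is replaced rather than dropped, so downstream joins keep the row and its attributes even when the geometry could not be recovered.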

System Libraries

So in terms of system libraries, you can use cluster init scripts or Docker containers to install the low-level libraries, things like GDAL or whatever it may be. A lot of the higher-level libraries will use those lower-level libraries to do the actual math piece. We do that when necessary, but we avoid it when possible. We found that the Docker containers work best in most cases. We’ve also found that packaging our Python files up as wheels and deploying those has made life a lot easier, especially when integrating tests. I’ll talk a little bit about that in our third use case. In terms of some useful geospatial libraries that we’ve leveraged and had good luck with: GeoPandas is great, scikit-mobility is another fantastic one, and MovingPandas is another fun one to use. I grouped those three together because they’re Python libraries, so they’re not natively Spark. You need to wrap these in a pandas UDF or something similar, and make sure that you understand the trade-offs in terms of performance on that.

The other two that are really fun are Spark libraries: RasterFrames and GeoSpark. We’ve had great success with both of these. GeoSpark is pretty widely used and under active development, so it’s a great library, and it provides some of those advanced indexing capabilities that we talked about. I find that the important thing with libraries is finding the balance between user knowledge and compute time. A user might be very familiar with pandas and be able to bust something out really quickly, but we’ll have to pay for it on the compute side. Sometimes that makes sense, sometimes it doesn’t; it really just depends on your users and your compute budget.

Examples – Indexing

I’m gonna skip over the indexing. Needless to say it’s important.

And now we’ll jump into some of the case studies.

Large Scale Geospatial Joins

So in terms of large-scale geospatial joins, this is kind of where I started with geospatial stuff, and I feel like this is pretty common. You’re given a huge amount of geospatial data and someone says, “Can you just join these together and do some sort of summation?” So that’s where we’ll do our first demo. I’ll walk through how we get our JSON files into CSVs, how we load them in, create the index, and create a usable dataframe back in Databricks. So now we’ll transition into the demo.

So what we’re gonna talk about in this demo are large-scale geospatial joins. This might be something that would mirror a workflow in a SQL environment, but there are a number of unique issues that arise when we’re working in Databricks. In this case, if you had this data already in SQL, it would be great: it would already have an index, you could do your quick join, and things would work out. The issue is that if you’ve got a dataset that’s a couple hundred million records, and you wanna do a one-time join to figure out some sort of analytics, it doesn’t necessarily make sense to stand up an enormous SQL server to meet that need. That’s where Databricks can come in and really save the day, but it introduces a few extra issues. The first is data formats. Data formats are always a little different; we’ve talked about this. In this case, our buildings are gonna be in JSON and the census blocks are gonna be in shapefiles. I’ll show how we take the JSON files and convert those into the CSV that we talked about before, with a well-known text string. That’ll get us into a common format that we’ll be able to read in with native Spark: it reads standard CSV, so we can read the entire directory and all the buildings in a distributed manner. Next, we’ll talk about indexing. Indexing is critically important in geospatial terms. Without an index, it’s gonna do the full join and try to see what matches; the index is what allows it to dynamically subset your data. Then we’ll talk a little bit about join performance, and then output storage.
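To see why an index turns a brute-force cross join into a targeted one, here is a toy pure-Python sketch (this is not GeoSpark's implementation, just the idea) that buckets points into grid cells and only tests them against rectangles whose bounding boxes overlap those cells:

```python
from collections import defaultdict

def cell(x, y, size):
    """Grid cell coordinates for a point at the given cell size."""
    return (int(x // size), int(y // size))

def grid_join(points, rects, size=1.0):
    """Join points to axis-aligned rectangles via a simple grid index.

    points: list of (id, x, y); rects: list of (id, xmin, ymin, xmax, ymax).
    Instead of testing every (point, rect) pair, each rectangle only
    looks at points bucketed into grid cells its bounding box covers --
    the same subsetting trick a spatial index gives a join.
    """
    index = defaultdict(list)
    for pid, x, y in points:
        index[cell(x, y, size)].append((pid, x, y))
    matches = []
    for rid, xmin, ymin, xmax, ymax in rects:
        cx0, cy0 = cell(xmin, ymin, size)
        cx1, cy1 = cell(xmax, ymax, size)
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for pid, x, y in index.get((cx, cy), []):
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        matches.append((pid, rid))
    return sorted(matches)
```

With 125 million buildings against 11 million blocks, the difference between "test every pair" and "test only the pairs the index says could match" is exactly what makes the join tractable.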

So in terms of steps to get this done, the first thing that we’re gonna do is, we’re gonna convert our GeoJSON file into CSV. Then we’ll read those CSVs into a dataframe. I’ll gloss over this a little bit, in this case, just in the interest of time, but we’ll show you what’s going on, then we’ll create a geospatial index, do the join and then finally, we’ll write output back to a Delta table that can be used in future parts of the ETL process. So in terms of the GeoJSON files that we’re gonna use, we’re gonna use the Microsoft buildings dataset. This is a great dataset to start playing around with, it has a little over 125 million buildings in it. So you can really get an idea of how your algorithms scale using this dataset. This is pre-separated out by state, which is pretty nice. So if you know you’re working with a single state, you can start there and then you can quickly spin up for the entire US. I will say that OpenStreetMap also has buildings available.

There are sometimes differences between the two; it’s up to you which one you wanna use, and when. OpenStreetMap sometimes has more attribution; the Microsoft dataset is really just a set of outlines. So it’s good to play with the two. What I did in this example is I just downloaded this District of Columbia file. It’s a zip file that contains the GeoJSON. And what I’m gonna do is just show you really quickly how we can convert that into a CSV file. So I’m gonna open up OSGeo4W. This is simply a Windows tool that I use that essentially just has ogr2ogr and GDAL installed on it. If you’re on a Linux or a Mac system, installing those tools is pretty easy. If you’re on a Windows system, like a few other things in Windows, it can sometimes be a challenge, so this tool, I found, simplifies things quite a bit. So what we’re gonna do is just go into that directory, and you’ll see that there’s a GeoJSON file. Then I’m gonna call ogr2ogr, and I’m gonna ask it to convert this into a CSV. I’m going to give it the output name.

And then we’re gonna ask for a progress bar, and then I’ll just give it the file name. So run that, we’ll see the progress bar start; it runs pretty quickly. And now what I’ll do is open that up with VS Code, and you’ll see that it’s empty. Now, this is the first thing that I ran into, and it’s always a little confusing: you have to add an extra layer creation option, GEOMETRY=AS_WKT, to tell it that you want the geometry. So now if we give it that command, it’ll know that we would like the geometry objects inside the GeoJSON to be represented as a well-known text column. So you’ll see this should update in just a second. And there we go, there’s our well-known text. I will say that this tool is not perfect when it comes to creating well-known text strings. I’ve had issues reading these, and if you consider them to be golden and perfect, you’re probably in for a bad day. As part of our ETL process, the first thing that we do is check to see if the geometry is valid. If we can’t deserialize it and there’s an issue somewhere, we replace it with a POINT (0 0), just the most basic point you can use. The next thing that we always do is run it through a buffer, because the buffer will fix any sort of loop issues, or if different points don’t align, it’ll clear that up immediately, and then we can run the rest of our process on that final, cleaned well-known text. So at this point, we have a directory of CSVs. We’re gonna assume that we’ve run through that validation process and read that into a raw table that has our buildings. We can follow the same process with the shapefiles for the blocks, and that’s what we’ve done; we just store those as Parquet, and those are available. One of the things that we realized early on is that getting geospatial libraries into Spark sometimes requires a lot of libraries. We’re using Scala here.
And to make our life a little bit easier, we created this helper function notebook, and that just imports a whole bunch of libraries. You can look through these; the main ones that are important in this case are the GeoSpark ones. We also use the LocationTech stuff, that’s pretty handy, and we have H3 there as well, if we wanna do some H3 indexing. So here we’re just gonna refresh our tables, and then we’re gonna start doing our partitioning. I’m gonna say that we’re gonna do 1,000 partitions; you can play with that number, sometimes more or less makes sense, but I found that 1,000 seems to work pretty well for what we’ve been doing. We’re gonna read in from that geometry buildings table that we created in our ETL process, and then I’m just gonna create a blank RDD using SpatialRDD from GeoSpark SQL. So now for the buildingRDD’s rawSpatialRDD: what I’m gonna do is take that building dataframe, convert it to an RDD, and set that up as the rawSpatialRDD. The next thing is we’re gonna deal a little bit with projections. There’s no projection with those CSV files; there’s no projection inside of these tables that we have, but I’m gonna set the projection anyway. I know that it was EPSG:4326 going into it, and I’m gonna set it to the same, so that the index is aware of what’s going on. The next thing we’re gonna do is run the Analyze command; this will get counts and bounding boxes. Then we’re gonna run the spatialPartitioning function, which will build a k-dimensional tree with the number of partitions that we specify here. Then the next thing we’ll do is run buildIndex; this will actually build the index on our RDD. We’ll do that with a quadtree, and buildOnSpatialPartitionedRDD is set to true. So this process takes about 14 minutes on 125 million buildings. Our cluster here is pretty small; I think it goes up to four machines with four cores each.
So this is not a very expensive operation, and you’ll see that now we have an indexed spatialRDD. The next thing we need to bring in are the blocks, and we’ll do pretty much the same thing: bring in the blocks table, create the RDD, then the same toRDD, spatialRDD, set the CRS. The really important thing here that’s just a little bit different is that the spatial partitioning must be the same between the RDDs that you’re trying to join. So here, when we specify the spatialPartitioning, we could say that we want the k-d tree with the number of partitions, but even if we were to specify the exact same values, we’d get a different partitioning, and the two RDDs would not be able to join. So what we do here is actually pull the partitioner from the first RDD and say we would like that to be used as the spatial partitioning. So we’re good there. It doesn’t actually matter which index type you use here; it’ll work just the same, so I used an R-tree here, and then we set buildIndex’s flag to true. And I think this is 11 million; that says 11 million census blocks, and that runs in about 3 1/2 minutes. Here I just do a count to get an idea of how many objects there are gonna be: there are 11 million blocks, and then there are 125 million buildings. And now comes the fun part that took me a little while to figure out, ’cause I’m not a Scala expert, and the first couple of times that we did this, it was a little bit of a challenge. We need to do the JoinQuery, where we do the SpatialJoin between the buildings and the blocks, and that’ll give us the pairs. Then we convert that over to a dataframe with the Spark session. The issue with this is that it doesn’t have any column names on it. So if we look at this output, it has the geometry and all of the raw columns, but it doesn’t have the names.
So the last thing that we need to do to get these column names back is just create a sequence of the columns from the two original dataframes, ’cause that’s what we’ve joined, and then we’re good to go. This will give us the schema, and then you’ll see here we’ve got the block_geometry, the area, all the columns from the block, and then the building and all of the columns that we expect to have there. So this is now a Databricks or Spark DataFrame, and we can do whatever you would expect to do with a Spark DataFrame. We can take it and write it out as a Delta table. I always run an OPTIMIZE if it’s a Delta table, to compact the Parquet files that are written out. We could also have added an H3 column at the front of this; that would help if we’re gonna do other joins in the future, to maintain a little bit of a geospatial index on those files for future work. So I hope that helps. That’s a quick overview of how we go through doing our geospatial joins. If you have the ability to create an index like this, or if it’s gonna be a one-time thing, GeoSpark is a great library; it’s very active, they’ve got really great tests built into it, and I highly recommend checking it out.

So now after that demo, we’re gonna talk a little bit about spatial disaggregation. I’m gonna kind of blow through this because when I did the demo for this, this was the one that was 30 minutes, but this is what allowed us to create these kind of beautiful maps that you see here.

Spatial Disaggregation

I will say, the use case here was that an analyst wanted a map of where personal protective equipment was needed across the US at specific resolutions, so we needed to create that map. The challenge was, we had some sort of intensity data on one side, and then there was some sort of magic our sponsor wanted to get to the map on the other side.

To do this, we followed this workflow: we generated some NAICS intensity information, which is around employment categories. Then we were able to use some information from the Census Bureau; they create this County Business Patterns data that gives a really good idea of what’s happening in each county in the US, so that gives us county-level intensity. Then there’s some other data that we can bring in to get block-level information, and blocks are pretty fine-grained in the US. And then we use one of the geospatial indexing methods that we talked about earlier, which is the H3 grid. Once you have the H3 grid, it’s pretty trivial inside of Databricks to do your summations, create xyz coordinates, and then generate the map at the end. So if there’s interest in that demo, I’m happy to share that link out; it’s half an hour. So we’ll skip over that demo, and now we’ll go over the pattern-of-life case study. In terms of pattern of life, the use case here is that as a research and operations team, I would like to understand patterns in geospatial data.
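The arithmetic at the heart of that disaggregation step is just proportional allocation. A minimal pure-Python sketch (the real pipeline does this with Spark aggregations over H3 cells; the function name and even-split fallback here are our own illustration):

```python
def disaggregate(total, weights):
    """Spread a coarse total across finer units proportional to weights.

    E.g., spread a county-level intensity across its census blocks,
    using block-level employment counts as the weights.
    """
    denom = sum(weights.values())
    if denom == 0:
        # No signal at the finer level: spread evenly as a fallback.
        return {k: total / len(weights) for k in weights}
    return {k: total * w / denom for k, w in weights.items()}
```

So a county intensity of 100 with block weights {b1: 3, b2: 1} becomes {b1: 75.0, b2: 25.0}; summing the block values always recovers the county total, which is the invariant to check when validating a disaggregation.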

Pattern of Life

This is pretty common; a lot of sponsors ask us this question. So this is something where researchers need to develop new algorithms, and then our operations team needs to leverage those algorithms on new data and at scale. The question is, how do we connect these two groups in an effective way?

Entity Dataframe

So we’ve come up with a couple of what we’ve defined as domain classes. One is an entity (or path) dataframe, depending on what you wanna call it. This is a class with a known schema and a series of spatial transformations, and I’ll go over how we apply these in the library during the demo. What’s nice about this is that there’s some sort of commonality: there’s an entity_id, a lat, lon, and a timestamp, and those are all things that other users of the library can build on.

Polygon Dataframe

We then also have a polygon dataframe, another domain class as we call it, that has some well-known text representing the polygon, plus the name, description, source, things like that. We then apply transformations to this that allow us to do all of our geospatial indexing, which makes people’s lives easier.

And then lastly, we have a pattern-of-life class, which helps us combine the polygons and entities in interesting ways to understand what users are doing.

Pattern of Life

We anticipate expanding this library out to other domains that build upon these use cases.

So now we’ll jump into the demo. We’ll go through a little bit of the library, what we think has been pretty useful, and we’ll pick that back up.

This demo was originally gonna show how we’ve generated spawn locations for users, but we’ve done enough Spark queries in these demos already that I think it might be more valuable to dive right into the code and show you a little bit about what we’ve done, why we’ve done it, and some of our design decisions, to help you understand if this is gonna work in your situation, or if you wanna collaborate with us on this library. Let’s just jump right into the code.

So I’m gonna jump into our path dataframe. For those that are familiar with scikit-mobility, this will look kind of similar to a trajectory dataframe. What we have here is just some basic documentation where we say we need certain columns, so a user that’s gonna instantiate this knows that they’re gonna need some lat, lon, entity, and timestamp columns, and we just define our arguments. In terms of the init function, this requires a Spark DataFrame coming in, and then just some basic information on those other columns: the lat, lon, and timestamp. We don’t assume that those are gonna have constant names; we allow users to specify them, and then we make adjustments immediately. The other thing that we ask for is information on what resolutions the user would like this to be indexed at. Hashing at multiple resolutions helps us avoid any issues where a point might be right near the edge of an H3 cell. So we set the raw dataframe equal to the dataframe that the user has, and then we just instantiate some other dataframes that we’ll potentially have inside of the class.

Right off the bat, we’re gonna go ahead and standardize column names. This is an issue that we’ve hit multiple times, and I can’t tell you how many hours we’ve lost trying to figure out what the name of a lon column is, or trying to make sure that a timestamp column has the right type. So the first thing we do is standardize column names: we go ahead and apply a pretty simple transformation to the original dataframe.
So this is gonna go through and make sure our types are correct, and it’s gonna set our timestamp; it’s also gonna create a long timestamp, which is really handy for future algorithms. And then we set that back to the raw dataframe. One thing that I’ll point out here is that we’ve wrapped this in this capture_transform function. This is another thing that we’ve created as part of this library, to help understand the logical groupings behind these transformations. Spark keeps track of what all of these transformations are, but we wanted to be able to know what part of our library made an adjustment, because our goal is to chain a lot of these things together to develop more complex algorithms. And because this is Spark, these things aren’t materialized until somebody calls something like show or display or save, so we wanted to know where to look. All capture_transform does is ask: what’s your original dataframe (this helps us understand any differences between the dataframes; we look at the schema and look for changes), and then it asks for a name for this transformation. I just named this one “path column standardization”, and then you provide column descriptors. That’s what’s up here that I kind of blew past. This just helps the end user know a little bit more about those columns. Lat, lon, and timestamp are all pretty self-explanatory column headers, but as you do more complex algorithms, some of these column headers become a little bit more nebulous, so this is a really great way for us to add some descriptions. So when capture_transform is called, it stores all of those transformation steps, all the metadata around them, and the differences between the schemas, and it attaches that to the raw dataframe as a metadata object.
And then we wrote some helper functions so that when somebody writes out one of these dataframes, all that metadata is also persisted into a database. That’s searchable in the future, so you can very quickly search for a column name and find out which transformation was responsible for generating it, and also what its full description is. So that’s really cool. Now I’m gonna go through and show just two other quick functions. One of the things a user might wanna do is actually get the hashes. What we do here is use the raw dataframe and call, essentially, a transformation that we’ve written that will do the H3 hashing. This also captures all the metadata as it transforms; it has its own name, all that kind of stuff. So now we have a hash dataframe, and what’s fantastic here is that we can start to chain these things together. Maybe a user is interested in getting speed. If somebody calls get_speed_data, instead of having to figure out how they’re gonna do all these other steps, how they’re gonna standardize the names, all those things, they can very easily just call this get_hash_dataframe function that will pull all of these transformations in with it, and then they’ll get exactly what they need, with all the metadata there, all of it correct. And this makes it really easy for people to build upon these dataframes. So I’ll jump into this really quickly and just demonstrate how some of these work.
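The lineage-capture idea can be sketched in plain Python. To be clear about assumptions: the names capture_transform and "path column standardization" come from the talk, but the implementation below is our own minimal stand-in — it treats a plain dict as the "dataframe", whereas the real library attaches metadata to Spark DataFrames and diffs their schemas:

```python
def capture_transform(name):
    """Decorator that records transformation lineage as metadata.

    Wraps a transformation, notes its name plus the column descriptions
    parsed from its docstring, and accumulates that lineage on the
    result so it can be persisted and searched later.
    """
    def wrap(fn):
        def inner(df):
            out = fn(df)
            lineage = list(df.get("_lineage", []))
            lineage.append({"transform": name,
                            "columns": _parse_columns(fn.__doc__ or "")})
            out["_lineage"] = lineage
            return out
        return inner
    return wrap

def _parse_columns(doc):
    """Pull 'col_name: description' lines out of a docstring."""
    cols = {}
    for line in doc.splitlines():
        line = line.strip()
        if ":" in line and not line.endswith(":"):
            key, _, desc = line.partition(":")
            if key.isidentifier():
                cols[key.strip()] = desc.strip()
    return cols

@capture_transform("path column standardization")
def standardize(df):
    """Standardize raw column names.

    entity_id: unique identifier for the moving entity
    ts_long: timestamp as epoch seconds, handy for window functions
    """
    out = {k.lower(): v for k, v in df.items() if not k.startswith("_")}
    out["ts_long"] = 1234567890  # placeholder for the real conversion
    return out
```

Calling `standardize({"Entity_ID": "u1"})` returns the renamed columns plus a `_lineage` entry naming the transformation and describing each added column, which is the searchable record the helper functions would persist.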

Again, documentation, we feel, is really important. We decorate it with capture_transform, so this will also keep all of the metadata around this transformation. We tell the user what kind of columns are gonna be required, in case it's not used in the way that we might expect within the library. And then, in the same way, we describe the columns that are added; we just put those in the docstrings here. So when the dataframe is generated, a user will know that, and our capture_transform function will parse through the docstrings to capture these descriptions. If descriptions are missing for columns, we throw errors, so the build doesn't actually pass. So that's basically a high-level overview of what we've done and how we're able to chain these things together. This makes it, like I said, really easy for somebody to come through and say, "Okay, I wanna aggregate some speed information up." They can be very well assured of what is gonna be in that speed dataframe and just write their couple of aggregation steps on top of that.
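The docstring check might work along these lines. The `:col name: description` convention here is an assumption; the talk only says that column descriptions live in docstrings and that missing descriptions raise errors.

```python
import re

def parse_column_descriptions(docstring):
    # Pull ":col name: description" entries out of a transform's docstring.
    return dict(re.findall(r":col (\w+): (.+)", docstring or ""))

def check_descriptions(docstring, added_columns):
    # Fail fast when a newly added column has no description, so the
    # problem surfaces at build time rather than for a downstream user.
    described = parse_column_descriptions(docstring)
    missing = [c for c in added_columns if c not in described]
    if missing:
        raise ValueError(f"missing column descriptions: {missing}")
    return described
```

The captured descriptions would then flow into the same metadata object that capture_transform attaches to the dataframe.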

So in conclusion, we found that having classes for the domain makes chaining a lot easier. It allows people to build on the work of others a lot more rapidly. Once you're pretty confident of what the structure of a dataframe is, and that the previous parts of the chain are gonna be all in the right order and correct, you can build upon it, and it makes the rest of life a lot easier. Unit tests: we have lots of unit tests for all of our modules that ensure we don't have any regressions and make sure that the contracts between functions are in place. We've also implemented a way of adding sample data into the library. This helps us explain the algorithms that we currently have, and it also helps us develop new ones. People are always coming up with new interesting use cases, and we like to throw that sample data in there so that people can work against it. As I mentioned briefly, documentation has been pretty critical in what we've been doing, using the docstrings, but having a use case pipeline has also been really helpful. This is where we'll write Python, and as part of our build process that Python gets converted into a Jupyter Notebook, and then that Jupyter Notebook gets converted into an HTML file that can become part of our documentation. Also pretty critically, that Jupyter Notebook can be very quickly imported into Databricks. That makes it very easy for somebody to find a use case that is similar to what they're trying to do and extend it. And lastly, that metadata capture piece has been really handy for us to understand which functions are doing what, and also, as we share this library around with other projects, they're able to very quickly get descriptions for each of the columns that they're looking at. So that's a quick overview of what we've been doing with our pattern of life domain class stuff. I hope that's helpful and a good overview.
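The docs build described (Python file to Jupyter Notebook to HTML) could be assembled from common tools. The talk doesn't name the tooling, so `jupytext` and `nbconvert` below are assumptions, as is the file name; they are typical choices for exactly this pipeline.

```shell
# Hypothetical build steps for the py -> ipynb -> html documentation flow.
jupytext --to notebook use_case.py           # use_case.py  -> use_case.ipynb
jupyter nbconvert --to html use_case.ipynb   # use_case.ipynb -> use_case.html
```

The intermediate `.ipynb` is also what gets imported into Databricks as a runnable starting point.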

So real quick, just lessons learned at the end. I feel like there are lessons learned at the end of everything, and geospatial is no exception. We felt that standardizing on data formats is really important; non-standard formats can be a huge red flag. Data lakes have been really useful for us to store that raw data and be able to pull from it for different projects. The domain-driven design that we've been doing with our library has helped save a number of headaches and really allowed us to reuse code at a much more effective scale. I'll say that notebooks are incredibly useful, but they're not a panacea, especially as you start to talk to different development groups.

Testing at scale, and often, is super important, especially when you're going from small-scale to large-scale data. It's easy for your small-scale data to look like it's going really well, but then you realize you don't have an index and your large-scale data explodes. The landscape is always changing, like I said, so it's important to spend a lot of time keeping up. And lastly, your problem is probably not a unicorn; you gotta leverage the talents of others to deliver the real impact. That's where the Python libraries and other geospatial libraries out there have been fantastic. Leverage 'em, use 'em, be productive, they're great.

So that's everything. Thanks so much for your time. I hope this was helpful.

About Dan Corbiani

Pacific Northwest National Lab

Dan Corbiani is a Data Scientist and Solutions Architect who designs, develops, and deploys analytic solutions for research programs. His primary thrust area is the intersection of large-scale geospatial processing and Spark, primarily within vector-based datasets such as critical infrastructure assets or entity paths. Dan has been working to implement distributed geospatial algorithms for pattern of life analysis and disaster response, and has implemented common geospatial algorithms such as DBSCAN and Getis-Ord Gi* within the Spark framework. Dan has a long history in software development, with a few tangents into materials science and systems engineering. This has allowed him to understand the requirements of researchers as well as the implementation options in the cloud.