As new geospatial data sources come online, the variety and velocity of this data make it increasingly difficult to answer intelligence problems manually. In this talk, the Booz Allen team provides insights into how we use the Databricks ML Runtime, coupled with geospatial libraries like GeoMesa, to accelerate GEOINT workflows. By combining these powerful tools, our team uses computer vision algorithms to extract data from imagery, builds patterns of life with time series analysis and co-traveler analytics, and answers key intelligence questions in a digestible way using Databricks' built-in notebook and dashboard visualizations.
– Hi everyone, my name is Donald Polaski. I’m a chief data scientist here at Booz Allen Hamilton, and today we’ll be presenting a talk on some of our work around geospatial AI. This is a topic that’s very interesting to me and my colleague, Michael Gasvoda, who’ll be presenting with me today. So let’s go ahead and jump right into it.
So, the agenda for today’s talk is really split into two blocks. First we’ll have a general overview of some of the work that Booz Allen does in geospatial AI. We’ll talk about some of the key challenges that impact geospatial use cases, as opposed to your traditional AI and machine-learning activities. And then we’ll introduce some of the solutions that we bring to the table, in conjunction with Databricks and Apache Spark, to really overcome some of those challenges. After that, I’ll pass it off to my colleague, Michael Gasvoda, who will do a deep dive into a topic of interest for our research team lately: using Black Marble, a NASA product that looks at night lights, to determine the impact of COVID-19. So, I think it’ll be pretty interesting to see, at a high level, some of the work we do in geospatial AI, and then we’ll give you a deep dive into an active area of research.
So when we think about where Booz Allen is applying geospatial AI today, the use case that everybody typically thinks of is computer vision: being able to look at an overhead image and identify, hey, where are the buildings, where are the planes, where are the trucks, where are the cars? One of the great examples of applying computer vision, AI, and ML to geospatial use cases is the xView dataset. For folks with a geospatial background, you’ve probably spent some time taking an algorithm like YOLOv3, applying it to an overhead image, and getting an image like you see on the right, where you’ve got not just your image, but the predictions coming out of these AI models. But when we think about the full breadth of the problem space for geospatial AI, it’s not just your typical computer vision applications. Our team does a lot of work in anomaly detection. We do a lot of work in co-location, where we’re asking: is this group of people typically co-located? Are they traveling together? Are they crossing borders together? Do they have the same patterns of life? And then we’re also very interested in how you take natural language processing and apply it to a document that contains geospatial information, and pull out the relevant data. You might have a document that says something like, “Hey, we need to go investigate this area over here, three miles east of Washington, DC.” For someone reading that report, that’s pretty easy to think about: what’s three miles east of Washington, DC? But for a computer, you can’t just feed it that text and then expect a point to show up on a map.
You’ve really got to be able, through natural language processing, to add context to that data, take the specific information that’s represented there in text, and turn it into something machine-readable, something that you could then feed into downstream geospatial analytics.
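To make that geoparsing step concrete, here is a minimal sketch of turning a phrase like "three miles east of Washington, DC" into a machine-readable coordinate. The DC reference point (38.9072, -77.0369) and the simple equirectangular approximation are assumptions for illustration; a real pipeline would use an NLP entity extractor and a proper geodesic library.

```python
import math

def offset_point(lat, lon, miles_east):
    """Shift a (lat, lon) point east by a given number of miles.

    Uses a simple equirectangular approximation: one degree of
    longitude spans roughly 69.17 * cos(latitude) miles.
    """
    miles_per_deg_lon = 69.17 * math.cos(math.radians(lat))
    return lat, lon + miles_east / miles_per_deg_lon

# "Three miles east of Washington, DC" (reference coordinates assumed here)
dc_lat, dc_lon = 38.9072, -77.0369
lat, lon = offset_point(dc_lat, dc_lon, 3.0)
print(lat, lon)  # a point roughly three miles east of the DC reference
```

In a real pipeline, the place name and the offset would come out of the NLP stage rather than being hard-coded, and the resulting point would feed directly into downstream geospatial analytics.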
And so when we think about the challenges that we typically face when working on these geospatial problem sets, I think first and foremost is the scale of data. Over the last 10 years, we’ve had the advent of the internet of things. We have new ways to collect geospatial data: new sensors, new satellites, and a lot of commercial imagery coming in these days. And because each of those images might be dozens of gigabytes, you get up to terabytes and petabytes of data very, very rapidly. Also, with everybody carrying around a GPS in their pockets, everyone has essentially become part of the internet of things, generating that much more data. So it quickly becomes untenable to solve geospatial problems at the terabyte or petabyte scale on a single CPU. Even pulling in supercomputers, you might be able to get to a terabyte of memory, but it’s going to be incredibly costly, and it’s going to be quite the engineering challenge. I think the flip side to that is, as we think about computer vision applications in particular, and some of the NLP applications of deep learning technology, we’ve also seen a real need to embrace GPUs to accelerate the training of these models and accelerate our ability to run inference as that data is coming in. The second key challenge that we’re thinking through is around geospatial optimization. A library like Pandas, out of the box, is not going to be great for geospatial data analysis. It doesn’t really have the concept of geospatial indexing built in. And without those geospatial indices, it’s going to be really hard to write a SQL statement or a query that allows you to find all the data within a certain radius, or to look at the overlap of two datasets and pull out just the areas where the density of points is comparable, or something like that.
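As a toy illustration of why a spatial index matters for radius queries like the one just described, here is a minimal grid-bucket index in plain Python. This is a sketch of the concept only; libraries like GeoMesa use far more sophisticated schemes (space-filling curves and the like).

```python
import math
from collections import defaultdict

class GridIndex:
    """Toy spatial index: bucket points into fixed-size grid cells so a
    radius query only scans nearby cells instead of every point."""

    def __init__(self, cell_size=1.0):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        return (int(math.floor(x / self.cell_size)),
                int(math.floor(y / self.cell_size)))

    def insert(self, x, y):
        self.cells[self._cell(x, y)].append((x, y))

    def query_radius(self, x, y, r):
        """Return all indexed points within distance r of (x, y)."""
        cx, cy = self._cell(x, y)
        reach = int(math.ceil(r / self.cell_size))
        hits = []
        for i in range(cx - reach, cx + reach + 1):
            for j in range(cy - reach, cy + reach + 1):
                for (px, py) in self.cells.get((i, j), []):
                    if math.hypot(px - x, py - y) <= r:
                        hits.append((px, py))
        return hits

index = GridIndex(cell_size=1.0)
for pt in [(0.5, 0.5), (0.6, 0.4), (5.0, 5.0)]:
    index.insert(*pt)
print(index.query_radius(0.5, 0.5, 1.0))
```

The key property is that a query only touches the handful of cells around the query point instead of scanning every record, which is what makes radius and overlap queries tractable at scale.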
Without those specialized indices, a lot of the work that you’re trying to do from a data-science perspective is just not going to be feasible. It’s going to take too long; it’s going to be time- and cost-prohibitive. The third piece is file formats. As you’re working with this geospatial data, everyone is pretty familiar with how to work with a JPEG or a PNG, but for some of these imagery applications where you’ve got raster data coming in, you’re dealing with things like GeoTIFFs and NITFs, specialized file formats that are more than just the pixels. They also carry a lot of metadata, which can be used as part of training these AI models and as part of your downstream processing. And on the vector data side, you’ve got things like GeoJSON and KML. These are standard for anyone working with geospatial applications, but if you don’t have a tool that really understands them, it’s hard to spin something up that can take a GeoJSON, get it ingested, and get it spatially indexed so that you can actually run your AI or ML workload on it. Now that we’ve talked about some of the challenges around geospatial AI, let’s talk about some of the solutions.
First and foremost, we have libraries like GeoMesa, GeoTrellis, and GeoPandas. These libraries are tuned to work with Spark, and they’re tuned to work with geospatial data, so being able to pull them into our Spark environment has been essential to enabling these analytics at scale. Spark itself really simplifies the process of running these analytics in parallel. So it’s a no-brainer: if you want to be doing terabyte- or petabyte-scale analysis, you’ve got to have something like Spark to do things in parallel. Then finally, we’re seeing a lot of acceleration with Databricks, with a lot of quality-of-life improvements. Databricks comes with a built-in machine learning runtime, which enables GPU acceleration out of the box. That really simplifies the process of standing up your GPU environment and gets you to GPU deep learning much faster than having to stand it up on your own. And then capabilities like Delta Lake and MLflow allow you to, number one, track your data versioning, and number two, make sure that you’re tracking each training cycle of your models. As you’re going through and doing that hyperparameter tuning, you’re able to track that much better with MLflow. Previously, we had folks tracking that stuff in text files on their laptops; now we’re able to do it at enterprise scale. So, now that we’ve talked a little bit about some of the general problems that Booz Allen tackles in geospatial AI, I want to turn it over to my colleague, Michael Gasvoda, to do a deep dive into some of the research that he’s leading for us. – So, the use case that we want to explore in depth today is a type of land use classification using NASA’s Black Marble data that Don talked about a little bit earlier.
More specifically, we want to see if we can get an idea of what areas have been heavily impacted by the coronavirus pandemic and the related quarantine orders that have gone into effect, by using this light radiance data. So we’re looking at land use classification, and really a bit of change detection in this data through time.
So Black Marble has been around for a couple of years now; the underlying sensor was launched by NASA back in 2011 to explore radiance coming off of the Earth at night. There are a lot of causes of radiance that make deep analysis like this challenging. Obviously you have human-produced light, but a lot of the time that’s masked behind other sources. Things like moonlight have a significant effect, as do particulate matter in the atmosphere and backscatter between the ground and the sensor. You also have natural effects, like seasonal variations in the amount of vegetation on the ground, the presence of snow, or the presence of water on the ground, that all obscure what we’re really trying to explore here, which is human-caused light. Luckily for us, NASA has done a lot of fantastic work in preprocessing that out and normalizing the data to only look at human-produced radiance from ground level. The sensor is on a daily orbit, so we have daily values for nighttime radiance going back to January 1st of 2012, globally. There’s a wealth of data available for this type of analysis, and it’s already been used for a variety of publications in diverse fields, including disaster relief and recovery, environmental monitoring, a lot of atmospheric science (doing things like looking at the effects of atmospheric gravity waves, which is beyond me, but super cool), monitoring carbon emissions, and socioeconomic analysis. And that’s really where our project comes in. More specifically, we want to identify places like heavy industry areas, entertainment complexes, and business districts, things that will have a lot of initial output radiance. These are high-use areas that have a lot of light output, and that have often seen a significant reduction in usage as we’ve gone into quarantine. We want to see if we can identify these areas in an automated way, to assess the scale of impacts globally.
For the purposes of this analysis, we’re going to focus on the US east coast. We’ll take one 10-degree-by-10-degree square on the globe to prove our concept, and then we’ll look at expanding this globally.
So the two primary factors that we’re interested in exploring here are the size of the initial output and the scale of the impact over the past few months. As I mentioned, we’re interested in things like heavy industry areas or downtown districts, places where you’re going to have a lot of light coming out during normal use, and we’re looking for a significant decrease in that output over time. Looking at this, it’s pretty easily framed as a linear model: you have an intercept and you have a slope. If we wanted to take a naive approach, we could just take our first point and our last point; the first point is our intercept, and then your simple rise over run gives you your slope, and you’re done. But at the end of the day, we’re still dealing with sensor outputs, and while NASA has done a lot of work in cleaning up this data and making it ready for analysis, you get a large amount of variance in there.
So if you look at the graph on the top-right of the slide, this is one pixel over that 152-day period that we’re analyzing, from January 1st to May 31st. You can see there’s a lot of variance in the data. You have some pretty clear outlier values that pop up over time, and we need to be resilient to this type of variance. So we ended up going forward with fitting a linear regression model. This is a very simple model: your X input is just the day of the year, and your Y input is the output radiance, but it allows us to be a little more resilient to the amount of noise in our data. Now, when we get into modeling, even though we’re dealing with a pretty simple model, this really starts to grow the size of our problem, because we’re treating each pixel individually, so we’re dealing with a lot of individual models. In something like computer vision, you’d deal with the entire image through one model, to see how pixels interact with each other. Here we’re interested in isolating the pixels and analyzing each individually, because we’re interested in the way each spot on the ground behaves, independent of the others. There’s still going to be some spatial relationship there, of course, but really we want to look pixel by pixel. And when you’re talking about a 2400-by-2400 image, you’re getting into a little under six million individual pixels that need their own models fit to them. So this is where we start to get into the realm of parallelization, and we’ll get a little more into that in a second. For now though, the output of each regression model is, again, our intercept and our slope: our initial light output, and the way it has trended over the past six months.
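The per-pixel fit described above, day index against radiance, can be sketched as a closed-form ordinary-least-squares fit. The talk used SciPy's linear regression function; this dependency-free version gives the same intercept and slope, and the synthetic 152-day series is illustrative only.

```python
def fit_trend(radiance):
    """Ordinary least-squares fit of radiance against day index.

    Returns (intercept, slope): the modeled day-0 light output and the
    per-day trend, matching what scipy.stats.linregress would produce.
    """
    n = len(radiance)
    days = range(n)
    mean_x = sum(days) / n
    mean_y = sum(radiance) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, radiance))
    var_x = sum((x - mean_x) ** 2 for x in days)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic pixel: a bright area dimming steadily over a 152-day window
series = [100.0 - 0.5 * day for day in range(152)]
intercept, slope = fit_trend(series)
print(intercept, slope)  # ≈ (100.0, -0.5)
```

On real data the outlier resilience comes from the fit using all 152 points rather than just the endpoints, which is exactly the advantage over the naive first-point/last-point approach.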
When we take these output values and plot them, we get the plot on the bottom-right of this slide. This shows the slope and intercept values for each individual pixel in the image. Generally, almost every pixel in the image has seen a decline in radiance output over the past few months, but some have had a much larger effect than others, particularly those with high variance. If you look at the area circled, that’s really what we’re interested in: the places that have a really strong decline in their radiance output over time, but started pretty high.
So, we were able to take our output values and pass them to a simple clustering mechanism. We used a k-nearest neighbors model to isolate those pixels for us. This is something that can be done entirely unsupervised, and it doesn’t require a lot of hyperparameter tuning to isolate the area of interest, especially on a first approach to prove the concept. So we’re able to isolate those pixels and turn this into a mask over our input image that identifies where we’re seeing these values, which can be easily plotted onto a map so we can visualize the results.
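As a sketch of this unsupervised isolation step, here is a tiny k-means clusterer run on (intercept, slope) pairs. K-means is used here as a stand-in for whichever clustering model the team actually ran, and the pixel values below are synthetic, chosen only to show the "high intercept, steep decline" group separating out.

```python
def kmeans_2d(points, k=2, iters=20):
    """Minimal k-means on 2-D points (e.g. (intercept, slope) pairs).
    Deterministic for illustration: centers start at the first k points."""
    centers = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # assign each point to its nearest center
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2,
            )
        # recompute each center as the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return labels, centers

# Synthetic pixels: dim, stable background vs bright, steeply declining areas
pixels = [(1.0, -0.01), (1.2, -0.02), (0.9, 0.0),      # background
          (95.0, -0.5), (100.0, -0.45), (98.0, -0.55)]  # pixels of interest
labels, centers = kmeans_2d(pixels)
mask = [lab == labels[3] for lab in labels]  # True for the cluster of interest
print(mask)
```

The boolean mask is the shape of output described in the talk: something that can be laid back over the input image and plotted on a map.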
So going back a little to parallelization: as I mentioned, right now we’re not talking about a huge problem, but we’re starting to get into the realm of some decent size. We’re talking about six million individual regression models that need to be fit for a 10-degree-by-10-degree square. If we wanted to expand this model globally, we’re already talking about a week just to fit regression models, and it’s a simple linear regression. If we wanted to go back further in time, or look further into the future, now we’re really starting to talk about a large-scale problem. And this is also kind of an unusual use case. A lot of the current advancements in parallelizing machine learning focus on distributing tasks for a single large model. If you’re dealing with, again, something like computer vision, you’d break up that single training task into sub-parts and distribute those across your cluster. In this case, because each of our individual models is so small, we don’t want to break up the individual training tasks for each regression; we’d lose overall because of the overhead of coordinating between different workers. So what we did instead is package up the entire training task for each pixel, and distribute those tasks across the cluster. We used SciPy’s linear regression function and mapped it across an RDD of our pixel values, so that each whole training function is distributed across the cluster. Even though this is a bit of an unusual implementation, we’re still able to see significant benefits from doing this in parallel. For our simple 10-degree-by-10-degree square, what took 30 minutes on a laptop was down to five minutes running on a really small three-node cluster, one of those nodes being the master.
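The distribution pattern described here, shipping one whole per-pixel training task to a worker rather than splitting a single fit across workers, might be sketched as follows. The Spark call in the comment shows the general shape (`sc` is an assumed SparkContext); the local `map` is a stand-in that runs anywhere, and the fit itself is a plain least-squares substitute for SciPy's linregress.

```python
def fit_pixel(series):
    """The whole training task for one pixel: fit intercept and slope.
    Each task is small enough that shipping it to a worker intact beats
    splitting a single fit across workers and paying coordination overhead."""
    n = len(series)
    mean_x = (n - 1) / 2.0  # mean of day indices 0..n-1
    mean_y = sum(series) / n
    var_x = sum((x - mean_x) ** 2 for x in range(n))
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(range(n), series)) / var_x
    return mean_y - slope * mean_x, slope

# Local stand-in for the cluster: map the task over every pixel series.
# On Spark the same shape is roughly:
#   results = sc.parallelize(pixel_series).map(fit_pixel).collect()
pixel_series = [[10.0 - 0.1 * d for d in range(152)],   # declining pixel
                [50.0 for _ in range(152)]]             # stable pixel
results = list(map(fit_pixel, pixel_series))
print(results)  # one (intercept, slope) pair per pixel
```

Because `fit_pixel` is a pure function of one pixel's series, the six million fits are embarrassingly parallel, which is why even a small cluster cuts the runtime so sharply.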
So you can imagine, as we start to expand this out globally, or over a longer time span, we’re going to see significant benefits from running this in Databricks on a managed cluster, where we can take advantage of those optimized runtimes and scale. When you’re dealing with something like the coronavirus pandemic, where you need time-sensitive analysis to drive policy responses, being able to be responsive and iterate quickly on this problem set is really crucial. So operating at a global scale really requires this parallelization, even in a fairly simple modeling case.
So, getting to our results, we had some pretty promising outputs from our simple modeling efforts. On the left-hand side here (we’ll focus on this one first), this black box is from that pixel mask that we applied to the image. It was identified as one of those pixels in that quadrant of interest, with a high initial output radiance and a really steep decline over the past six months. In this case, the location on the ground that we’re mapping to is the Paulsboro refinery in New Jersey. This is exactly the type of thing that we’re looking for, something we expect to have been heavily impacted by quarantine restrictions over the past six months: a heavy industry area where people aren’t necessarily able to come to work. Now, being a refinery, we also had a massive crash in oil prices during that period, so there may be some intervening socioeconomic variables here, but it’s certainly the type of thing that we expect to find from this analysis. On the right-hand side, we’re dealing with some of those entertainment complexes. This is Atlantic City, a major resort area in the Northeast. The cluster of pixels at the top is the Borgata Casino, a major casino complex, where normally you’ll have a lot of light output from the hotel and lights covering parking lots, and there’s a small marina nearby that serves the casino. So you have a lot of things that would normally produce high-intensity output. Over the past couple of months, that’s been closed, so you’ve seen a significant reduction. The picture here on the bottom is from March 20th, where the entire hotel is dark.
You have another resort area down in the bottom-left, the Tropicana Resort, and on the bottom-right you have the Steel Pier, another major entertainment area with things like fairground equipment that would normally put out a lot of light, but have had significantly reduced output throughout the pandemic and the related closures. As we looked across the totality of our outputs, we saw a lot of similar things: heavy industrial areas, downtown business districts, entertainment complexes, ski resorts. These are the types of areas that have been identified by our model, using an entirely unsupervised approach. This is one small area, but it definitely has some interesting applications if we were to scale it globally. And there are a couple of things that we’re looking to do in the future. One would be taking advantage of how far back the data is available to get a better baseline. We could go back to 2019 or 2018 to see what’s normal in the seasonal variations of this data over time, so we can get a better idea of what’s different throughout this pandemic and what areas have been more heavily impacted than they otherwise would have been. We’ll have to control for things like new construction during those times, but it should give us an interesting baseline to compare against globally. A second would be extending this into the future and getting some visibility into how areas are starting to recover. In some cases, like the United States or other Western countries, we have a pretty good idea of some of the expected recoveries, and we’ll have a lot of ability to monitor those through media. But being able to apply this on a global scale would give us much better coverage of the way the world as a whole is responding, including areas that we might not have as good access to.
So we can see how the world is recovering from the pandemic, and monitor for things like spikes in cases, or shutdowns returning in areas where they had previously been lifted, on a global scale.
– Well, thank you, Michael. I appreciate the deep dive there. Just in closing, I wanted to wrap up by saying geospatial AI is a core component of Booz Allen’s AI and machine-learning offerings. We do strategic and technical advisory; we do design and implementation; and we’re focused on MLOps, operationalizing these models once they’ve been trained. We’re also partnered with the NVIDIA Deep Learning Institute to deliver both technical and non-technical AI training. So, if you’re interested in any of these topics, please feel free to reach out to Michael and me.
And here’s our LinkedIn information. So please feel free to shoot us an email or link up with us on LinkedIn. Thank you all for attending the talk today as we transition to the questions.
Donald Polaski is a Chief Technologist at Booz Allen Hamilton leading the development of Artificial Intelligence (AI) solutions. As the senior data scientist on a Department of Defense contract, he has helped build an enterprise cloud-based data science platform and is currently driving the development of new AI-based tradecraft and analytic solutions across the agency.