Dan Corbiani

Operations Researcher and Data Scientist, Pacific Northwest National Lab

Dan has a diverse background that started in materials science and has always revolved around computing. He can usually be found at the intersection of data, visualization, and computational modeling. In the past he has helped model the energy use of millions of buildings, automated flood modeling to find which parts of the nation are at risk of flooding, and performed geospatial analysis and joins for all buildings in the US.



Geospatial Options in Apache Spark

Summit 2020

Geospatial data appears simple right up until it becomes intractable. There are many gotcha moments with geospatial data in Spark, and we will break those down in this talk. Users who are new to geospatial analysis in Spark will find this portion useful, as projections, geometry types, indices, and geometry storage can all cause issues. We will begin by briefly discussing the basics of geospatial data and why it can be so challenging, in the context of how geospatial data can cause scaling problems in Spark. Critically, we will show how we have approached these issues to limit errors and reduce cost. There are many geospatial packages available for Spark. We have tried many of them and will discuss the pros and cons of each, using common examples across libraries. New users will benefit from this discussion, as each library has advantages in specific scenarios. Lastly, we will discuss how we migrate geospatial data, including our best practices for ingesting it and how we store it for long-term use. Users may be specifically interested in our evaluation of spatial indexing for rapid retrieval of records.
