Dan Corbiani is a Data Scientist and Solutions Architect who designs, develops, and deploys analytic solutions for research programs. His primary focus is large-scale geospatial processing in Spark, particularly for vector-based datasets such as critical infrastructure assets or entity paths. Dan has been working to implement distributed geospatial algorithms for pattern-of-life analysis and disaster response, and has implemented common geospatial algorithms such as DBSCAN and Getis-Ord Gi* within the Spark framework. He has a long history in software development, with a few tangents into Materials Science and Systems Engineering. This background has allowed him to understand both the requirements of researchers and the implementation options available in the cloud.
Geospatial data appears simple right up until it becomes intractable. There are many gotchas with geospatial data in Spark, and we will break them down in this talk. We will begin with the basics of geospatial data and why it can be so challenging. This portion will be brief and framed around how geospatial data can cause scaling problems in Spark; users who are new to geospatial analysis in Spark will find it useful, as projections, geometry types, indices, and geometry storage can all cause issues. Critically, we will show how we have approached these issues to limit errors and reduce cost. There are many geospatial packages available for Spark. We have tried many of them and will discuss the pros and cons of each, using common examples across libraries. New users will benefit from this discussion, as each library has advantages in specific scenarios. Lastly, we will discuss how we migrate geospatial data, including our best practices for ingesting it and how we store it for long-term use. Users may be particularly interested in our evaluation of spatial indexing for rapid retrieval of records.
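As a small illustration of why projections are listed among the gotchas, the sketch below compares a naive distance computed directly on longitude/latitude degrees with a great-circle distance in meters. It is not taken from the talk: the function names `haversine_m` and `naive_degree_distance` are hypothetical, the Earth is assumed to be a sphere with its mean radius, and plain Python is used rather than any particular Spark geospatial library.

```python
import math

# Assumption: spherical Earth with the mean radius, in meters.
EARTH_RADIUS_M = 6_371_008.8


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def naive_degree_distance(lon1, lat1, lon2, lat2):
    """Euclidean distance on raw degrees, which ignores the projection entirely."""
    return math.hypot(lon2 - lon1, lat2 - lat1)


# Two point pairs, each one degree of longitude apart: (lon1, lat1, lon2, lat2).
equator = (0.0, 0.0, 1.0, 0.0)
high_lat = (0.0, 60.0, 1.0, 60.0)

naive_equator = naive_degree_distance(*equator)
naive_high = naive_degree_distance(*high_lat)
meters_equator = haversine_m(*equator)
meters_high = haversine_m(*high_lat)

print(f"naive (degrees): equator={naive_equator:.3f}, 60N={naive_high:.3f}")
print(f"great-circle (m): equator={meters_equator:.0f}, 60N={meters_high:.0f}")
```

The naive calculation reports the same value for both pairs, while the ground distance at 60 degrees north is roughly half of that at the equator. A distance threshold expressed in degrees (for example, a clustering radius) therefore means different things at different latitudes unless the data is first projected to a suitable coordinate system.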