GeoTrellis: Adding Geospatial Capabilities to Spark

GeoTrellis is an Apache 2.0-licensed, pure-Scala open-source project for geospatial processing at both web scale and cluster scale. The web-scale use case [1, 2] is fairly mature: it lets analysts run operations with sub-second response times, ranging from simple raster math (*, +, /, etc.) to fairly sophisticated raster and vector operations like Kernel Density and Cost Distance. The cluster-scale use case is a newer effort to port GeoTrellis's rich library of algorithms to Spark, allowing both batch and interactive processing of large geospatial datasets such as satellite imagery. The effort is a collaboration between Azavea, the creators of GeoTrellis, and DigitalGlobe, a satellite imagery provider that had built a Hadoop MapReduce-based geospatial library [3] and recognized early (back in Spark 0.6.0) Spark's potential as the platform for running geospatial analytics at scale.
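Raster math of this kind is "local": an operation applies cell by cell across tiles, which is what makes it both fast at web scale and easy to parallelize per tile on a cluster. The following minimal sketch illustrates the idea; the `Tile` type and `localCombine` helper are hypothetical stand-ins, not GeoTrellis's actual API, and NDVI is used here only as a familiar example of a per-cell operation.

```scala
// Hypothetical sketch of local raster math (not GeoTrellis's real API).
// A tile is modeled as a flat array of cell values.
object LocalRasterMath {
  type Tile = Array[Double]

  // Combine two tiles covering the same grid, cell by cell.
  def localCombine(a: Tile, b: Tile)(op: (Double, Double) => Double): Tile = {
    require(a.length == b.length, "tiles must cover the same grid")
    Array.tabulate(a.length)(i => op(a(i), b(i)))
  }

  // NDVI = (nir - red) / (nir + red): a classic per-cell raster operation
  // built from the same *, +, / primitives mentioned above.
  def ndvi(red: Tile, nir: Tile): Tile =
    localCombine(nir, red) { (n, r) =>
      if (n + r == 0.0) 0.0 else (n - r) / (n + r)
    }
}
```

In a cluster setting, each RDD element would hold one tile keyed by its grid position, so a local operation becomes a `map` (one input layer) or a `join` followed by `localCombine` (two layers).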

To the best of our knowledge, GeoTrellis is one of the first, if not the only, geospatial libraries on Spark. Since the project is in its nascent stages, we would like to present the work we have done so far and get feedback, and hopefully contributions, from the vibrant Spark community. The talk will focus on three aspects of GeoTrellis on Spark: i) how data is stored in HDFS, ii) how geospatial algorithms are modeled as RDD operations, including our use of advanced Spark features such as custom partitioners, and iii) performance numbers for both the web-scale and cluster-scale use cases.
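One common motivation for a custom partitioner over tiled rasters is spatial locality: hashing (col, row) keys scatters neighboring tiles across the cluster, whereas a space-filling curve keeps them together. The sketch below shows the core of that idea using a Z-order (Morton) index; it is a hypothetical illustration, not GeoTrellis's actual partitioner. In real Spark code the `partitionFor` logic would live in a subclass of `org.apache.spark.Partitioner`, omitted here to keep the example dependency-free.

```scala
// Hypothetical sketch: map a tile's (col, row) key to a partition via a
// Z-order (Morton) index, so spatially adjacent tiles tend to share partitions.
object ZOrderPartitioning {
  // Spread the low 16 bits of x so a zero bit sits between each bit.
  private def spreadBits(x: Int): Long = {
    var v = (x & 0xFFFF).toLong
    v = (v | (v << 8)) & 0x00FF00FFL
    v = (v | (v << 4)) & 0x0F0F0F0FL
    v = (v | (v << 2)) & 0x33333333L
    v = (v | (v << 1)) & 0x55555555L
    v
  }

  // Interleave column bits (even positions) with row bits (odd positions).
  def zIndex(col: Int, row: Int): Long =
    spreadBits(col) | (spreadBits(row) << 1)

  // In Spark, this would be Partitioner.getPartition on the tile key.
  def partitionFor(col: Int, row: Int, numPartitions: Int): Int =
    (zIndex(col, row) % numPartitions).toInt
}
```

Because consecutive Z-index values trace a locality-preserving curve over the grid, focal operations (which need a tile's neighbors) shuffle less data than they would under hash partitioning.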