We’ve been writing Spark libraries at Uncharted Software for several years, and continuous integration has always been difficult to achieve with the Spark runtime. Common approaches, such as creating a Spark context within a standard Scala runtime, fail to accurately emulate the nuances of a distributed Spark environment. This talk focuses on an architecture I implemented for executing tests within a real Spark runtime on Travis CI and for visualizing code coverage with Coveralls. This approach is critical to our transition from using Spark as a scripting and data science environment to relying on it as a component of a production architecture, on which we can build software that avoids regressions and functions reliably across multiple versions of Spark. The overall test architecture and code examples for each component will be presented in detail, with time left for discussion and questions.
Sean McIntyre is a software architect from Toronto, Canada, who currently leads several open source projects at Uncharted Software focused on the analysis and visualization of large datasets. Most recently, his work has centered on using Apache Spark as a dynamic, distributed execution engine behind large enterprise applications that require scalable ad-hoc access to big data.