Over the last year, we introduced Spark at several of our Dutch clients, including an airline, a media corporation, and a global online retailer. In this talk, we detail how we made a compelling case for Spark as the framework of choice, but also how we worked around some of its current limitations.

The biggest advantage of Spark for our customers was the short time (6-8 weeks) needed to move from initial analyses to the A/B testing phase, quickly showing that deployments of our solutions could lead to concrete, measurable value. Several other success factors helped ensure Spark's adoption. Most important was the ability to use existing R or Python/Pandas skills, such as working with dataframes. In addition, we could adapt prototypes in either language to scale out beyond the analysts' laptops using PySpark and SparkR. And for clients who had already invested in Hadoop infrastructure, another benefit was that we could immediately use the data sets available on the platform, without additional investments beyond installing and configuring Spark.

Of course, there were also some challenges. For one, the current MLlib implementation is not as complete as, for example, CRAN, which limits the choices for deployments based on MLlib. At one point we needed to implement a custom cosine similarity metric for an item-item collaborative filtering solution, which we intend to publish as a package. We also noticed that configuring a cluster with Spark is not always trivial, which may result in slowly progressing jobs or frequent out-of-memory errors.

In summary, Spark stands out as a scalable solution for data scientists and engineers with diverse skills, who can quickly pick up the basic concepts and start being productive. And given its fast pace of development, we are confident that this will remain the case for some time to come!
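To give a flavour of the kind of metric involved: the core of item-item collaborative filtering with cosine similarity can be sketched in a few lines of NumPy. This is a minimal, dense, single-machine illustration only; the talk's actual implementation runs on Spark over distributed data, and the function and variable names here are our own, not from any published package.

```python
import numpy as np

def item_item_cosine(ratings):
    """Item-item cosine similarity for a dense user-item matrix.

    ratings: 2-D array with rows = users, columns = items.
    Returns an (n_items, n_items) matrix where entry (i, j) is the
    cosine of the angle between the rating vectors of items i and j.
    """
    # L2 norm of each item's column of ratings; guard against
    # all-zero columns (items nobody rated) to avoid division by zero.
    norms = np.linalg.norm(ratings, axis=0)
    norms[norms == 0] = 1.0
    # Normalize each column to unit length, then a single matrix
    # product yields all pairwise cosine similarities at once.
    normalized = ratings / norms
    return normalized.T @ normalized

# Toy example: 3 users rating 3 items (0 = not rated).
R = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 0.0],
    [1.0, 1.0, 5.0],
])
sims = item_item_cosine(R)
```

The resulting matrix is symmetric with ones on the diagonal; in a recommender, the top-scoring off-diagonal entries for an item identify its nearest neighbours.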
Renald is GoDataDriven's COO, focused on developing a fantastic team and business proposition. Renald studied Computer Science, focusing on ML and NLP. After that, he spent over a decade as a researcher at Leiden University, where he obtained a PhD. He has been active in Big Data and Data Science since 2010, in both government and commercial consulting positions. While studying in Twente, it became clear that software was his future when, during a laboratory class, he blew up a transistor in just under a minute, only to discover this after more than thirty minutes of futile measuring attempts.