We’re all here because we understand the potential of Spark for heavyweight distributed processing. But how do you migrate an 8-year-old, single-server, MySQL-based legacy system to such a shiny new framework? How do you accurately preserve the behavior of a system that consumes gigabytes of data every day, hides numerous undocumented implicit gotchas, and changes constantly, while shifting to brand-new development paradigms? In this talk I’ll present Kenshoo’s attempt at this challenge, where we migrated a legacy aggregation system to Spark. Our solutions include heavy usage of metrics and Graphite for analyzing production data; a “local-mode” client enabling reuse of legacy test suites; data validation using side-by-side execution; and maximum reuse of code through refactoring and composition. Some of these solutions use Spark-specific characteristics and features. This talk is especially useful for developers and managers trying to understand how to “make the jump” into the distributed world with their existing systems, without breaking their users’ data. We believe that the set of dilemmas and solutions we’ve encountered is common to many legacy production systems, and therefore might help mitigate some of the risk of making such a change, eventually getting more people to join Spark’s user community.
Architect and developer for 10 years. Joined the Kenshoo team at the very beginning, playing various roles (developer, dev team lead, chief architect, and architect); nowadays focused on scaling out our enterprise solutions, from whiteboard brainstorming to hands-on coding. Mostly Java-oriented, but shifting to Scala and loving it.
Lead Software Developer with 6 years of experience in the IDF and over 2 years at Kenshoo. I do most of my coding in Java and Scala, but I’d take Scala any day of the week. I’d love to share my two cents from our experience with Spark.