Divide and you will conquer Apache Spark. It is quite common to see a papyrus-length script that initializes Spark, reads paths, executes all the logic, and writes the result. We have even found scripts where all the Spark transformations are done in a single method with tons of lines. Such code is difficult to test, to maintain, and to read. In short, it is bad code.
We built a set of tools and libraries that allows developers to build their pipelines by joining Pieces together. These pieces comprise Readers, Writers, Transformers, Aliases, etc. Moreover, it comes with enriched SparkSuites built on spark-testing-base from Holden Karau. Recently, we started using junit4git in our tests, which lets us execute only the Spark tests that matter by skipping tests that are not affected by the latest code changes.
This translates into faster builds and fewer coffees. By allowing developers to define each piece on its own, we make it possible to test small pieces before the full set of them is assembled. It also allows code to be reused in multiple pipelines, which speeds up their development by improving the quality of the code. The “transform” method combined with currying creates a powerful tool for fragmenting all the Spark logic into small pieces.
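As a rough illustration of the transform-plus-currying idea, here is a minimal sketch in plain Python. It is not the library described in this talk: a list of dicts stands in for a Spark DataFrame so that the example runs without a Spark session, and every name in it (`transform`, `with_min_age`, `with_default_country`) is hypothetical. In Spark itself, each piece would take and return a DataFrame and be chained through the DataFrame's own `transform` method.

```python
from functools import reduce
from typing import Any, Callable, Dict, List

# Stand-in types (assumption): a "frame" is just a list of row dicts here.
Row = Dict[str, Any]
Rows = List[Row]
Transformer = Callable[[Rows], Rows]


def transform(rows: Rows, *steps: Transformer) -> Rows:
    """Apply each step in order, similar to chaining df.transform(step) calls."""
    return reduce(lambda acc, step: step(acc), steps, rows)


def with_min_age(min_age: int) -> Transformer:
    """Curried piece: fix the parameter first, get back a Rows -> Rows function."""
    def apply(rows: Rows) -> Rows:
        return [r for r in rows if r["age"] >= min_age]
    return apply


def with_default_country(default: str) -> Transformer:
    """Curried piece: fill in a missing or empty 'country' field."""
    def apply(rows: Rows) -> Rows:
        return [{**r, "country": r.get("country") or default} for r in rows]
    return apply


users = [
    {"name": "ada", "age": 36, "country": "UK"},
    {"name": "bob", "age": 12},
    {"name": "eve", "age": 20},
]

# The pipeline is only the composition of small pieces.
adults = transform(users, with_min_age(18), with_default_country("unknown"))
```

Because each piece is an ordinary function from rows to rows, it can be unit tested with a three-row input, long before the reader, the writer, and the rest of the pipeline exist.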
This talk is aimed at developers who are new to the Spark world. It shows how developing iteration by iteration, in small steps, can help them produce great code with less effort.
Albert Franzi is a Software Engineer who fell so in love with data that he ended up as Data Engineer Lead in the Data Platform Team at Typeform. He believes in a world where distributed organizations can work together to build common and reusable tools to empower their users and projects. Albert deeply cares about unified data and models as well as data quality and enrichment.
He also has a secret plan to conquer the world with data, insights, and penguins.