Egor is a Spark contributor and Senior Software Engineer at Airbnb, where he works on infrastructure to simplify creating and managing Spark pipelines. Before joining Airbnb, he worked at Apple on configurable, high-load streaming and batch pipelines. Egor also led the engineering team at Anchorfree responsible for a data solution on top of Hadoop. This solution included an in-house DSL for defining DAGs of Spark jobs, Apache Zeppelin, Impala, and Tableau. Egor has been working with Apache Spark since version 0.9.
June 24, 2020 05:00 PM PT
Apache Spark is a general-purpose big data execution engine. You can work with different data sources using the same set of APIs in both batch and streaming mode. Such flexibility is great if you are an experienced Spark developer solving a complicated data engineering problem, which might include ML or streaming. At Airbnb, 95% of all data pipelines are daily batch jobs that read from Hive tables and write to Hive tables. For such jobs, you would like to trade some flexibility for more extensive functionality around writing to Hive or orchestrating processing across multiple days. Another advantage of reducing flexibility is that it establishes 'best practices' that less experienced data engineers can follow.
At Airbnb we've created a framework called 'Sputnik', which tries to address these issues. Data engineers extend the Sputnik base class and write the data transformation code without worrying about filtering the dates for which the job runs. End users do not read from or write to Hive directly; they use Sputnik wrappers for Hive. The read wrapper filters input data based on parameters from the console, including the time frame. The write wrapper gets information about the result table from case class annotations, writes meta-information about the table, runs verifications on the data, and much more. The core idea of the framework is that all functionality of a job consists of job-specific logic and run-specific logic. Job-specific logic is the transformation defined by the data engineer plus meta-information about the tables. Run-specific logic is filtering input data based on the current date and writing data to Hive. The data engineer specifies only the job-specific logic, and Sputnik handles all run-specific logic based on assumptions about the right way to operate daily Hive batch jobs. https://github.com/airbnb/sputnik
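The split between job-specific and run-specific logic can be illustrated with a small sketch. This is not Sputnik's API: Sputnik is a Scala framework on top of Spark and Hive, while the example below is plain Python with an in-memory dictionary standing in for the warehouse. Every name in it (`DailyBatchJob`, `RunConfig`, `VisitsPerUser`, the `ds` partition column) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List

# A row is a plain dict here; a real job would use Spark DataFrames.
Row = Dict[str, object]


@dataclass
class RunConfig:
    """Run-specific parameters, e.g. parsed from the command line (hypothetical)."""
    start: date
    end: date


class DailyBatchJob:
    """Hypothetical base class that owns all run-specific logic."""

    # Job-specific metadata, supplied by subclasses.
    input_table: str = ""
    output_table: str = ""

    def transform(self, rows: List[Row]) -> List[Row]:
        """Job-specific logic: the only method a data engineer writes."""
        raise NotImplementedError

    def run(self, warehouse: Dict[str, List[Row]], config: RunConfig) -> None:
        # "Read wrapper": keep only rows whose date falls in the requested range.
        rows = [
            r for r in warehouse.get(self.input_table, [])
            if config.start <= r["ds"] <= config.end
        ]
        result = self.transform(rows)
        # "Write wrapper": run a simple check, then append to the output table.
        if any("ds" not in r for r in result):
            raise ValueError("every output row needs a 'ds' partition column")
        warehouse.setdefault(self.output_table, []).extend(result)


class VisitsPerUser(DailyBatchJob):
    """Example job: counts visits per user per day, with no date handling."""

    input_table = "visits"
    output_table = "visits_per_user"

    def transform(self, rows: List[Row]) -> List[Row]:
        counts: Dict[tuple, int] = {}
        for r in rows:
            key = (r["ds"], r["user"])
            counts[key] = counts.get(key, 0) + 1
        return [
            {"ds": ds, "user": user, "visits": n}
            for (ds, user), n in sorted(counts.items())
        ]
```

Calling `VisitsPerUser().run(warehouse, RunConfig(start, end))` processes only the requested days. The job class itself never mentions dates or the write path, which is the separation the abstract describes.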