Neelesh Salian is a Software Engineer on the Data Platform team at Stitch Fix, where he works on the compute infrastructure used by the company’s data scientists. He is helping build out the services that power the daily lives of Data Scientists.
May 27, 2021 03:50 PM PT
Apache Spark has been an integral part of Stitch Fix's compute infrastructure. Over the past five years, it has become our de facto standard for most ETL and heavy data processing needs and expanded our capabilities in the Data Warehouse.
Since all our writes to the Data Warehouse are through Apache Spark, we took advantage of that to add more modules that supplement ETL writing. Config driven and purposeful, these modules perform tasks onto a Spark Dataframe meant for a destination Hive table.
These are organized as a sequence of transformations on the Apache Spark dataframe prior to being written to the table.These include a process of journalizing. It is a process which helps maintain a non-duplicated historical record of mutable data associated with different parts of our business.
Data quality, another such module, is enabled on the fly using Apache Spark. Using Apache Spark we calculate metrics and have an adjacent service to help run quality tests for a table on the incoming data.
And finally, we cleanse data based on provided configurations, validate and write data into the warehouse. We have an internal versioning strategy in the Data Warehouse that allows us to know the difference between new and old data for a table.
Having these modules at the time of writing data allows cleaning, validation and testing of data prior to entering the Data Warehouse thus relieving us, programmatically, of most of the data problems. This talk focuses on ETL writing in Stitch Fix and describes these modules that help our Data Scientists on a daily basis.