Data scientist at Stitch Fix, mainly working on inventory analysis. Former data engineer at Shazam, building internal and external data products on massive amounts of user data.
Data scientists write SQL queries every day. Very often they know how to write correct queries but don't know why their queries are slow. This is more obvious in Spark than in Redshift, as Spark requires additional tuning such as caching, while Redshift does the heavy lifting behind the scenes.

In this talk I will cover a few lessons we learned from migrating one of the biggest tables here (900M+ rows/day) from AWS Redshift to Spark. Specifically:
- Why and how do we migrate?
- How do we tune queries for Spark to gain a 10x speedup over a direct translation from Redshift?
- How do we scale the team on Spark (with 80+ people in our data science team)?