Consumer apps like Yelp generate log data at huge scale, and this data often follows a power-law distribution, in which a small number of users, businesses, locations, or pages account for a disproportionately large share of the rows. This kind of skew causes problems for distributed algorithms, especially joins, where all the rows with the same key must be processed by the same executor. Even a single over-represented entity can slow down a whole job or make it fail. One approach is to remove outliers before joining, which might be fine when training a machine learning model, but sometimes you need to retain all the data. Thankfully, there are a few tricks you can use to counteract the negative effects of skew while joining, by artificially redistributing data across more machines. This talk will walk through some of them, with code examples.
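To give a flavour of what "artificially redistributing data" can mean, here is a minimal sketch of one widely used trick, key salting, in plain Python rather than a real distributed framework. It is an illustration under stated assumptions, not necessarily one of the methods the talk covers: `NUM_PARTITIONS`, `NUM_SALTS`, `partition_of`, `plain_join`, and `salted_join` are all hypothetical names, and the CRC32-based partitioner merely stands in for a real shuffle.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical number of executors
NUM_SALTS = 4       # hypothetical number of salt values per key


def partition_of(key):
    """Stable hash partitioner, standing in for a distributed shuffle."""
    return zlib.crc32(repr(key).encode("utf-8")) % NUM_PARTITIONS


def plain_join(big, small):
    """Ordinary inner join on the key, used here as a reference result."""
    index = {}
    for key, value in small:
        index.setdefault(key, []).append(value)
    return [(k, bv, sv) for k, bv in big for sv in index.get(k, [])]


def salted_join(big, small, num_salts=NUM_SALTS, seed=0):
    """Inner join that spreads each key over `num_salts` sub-keys."""
    rng = random.Random(seed)
    # Skewed side: attach a random salt to every row's key.
    salted_big = [((k, rng.randrange(num_salts)), v) for k, v in big]
    # Other side: replicate every row once per possible salt value,
    # so each salted key still finds its match.
    index = {}
    for k, v in small:
        for salt in range(num_salts):
            index.setdefault((k, salt), []).append(v)
    joined = [
        (k, bv, sv)
        for (k, salt), bv in salted_big
        for sv in index.get((k, salt), [])
    ]
    # Rows per partition if we shuffled on the salted key.
    load = Counter(partition_of(sk) for sk, _ in salted_big)
    return joined, load


# Made-up skewed data: one "hot" key holds 90% of the rows.
big = [("hot", i) for i in range(900)] + [("cold%d" % j, j) for j in range(100)]
small = [("hot", "H")] + [("cold%d" % j, "C") for j in range(100)]

plain_load = Counter(partition_of(k) for k, _ in big)
joined, salted_load = salted_join(big, small)
```

Comparing `plain_load` with `salted_load` shows how the hot key's rows are split across several salted keys instead of all landing on one partition. The price is that the other side of the join is replicated `num_salts` times, so this sketch assumes that side is comparatively small.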
I'm a data scientist and machine learning engineer based in London. I work on Yelp's Core Machine Learning team, building tooling and infrastructure to support data mining teams around the company. Previously I've worked at Etsy, Pearson, Last.fm, AstraZeneca, and the University of London. My academic background is in bioinformatics and natural language processing.