Learning over images and understanding the quality of content play an important role at Pinterest. This talk will present a Spark-based system responsible for detecting near (and far) duplicate images. The system is used to improve the accuracy of recommendations and search results across a number of production surfaces at Pinterest.
At the core of the pipeline is a Spark implementation of batch LSH (locality-sensitive hashing) search capable of comparing billions of items on a daily basis. This implementation replaced an older (MR/Solr/OpenCV) system, increasing throughput by 13x and decreasing runtime by 8x. A generalized Spark Batch LSH is now used outside of the image similarity context by a number of consumers. Inverted index compression techniques such as variable-byte encoding, dictionary encoding, and primitive packing are some of the optimizations that allow this implementation to scale. The second part of this talk will detail the training of a TensorFlow neural net and its integration with Spark, where it is used in the candidate selection step of the system. By directly leveraging vectorization in a Spark context, we can reduce the latency of predictions and increase throughput.
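To make the inverted index compression idea concrete, here is a minimal sketch of variable-byte encoding applied to a sorted posting list. It is an illustration under stated assumptions, not the Pinterest implementation: the abstract does not describe the byte layout, so the delta (gap) step, the convention that a set high bit means "more bytes follow", and the names `vbyte_encode` and `vbyte_decode` are all hypothetical choices made for this example.

```python
from typing import Iterable, List


def vbyte_encode(doc_ids: Iterable[int]) -> bytes:
    """Delta-encode a sorted posting list, then variable-byte encode each gap.

    Assumed layout (not taken from the talk): 7 payload bits per byte,
    low-order group first, high bit set when more bytes follow.
    """
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        if gap < 0:
            raise ValueError("doc_ids must be non-negative and sorted ascending")
        prev = doc_id
        # Emit full 7-bit groups with the continuation bit set.
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        # Final group: high bit clear marks the end of this integer.
        out.append(gap)
    return bytes(out)


def vbyte_decode(data: bytes) -> List[int]:
    """Invert vbyte_encode: rebuild gaps, then prefix-sum back to doc ids."""
    doc_ids: List[int] = []
    prev = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            # Continuation bit set: the next byte holds higher-order bits.
            shift += 7
        else:
            prev += gap
            doc_ids.append(prev)
            gap = 0
            shift = 0
    return doc_ids


postings = [3, 130, 131, 20000]
encoded = vbyte_encode(postings)
decoded = vbyte_decode(encoded)
```

Because small gaps fit in a single byte, a dense posting list takes far less space than fixed-width 4-byte or 8-byte integers; here the four ids occupy 6 bytes instead of 16.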
Overall, this talk will cover a scalable Spark image processing and prediction pipeline.
Session hashtag: #AISAIS10
Andrey is an Engineer on the Content team at Pinterest, focusing on modeling and infrastructure. Prior to that, Andrey was ML Tech Lead at Sift Science, building real-time fraud prediction. Before Sift, Andrey was a Lead Engineer at salesforce.com, working on search and machine learning systems. Andrey enjoys machine learning, search, NLP, and distributed systems, and holds CS degrees from Stanford University and the University of Illinois.