Locality Sensitive Hashing By Spark - Databricks

Locality Sensitive Hashing (LSH) is a randomized algorithm for solving the near neighbor search problem in high-dimensional spaces. LSH has many applications in areas such as machine learning and information retrieval. In this talk, we will discuss why and how we use LSH at Uber, then dive into the technical details of our LSH implementation. Our LSH library is designed and implemented to optimize performance on Spark, and it supports pluggable distance functions: Jaccard, Cosine, Hamming, and Euclidean distance functions are included out of the box. It also supports approximate near neighbor searches and self-similarity joins. We will also share performance benchmarks and our experience running LSH on Spark in production clusters.
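To make the idea of a self-similarity join concrete, here is a minimal single-machine sketch in plain Python. It is not the library described in the talk and does not use Spark. It assumes MinHash signatures with banding as the hash family for Jaccard distance, and every name in it (`make_hash_params`, `minhash_signature`, `similarity_join`, the `bands` parameter) is hypothetical and chosen for illustration only.

```python
import random
from collections import defaultdict
from itertools import combinations

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1, used as the hash modulus


def make_hash_params(num_hashes, seed=42):
    """Draw (a, b) coefficients for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, params):
    """MinHash signature of a non-empty set of non-negative integers."""
    return tuple(min((a * x + b) % PRIME for x in items) for a, b in params)


def jaccard_distance(s, t):
    """Exact Jaccard distance: 1 - |s & t| / |s | t|."""
    return 1.0 - len(s & t) / len(s | t)


def similarity_join(sets, params, bands, max_distance):
    """Return sorted key pairs whose exact Jaccard distance is <= max_distance.

    Candidate pairs are those sharing at least one signature band; each
    candidate is then verified with the exact distance, so the result has
    no false positives but may miss some true pairs (it is approximate).
    """
    rows = len(params) // bands
    buckets = defaultdict(set)
    for key, items in sets.items():
        sig = minhash_signature(items, params)
        for b in range(bands):
            band = sig[b * rows:(b + 1) * rows]
            buckets[(b, band)].add(key)

    candidates = set()
    for keys in buckets.values():
        candidates.update(combinations(sorted(keys), 2))

    return sorted(
        (a, b) for a, b in candidates
        if jaccard_distance(sets[a], sets[b]) <= max_distance
    )


if __name__ == "__main__":
    data = {
        "a": {1, 2, 3, 4},
        "b": {100, 200, 300},
        "c": {1, 2, 3, 4},
    }
    params = make_hash_params(num_hashes=20)
    print(similarity_join(data, params, bands=5, max_distance=0.5))
```

Identical sets always produce identical signatures, so they collide in every band, while dissimilar sets are dropped by the exact distance check even if they happen to share a bucket. A distributed version would typically group records by band key across the cluster instead of using an in-memory dictionary, and other distances (Cosine, Hamming, Euclidean) would need their own hash families.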

Learn more:

  • Detecting Abuse at Scale: Locality Sensitive Hashing at Uber Engineering