Chen Jin - Databricks

Chen Jin

Software Engineer, Uber

Chen has been a software engineer at Uber since 2015, where she has built several active services (such as streamio and the eater-surge service). Prior to joining Uber, she worked at Palantir on the BigTable product, an interactive large-scale data analytics platform used by several influential clients at the time. Before that, she spent four years in the PhD program at Northwestern University, where her research focused on scaling data mining for big data.


A Scalable Hierarchical Clustering Algorithm Using Spark

Clustering is often an essential first step in data mining, intended to reduce redundancy or define data categories. Hierarchical clustering, a widely used clustering technique, can offer a richer representation by suggesting potential group structures. However, parallelizing such an algorithm is challenging because the construction of the hierarchical tree exhibits inherent data dependency. In this paper, we design a parallel implementation of single-linkage hierarchical clustering by formulating it as a Minimum Spanning Tree problem. We further show that Spark is a natural fit for parallelizing the single-linkage clustering algorithm because it expresses iterative processes naturally. Our algorithm can be deployed easily in Amazon's cloud environment, and a thorough performance evaluation on Amazon's EC2 verifies that its scalability is sustained as the datasets scale up.
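
The link between single-linkage clustering and the Minimum Spanning Tree can be illustrated with a small single-machine sketch. This is not the parallel Spark implementation the abstract describes; it is a plain-Python illustration that assumes Euclidean distance and uses Kruskal's algorithm with a union-find structure. The names DisjointSet, mst_merges, and cut_into_clusters are hypothetical and chosen only for this example.

```python
import math
from itertools import combinations


class DisjointSet:
    """Union-find over point indices 0..n-1 (illustrative helper)."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Walk to the root, halving the path as we go.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        # Merge the sets containing a and b; return False if already merged.
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[rb] = ra
        return True


def mst_merges(points):
    """Return MST edges as (distance, i, j), in ascending distance order.

    Read in order, these edges are the merges of the single-linkage
    dendrogram: each edge joins the two closest remaining clusters.
    """
    n = len(points)
    # Assumption: Euclidean distance over all pairs (O(n^2) edges).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    sets = DisjointSet(n)
    return [(d, i, j) for d, i, j in edges if sets.union(i, j)]


def cut_into_clusters(points, k):
    """Cut the dendrogram into k clusters by applying the first n - k merges."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    sets = DisjointSet(n)
    for _, i, j in mst_merges(points)[: n - k]:
        sets.union(i, j)
    groups = {}
    for idx in range(n):
        groups.setdefault(sets.find(idx), []).append(idx)
    return sorted(groups.values())
```

Dropping the k - 1 heaviest MST edges is equivalent to stopping the merge sequence early, which is why cut_into_clusters applies only the first n - k merges. The sketch materializes all pairwise distances, so it is suitable only for small inputs; avoiding that cost at scale is the kind of problem a distributed formulation has to address.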

How Apache Spark and AI Powers UberEats

The overall relevance and health of the UberEATS marketplace is critical to making it, and keeping it, an everyday product for Uber's users. In this session, Uber will share a few key design choices it made, such as how Apache Spark and AI are leveraged as an integral part of its production system to improve both the relevance and the reliability of its recommender system and services. The speakers will first dive into a few concrete use cases and lessons learned from building AI algorithms with Spark to improve the relevance of UberEats, such as how a multi-objective optimization framework is deployed with the recommender system to find a tradeoff between different business metrics. In addition, maintaining the marketplace's health is imperative for Uber to provide reliable service. In the second part of the talk, they will discuss a dynamic pricing framework designed to balance demand and supply in real time. In this framework, Spark Streaming generates real-time features for geospatial-temporal demand and supply forecasting models, which lets Uber proactively make pricing decisions to optimize market efficiency. Session hashtag: #SFds4