Qing Zhang is a member of the eBay Structured Data Research team. Her field is natural language processing and machine learning on large-scale user-generated data in online marketplaces. Prior to eBay, she was a research assistant in the Quantitative Health Science Department at the University of Massachusetts Medical School, where she worked extensively on link prediction in both social networks and protein-protein interaction networks. She received her PhD in computer science from the University of Wisconsin-Milwaukee in 2014, majoring in machine learning and complex graph analysis.
We report our work on metadata discovery systems on the eBay Structured Data team. An important mission of our team is to dynamically discover new metadata, such as brand, style, and model, from listings. These data are crucial for our site: they are used to enhance functionality such as search and filtering, and to meet internal data management requirements. We use supervised machine learning to identify key metadata, such as brand. Six million new listings are created on eBay every day, so flexibility and scalability are extremely important in addition to precision and recall. Our current machine learning workflow has become increasingly insufficient. First, the turnaround time between prototyping (Python and R scripts on small data) and production (Java) is slow. Second, the current production platforms support very few ML algorithms, and new algorithm implementations are often ad hoc. eBay's new machine learning system is built entirely on Spark, which provides great productivity and scalability. The system consists of three components: feature generation, training, and prediction. The feature generation component preprocesses the raw data and transforms them into vectors. Models are then trained and evaluated by the training component, which is fully implemented as a spark.ml pipeline and uses cross-validation and grid search to find the best model. The prediction component then uses the model to identify new valid metadata, processing 400 million entries per minute. Prototyping is integrated seamlessly into this pipeline, since we can simply try out different algorithms and hyperparameters. There are several takeaways. First, Spark machine learning provides a rich selection of algorithms as well as pipeline support for parameter tuning. Second, prototyping and production development become a single system with minimal external scripting.
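
The three stages described above (feature generation, training with cross-validation and grid search, prediction) can be illustrated at toy scale without Spark. The following is a minimal plain-Python sketch, not the eBay system: the listing titles, labels, hashed bag-of-words features, naive Bayes model, and grid values are all assumptions made up for illustration. In spark.ml, the roles of `grid_search` and `cross_val_score` are played by `ParamGridBuilder` and `CrossValidator` wrapped around a `Pipeline`.

```python
import math
import zlib
from collections import Counter, defaultdict
from itertools import product


def hash_features(title, num_buckets):
    """Feature generation: hash each token of a title into a sparse count vector."""
    counts = Counter()
    for token in title.lower().split():
        # crc32 is used instead of hash() so bucket ids are stable across runs.
        counts[zlib.crc32(token.encode("utf-8")) % num_buckets] += 1
    return counts


def fit_nb(titles, labels, num_buckets, alpha):
    """Training: collect per-class bucket counts for a multinomial naive Bayes model."""
    bucket_counts = defaultdict(Counter)
    for title, label in zip(titles, labels):
        bucket_counts[label].update(hash_features(title, num_buckets))
    return {"num_buckets": num_buckets, "alpha": alpha,
            "class_counts": Counter(labels), "bucket_counts": bucket_counts}


def predict_nb(model, title):
    """Prediction: return the class with the highest smoothed log-probability."""
    feats = hash_features(title, model["num_buckets"])
    total = sum(model["class_counts"].values())
    best_label, best_score = None, -math.inf
    for label in sorted(model["class_counts"]):
        counts = model["bucket_counts"][label]
        denom = sum(counts.values()) + model["alpha"] * model["num_buckets"]
        score = math.log(model["class_counts"][label] / total)
        for bucket, n in feats.items():
            score += n * math.log((counts[bucket] + model["alpha"]) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def k_fold_indices(n, k):
    """Assign item i to fold i % k, giving k disjoint folds that cover range(n)."""
    return [[i for i in range(n) if i % k == f] for f in range(k)]


def cross_val_score(fit, predict, rows, labels, params, k=3):
    """Mean held-out accuracy over k folds for one hyperparameter setting."""
    scores = []
    for held_out in k_fold_indices(len(rows), k):
        held = set(held_out)
        train_x = [rows[i] for i in range(len(rows)) if i not in held]
        train_y = [labels[i] for i in range(len(rows)) if i not in held]
        model = fit(train_x, train_y, **params)
        correct = sum(predict(model, rows[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def grid_search(fit, predict, rows, labels, grid, k=3):
    """Try every combination in the grid; return (best_score, best_params)."""
    names = sorted(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cross_val_score(fit, predict, rows, labels, params, k)
        if best is None or score > best[0]:
            best = (score, params)
    return best


# Hypothetical toy data: listing titles labeled with a brand.
TITLES = [
    "nike air max running shoes", "nike dri-fit running shirt", "nike air zoom sneakers",
    "apple iphone 12 case", "apple macbook pro charger", "apple iphone 13 screen",
    "sony playstation 5 controller", "sony wireless headphones", "sony playstation 4 console",
]
LABELS = ["nike"] * 3 + ["apple"] * 3 + ["sony"] * 3
GRID = {"num_buckets": [8, 64], "alpha": [0.1, 1.0]}  # assumed values

best_score, best_params = grid_search(fit_nb, predict_nb, TITLES, LABELS, GRID, k=3)
print(best_score, best_params)
```

Because the model and its hyperparameters are passed in as plain arguments, swapping in a different algorithm or grid requires no change to the evaluation code, which is the property the abstract attributes to running prototyping and production on a single pipeline.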