Lantao Jin

Software Engineer, eBay

Lantao Jin is a software engineer in eBay's Infrastructure Data Engineering (INDE) group, focusing on Spark optimization and efficient platform building. He is a contributor to Apache Spark and Apache Hadoop and is familiar with a variety of distributed systems. Prior to eBay, he worked at Meituan-Dianping and Alibaba on data platform and data warehouse infrastructure.

Past sessions

The convergence of big data technology toward the traditional database domain has become an industry trend. Open source big data processing engines such as Apache Spark, Apache Hadoop, and Apache Flink already support SQL interfaces, and SQL now dominates their usage. Companies use these open source projects to build their own ETL frameworks and OLAP technology. OLTP, however, remains a strength of traditional databases, and one of the main reasons is their support for ACID.

Traditional commercial databases with ACID capabilities implement the complete set of CRUD operations. In the big data domain, due to the lack of ACID support, generally only C(reate) and R(ead) operations are implemented; U(pdate) and D(elete) operations are rarely involved. Part of the eBay data warehouse infrastructure is built on the commercial database Teradata. In recent years, as the company's overall technology migrated to open source solutions, the data warehouse infrastructure has largely moved to the Apache Hadoop and Apache Spark platforms. But to migrate completely off Teradata, you must build a SQL processing engine with the same capabilities. About 5% of analytical SQL queries on Teradata use Update/Delete operations, and Apache Spark currently lacks this capability.

This session introduces how the eBay Carmel team used Delta Lake to extend legacy Apache Spark to fully support Teradata's Update/Delete syntax. Besides the standard SQL Update/Delete provided by Apache Spark 3.0 (not yet released at the time of this session), we have also implemented Teradata's extended syntax on Apache Spark 2.3, which can perform more complex Update/Delete SQL operations, such as updates and deletes involving joins. In this session, I will explain how we did it and walk through the technical details.
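As a rough illustration of the target semantics (a sketch, not the Carmel team's actual implementation), the open source Delta Lake Scala API already exposes update, delete, and merge operations, and a Teradata-style join update maps naturally onto MERGE. The table paths, aliases, and column names below are hypothetical:

    import io.delta.tables.DeltaTable
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("delta-update-delete").getOrCreate()

    // Hypothetical Delta table; the path and schema are illustrative only.
    val orders = DeltaTable.forPath(spark, "/data/delta/orders")

    // Standard single-table UPDATE and DELETE.
    orders.updateExpr(
      "status = 'OPEN'",               // condition
      Map("status" -> "'EXPIRED'"))    // column -> new value expression
    orders.delete("order_date < '2018-01-01'")

    // A Teradata-style join UPDATE can be expressed with MERGE:
    // rows that match the source table are updated in place.
    val corrections = spark.read.parquet("/data/corrections")
    orders.as("t")
      .merge(corrections.as("s"), "t.order_id = s.order_id")
      .whenMatched()
      .updateExpr(Map("amount" -> "s.amount"))
      .execute()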

Speaker: Lantao Jin

Summit 2019: Managing Apache Spark Workload and Automatic Optimizing

April 23, 2019 05:00 PM PT

eBay relies heavily on Spark as one of its most significant data engines. In the data warehouse domain, millions of batch queries run every day against 6,000+ key DW tables, which hold over 22 PB of compressed data and continue to grow every year. In the machine learning domain, Spark is playing an increasingly significant role. We presented our migration from an MPP database to Apache Spark at last year's Summit in Europe. Still, from the perspective of the entire infrastructure, managing workload and efficiency for all Spark jobs across our data center remains a big challenge.

Our team leads the entire big data platform infrastructure and the management tools built on top of it, helping our customers -- not only DW engineers and data scientists, but also AI engineers -- work on the same page. In this session, we will introduce how a self-service workload management portal/system benefits all of them. First, we will share the basic architecture of this system to illustrate how it collects metrics from multiple data centers and how it detects abnormal workloads in real time. We developed a component called Profiler, which enhances the current Spark core to support customized metric collection, as sketched below.
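The actual Profiler patches Spark core itself, but as a minimal sketch of customized metric collection under that assumption, stock Spark's SparkListener interface can already capture per-task metrics and forward them to an external collector (the println sink below is a placeholder):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
    import org.apache.spark.sql.SparkSession

    // A minimal metric-collecting listener; a production Profiler would
    // batch these records and ship them to a central metrics store.
    class ProfilerListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val m = taskEnd.taskMetrics
        if (m != null) {
          // One flat record per finished task; replace println with a real sink.
          println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
            s"runTimeMs=${m.executorRunTime} gcTimeMs=${m.jvmGCTime} " +
            s"shuffleReadBytes=${m.shuffleReadMetrics.totalBytesRead} " +
            s"spilledBytes=${m.memoryBytesSpilled}")
        }
      }
    }

    val spark = SparkSession.builder().appName("profiler-demo").getOrCreate()
    spark.sparkContext.addSparkListener(new ProfilerListener)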

Next, we will walk through some real user stories at eBay to show how the self-service system reduces effort on both the customer side and the infra-team side; this Spark job analysis and diagnosis is the highlight of the session. Finally, we will introduce some upcoming advanced features that move toward an automatic optimizing workflow rather than just alerting.
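To make the "optimize rather than just alert" idea concrete, here is a hedged sketch of how a diagnosis rule could map a detected symptom to a recommended action; the JobProfile record, thresholds, and recommendations are all hypothetical, not the system's actual rules:

    // Hypothetical per-job summary produced by the metrics pipeline.
    case class JobProfile(jobId: String,
                          maxTaskRunTimeMs: Long,
                          medianTaskRunTimeMs: Long,
                          spilledBytes: Long)

    // A rule maps a symptom to a concrete recommendation instead of
    // a plain alert; the thresholds are illustrative only.
    def diagnose(p: JobProfile): Seq[String] = {
      val advice = scala.collection.mutable.Buffer[String]()
      if (p.medianTaskRunTimeMs > 0 &&
          p.maxTaskRunTimeMs > 10 * p.medianTaskRunTimeMs)
        advice += s"${p.jobId}: task skew detected, consider repartitioning or salting hot keys"
      if (p.spilledBytes > (4L << 30))  // more than 4 GiB spilled to disk
        advice += s"${p.jobId}: heavy spill, consider raising spark.sql.shuffle.partitions or executor memory"
      advice.toSeq
    }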