Yan Li - Databricks

Yan Li

Engineering Manager, Conviva

Yan is an Engineering Manager at Conviva, where she has been working on various projects in the area of Internet video streaming, and has built several products used by content publishers and ecosystem partners to monitor and optimize the experience of millions of end viewers. Yan holds a Ph.D. degree in Computer Science from the University of Connecticut.


Scale a Near Real-Time AI System by 4X and Beyond with Apache Spark (Summit 2018)

During last year's Spark Summit, we presented a near real-time, Spark-based application for video streaming quality analysis. Building a new system is the sweet, fun part; productizing it can be bitter. At our scale, it was worth spending development time to reduce machine requirements, which we did by optimizing the number of splits, minimizing intra-job data shuffling, and customizing our SerDe. With these optimizations we were able to use 40% fewer machines; the Spark UI and other profiling tools guided us through this work.

- Spark is designed to be resilient to single failures by finishing computations eventually. For our near real-time system, "eventually" is not good enough. As we scaled the system, these single points of soft failure became evident; Spark's speculative execution feature helped us alleviate failures in worker nodes.
- An AI-based system requires a large volume of training data, and recovering model state after an unexpected outage can take hours, which is unacceptable in a near real-time use case. Spark checkpointing is a common way to recover from worker-node failure; in our application, we developed a solution that further uses it to jump-start the whole application.
- The Spark UI shows only the most recent state of the cluster. For long-running applications, it is also necessary to monitor system measurements over a relatively long period to identify performance issues. Using an internal monitoring tool, we identified several system-wide bottlenecks, such as database write congestion.

Session hashtag: #ExpSAIS14
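The speculative-execution tuning described above can be sketched as follows. This is a minimal sketch in Python: the configuration keys `spark.speculation`, `spark.speculation.quantile`, and `spark.speculation.multiplier` are standard Spark settings, but the values shown and the `build_spark_conf` helper are illustrative assumptions, not Conviva's actual configuration.

```python
# Illustrative Spark settings for mitigating slow ("straggler") tasks.
# The keys are real Spark configuration options; the values are examples only.
speculation_conf = {
    # Re-launch copies of tasks that run much slower than their peers.
    "spark.speculation": "true",
    # Consider a task for speculation once 75% of the stage's tasks are done...
    "spark.speculation.quantile": "0.75",
    # ...and it has run 1.5x longer than the median completed task.
    "spark.speculation.multiplier": "1.5",
}

def build_spark_conf(extra=None):
    """Merge the speculation settings with caller-supplied overrides.

    The resulting dict could be handed to SparkConf.setAll(); `extra` wins
    on key collisions so a job can override the defaults.
    """
    conf = dict(speculation_conf)
    if extra:
        conf.update(extra)
    return conf
```

Checkpointing, the second recovery mechanism mentioned above, is enabled separately by pointing the streaming context at durable storage (e.g. an HDFS path) so state can be restored after an outage.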


From Data to Actions and Insights at Conviva (Summit 2017)

Video streaming is still a challenge, especially with increasing demand for high-quality streaming experiences, and problems can happen anywhere in the complex streaming ecosystem. At Conviva, we collect data about video streaming quality to give our customers (publishers and ecosystem partners) visibility into the end-user experience they're delivering. Conviva's job is to distill these data into actionable insights or, better yet, to take automatic actions to improve quality. In this session, we will discuss two systems we have built to this end:

- AutoAlert is a system for detecting and diagnosing anomalies in streaming quality in real time. Conviva uses Apache Spark to group millions of sessions according to a variety of criteria (e.g., the title of the streaming media, or the device the user is streaming from) and to detect anomalies in several quality metrics over time. The detection task runs once per minute, with a detection latency of about 2 minutes.
- GO is a system for directing users to the CDN that will provide the best quality. The architecture for this prediction task mirrors AutoAlert's.

Our talk will focus on the outcomes we've achieved for our customers with these systems. Session hashtag: #SFent6
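As a rough illustration of the per-group anomaly detection AutoAlert performs, here is a minimal pure-Python sketch. This is not Conviva's implementation: the z-score rule, the `detect_anomalies` name, and the threshold are assumptions for illustration; in production such logic would run on Spark over millions of sessions, grouped by criteria like title or device.

```python
from statistics import mean, stdev

def detect_anomalies(history, current, threshold=3.0):
    """Flag groups whose latest quality metric deviates from their history.

    history: dict mapping a group key (e.g. "roku" or a content title) to a
             list of past per-minute metric values (e.g. rebuffering ratio).
    current: dict mapping a group key to its latest metric value.
    Returns the set of groups whose latest value lies more than `threshold`
    standard deviations from that group's historical mean.
    """
    anomalous = set()
    for group, values in history.items():
        if group not in current or len(values) < 2:
            continue  # not enough history to estimate spread
        mu = mean(values)
        sigma = stdev(values)
        if sigma == 0:
            continue  # flat history; z-score undefined
        if abs(current[group] - mu) / sigma > threshold:
            anomalous.add(group)
    return anomalous
```

Running this check once per minute over the most recent window mirrors the cadence described above, with the heavy lifting (grouping and aggregating sessions) done upstream.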