Emilie de Longueau is a Senior Software Engineer (Machine Learning) on the Communities AI team at LinkedIn, focused on driving member engagement through personalized and scalable Follow Recommendations for hundreds of millions of members. She has five years of industry experience in data science and machine learning, building big data solutions and algorithms with Spark. Emilie holds Master's degrees in Industrial Engineering and Operations Research from the University of California, Berkeley, and École des Ponts ParisTech (Paris). Her expertise in Apache Spark has helped her team modernize its offline scoring infrastructure to improve the scalability and relevance of Follow Recommendations.
The Communities AI team at LinkedIn generates follow recommendations from a large set (tens of millions) of entities for each of our 650+ million members. These recommendations are driven by ML models that rely on three sets of features: member, entity, and interaction features. To support a fast-growing user base, an expanding set of recommendable entities (members, companies, hashtags, groups, newsletters, etc.), and more sophisticated modeling approaches, we have re-engineered the system to allow for efficient offline scoring in Spark. In particular, we have handled the 'explosive' growth of data by developing a 2D Hash-Partitioned Join algorithm that optimizes the join of hundreds of terabytes of features without requiring significant data shuffling. In addition to a 5X runtime performance gain, this opened the opportunity to train and score with a suite of non-linear models such as XGBoost, which improved the global follow rate on the platform by 15% and downstream engagement on the LinkedIn feed from followed entities by 10%.
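The abstract does not spell out the join algorithm, but the general idea behind a 2D hash-partitioned join can be sketched as follows: hash members into P row-buckets and entities into Q column-buckets, replicate each member's features across one row of a P×Q grid and each entity's features down one column, and route every (member, entity) candidate pair to exactly one grid cell, where the feature lookup becomes a purely local join. This is a minimal, illustrative pure-Python sketch, not LinkedIn's actual Spark implementation; the grid dimensions, function names, and data shapes are assumptions.

```python
from collections import defaultdict

# Hypothetical grid dimensions; in a real Spark job these would be tuned
# to cluster size and the relative sizes of the two feature tables.
P, Q = 2, 3

def bucket(key, n):
    """Hash a key into one of n buckets."""
    return hash(key) % n

def partition_2d(member_feats, entity_feats, pairs):
    """Replicate member features across a grid row and entity features down
    a grid column, then route each candidate pair to its unique cell."""
    cells = defaultdict(lambda: {"members": {}, "entities": {}, "pairs": []})
    for m, f in member_feats.items():
        i = bucket(m, P)
        for j in range(Q):                    # one copy per column: Q replicas
            cells[(i, j)]["members"][m] = f
    for e, f in entity_feats.items():
        j = bucket(e, Q)
        for i in range(P):                    # one copy per row: P replicas
            cells[(i, j)]["entities"][e] = f
    for m, e in pairs:
        # The pair's cell is fully determined by the two hashes, so the
        # features it needs are guaranteed to be co-located with it.
        cells[(bucket(m, P), bucket(e, Q))]["pairs"].append((m, e))
    return cells

def score_cell(cell):
    """Local join inside one cell: no cross-cell data movement needed."""
    return [(m, e, cell["members"][m], cell["entities"][e])
            for m, e in cell["pairs"]]
```

The shuffle savings come from the replication trade: each feature row is copied a small, fixed number of times (P or Q) instead of shuffling the full cross product of candidate pairs against both feature tables.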
Building a practical and manageable data lake seems a simple undertaking at first, especially given the plethora of purpose-fit solutions and products available today. However, as we undertake this journey, several key questions come up: data archival; orchestration and dynamic data pipelines; quality and completeness controls; audit and segregation of duties; batch, micro-batch, and event-based ingestion; keeping operational costs under control; choosing appropriate cloud solutions; and more.
Most of us perceive a data lake as a platform for processing large files; however, many finance and insurance companies face a unique challenge: ingesting and integrating thousands of small files each day. Interestingly, most of today's distributed and Spark computing methodologies need to be customized and tailored to meet this need. All of these variables tend to make the data lake solution over-engineered and complicated, eventually contradicting the principles of data lake design and making it unmanageable. I would love to share how, at Transamerica, we were able to leverage Databricks Delta Lake and Azure Data Factory to simplify the ingestion and versioning of thousands of files each day at scale, without compromising on controls, audit, and data compliance requirements.
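One common customization for the many-small-files problem is to batch incoming files into groups near a target size before writing them out, so the lake holds fewer, larger objects (Delta Lake also addresses this on the storage side via compaction, e.g. its `OPTIMIZE` command). The following is a minimal, illustrative sketch of such greedy batching in pure Python — the function name and 128 MB target are assumptions, not Transamerica's actual pipeline.

```python
def batch_small_files(files, target_bytes=128 * 1024 * 1024):
    """Greedily group (name, size) file entries into batches that stay
    at or under a target size, largest files first, so each batch can
    be ingested and written as one larger object."""
    batches, current, size = [], [], 0
    for name, nbytes in sorted(files, key=lambda f: f[1], reverse=True):
        if current and size + nbytes > target_bytes:
            batches.append(current)          # close the full batch
            current, size = [], 0
        current.append(name)
        size += nbytes
    if current:
        batches.append(current)              # flush the final partial batch
    return batches
```

A downstream job would then read each batch of source files together and append the result as a single write, keeping file counts (and driver/metadata overhead) under control.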