Reza Karimi

Data Scientist, Elsevier

Dr. Reza Karimi is a lead data scientist in Elsevier's Search and Data Science division. His work focuses on content modeling with deep learning, entity resolution, author disambiguation, and network analysis of research communities. Formerly, he was a research scientist and project lead at Philips Research, where he worked on predictive maintenance of remote devices as well as healthcare productivity and quality analysis. He holds a PhD in mechanical engineering from MIT and has extensive experience in parallel processing of multi-dimensional images as well as statistical analysis and data mining of molecular trajectories during transport into the nucleolus.

SESSIONS

Mentor and Mentee Relations Based on Authorship Graphs

Elsevier owns Scopus, one of the largest scientific abstract databases in the world. This corpus covers about 200 million authorships in 65 million abstracts going back a few centuries. We convert these authorships into disambiguated authors, so that for each author we know all the corresponding publications, affiliations, and co-authors. Authorships in this data can be used to obtain highly valuable insights, such as finding influential authors or research communities and trends. Moreover, relationships among authors, such as mentor/mentee or collaborator, can be detected from co-authorship patterns.

Here we present an ML pipeline in which Spark components such as GraphFrames and Spark ML are combined to detect mentor-mentee relationships. We present libraries that extend Spark's graph functionality and connect it seamlessly to D3.js for graph visualisation in web portals. We discuss replacing the ML component (trained on a crowd-sourced golden data set) with a manually constructed heuristic model, in order to evaluate the gains from sophisticated ML training. Moreover, we discuss how trillions of transient co-authorship records can be converted into simplified aggregate features to be fed into an ML model for relationship detection among authors. These aggregate features are derived from disambiguated authorship data, which is itself a partially faulty input, and we discuss how to build a model that is robust to this low-fidelity input.

This talk covers a healthy balance of rapidly developing Spark components, such as Spark ML and GraphFrames, and deep technical detail on graph analysis and ML training. Listeners will see a demonstration and be introduced to new libraries for extended visualisation and graph analysis. Our live demo shows instantaneous computation of the academic ancestry and progeny of volunteers, providing an example of Spark acting as the back end for a web application.
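
As a rough illustration of the kind of pipeline described above, the sketch below aggregates raw authorships into per-pair features with Spark SQL, wraps them in a GraphFrame, and trains a Spark ML classifier on a labelled subset of pairs. The column names, the feature set, the input path, and the labelledEdges table are hypothetical placeholders, not the actual Scopus schema or the production feature set.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.graphframes.GraphFrame

val spark = SparkSession.builder.appName("MentorMentee").getOrCreate()
import spark.implicits._

// Hypothetical authorship table: one row per (doc_id, author_id) with a
// publication year and the author's position in the author list.
val authorships = spark.read.parquet("/data/authorships")

// Every pair of authors sharing a document is one transient co-authorship.
val pairs = authorships.as("a")
  .join(authorships.as("b"), Seq("doc_id"))
  .where($"a.author_id" < $"b.author_id")

// Aggregate transient co-authorships into per-pair features, e.g. number of
// shared papers, length of the collaboration, average author-position gap.
val edges = pairs.groupBy($"a.author_id".as("src"), $"b.author_id".as("dst"))
  .agg(
    count(lit(1)).as("shared_papers"),
    (max($"a.pub_year") - min($"a.pub_year")).as("collab_span"),
    avg($"b.author_position" - $"a.author_position").as("position_gap"))

val vertices = authorships.select($"author_id".as("id")).distinct()
val coauthorGraph = GraphFrame(vertices, edges)

// Train a classifier on edges joined with crowd-sourced labels
// (label = 1.0 for a mentor-mentee pair); labelledEdges is assumed to exist.
val assembler = new VectorAssembler()
  .setInputCols(Array("shared_papers", "collab_span", "position_gap"))
  .setOutputCol("features")
val rf = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("features")
val model = new Pipeline().setStages(Array(assembler, rf)).fit(labelledEdges)

// Score all co-authorship edges and keep the predicted mentor-mentee links.
val mentorEdges = model.transform(coauthorGraph.edges).where($"prediction" === 1.0)
```

Swapping the RandomForestClassifier stage for a hand-written rule on the same aggregate features is one way to compare a heuristic baseline against the trained model.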
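
A second, equally illustrative sketch shows two pieces a Spark-backed web demo would need: walking predicted mentor edges to collect an author's academic ancestry, and serialising a small subgraph into the nodes/links JSON shape consumed by D3 force layouts. The traversal depth, the src = mentee / dst = mentor convention, and the naive JSON serialisation are assumptions, not the libraries presented in the talk.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Walk up the predicted mentor edges (src = mentee, dst = mentor) to collect
// an author's academic ancestry up to maxDepth generations.
def ancestry(mentorEdges: DataFrame, authorId: String, maxDepth: Int = 5): DataFrame = {
  var frontier = mentorEdges.where(col("src") === authorId).select(col("dst").as("id"))
  var ancestors = frontier
  for (_ <- 2 to maxDepth) {
    frontier = mentorEdges.as("e")
      .join(frontier.as("f"), col("e.src") === col("f.id"))
      .select(col("e.dst").as("id"))
      .distinct()
    ancestors = ancestors.union(frontier).distinct()
  }
  ancestors
}

// Serialise a small subgraph into the {nodes, links} structure that d3-force
// expects, so a web front end can request it from a Spark back end.
// (Naive string building for illustration; a JSON library would be used in practice.)
def toD3Json(vertices: DataFrame, edges: DataFrame): String = {
  val nodes = vertices.select("id").collect()
    .map(r => s"""{"id":"${r.getString(0)}"}""")
  val links = edges.select("src", "dst").collect()
    .map(r => s"""{"source":"${r.getString(0)}","target":"${r.getString(1)}"}""")
  s"""{"nodes":[${nodes.mkString(",")}],"links":[${links.mkString(",")}]}"""
}
```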

Deduplication and Author-Disambiguation of Streaming Records via Supervised Models Based on Content Encoders

Here we present a general supervised framework for record deduplication and author-disambiguation via Spark. This work differentiates itself in several ways:

- Application of Databricks and AWS makes this a scalable implementation, with considerably lower compute cost than traditional legacy technology running big boxes 24/7. Scalability is crucial, as Elsevier's Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts spanning a few hundred years.

- We create a fingerprint for each piece of content with deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TF-IDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities, such as documents, authors, users, and journals. This standard representation reduces the recommendation problem to a pairwise similarity search and hence offers a basic recommender for cross-product applications where no dedicated recommender engine has been designed.

- Traditional author-disambiguation or record-deduplication algorithms are batch processes with little or no training data. However, we have roughly 25 million authorships that are manually curated or corrected from user feedback. Since it is crucial to maintain historical profiles, we have developed a machine learning implementation that deals with data streams and processes them in mini-batches or one document at a time.

We will discuss how to measure the accuracy of such a system, how to tune it, and how to turn the raw output of the pairwise similarity function into final clusters. Lessons learned from this talk can help any company that wants to integrate its data or deduplicate its user/customer/product databases.

Session hashtag: #EUai2
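
To make the fingerprinting step concrete, the sketch below uses Spark ML's Word2Vec to turn each record's text into a dense fingerprint and a cosine-similarity UDF to score candidate pairs. The abstracts and candidatePairs DataFrames, the column names, and the parameter values are illustrative assumptions, not the production encoder (which may be a deep-learning model rather than averaged word vectors).

```scala
import org.apache.spark.ml.feature.{RegexTokenizer, Word2Vec}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions._

// Hypothetical inputs: `abstracts` with (record_id, text) and `candidatePairs`
// with (id_a, id_b), e.g. pairs of records sharing a blocking key such as a
// normalised author name.
val tokens = new RegexTokenizer()
  .setInputCol("text").setOutputCol("tokens").setPattern("\\W+")
  .transform(abstracts)

// Word2vec fingerprints: averaging word vectors yields a dense, fixed-length
// representation per record that preserves semantic similarity, unlike sparse
// TF-IDF. More partitions trade a little accuracy for much faster training.
val w2v = new Word2Vec()
  .setInputCol("tokens").setOutputCol("fingerprint")
  .setVectorSize(200).setMinCount(5).setNumPartitions(64)
val encoded = w2v.fit(tokens).transform(tokens)

// Cosine similarity between two fingerprints, applied only to candidate pairs
// so the comparison never becomes an all-against-all cross join.
val cosine = udf { (a: Vector, b: Vector) =>
  val dot  = a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
  val norm = math.sqrt(a.toArray.map(x => x * x).sum) *
             math.sqrt(b.toArray.map(x => x * x).sum)
  dot / norm
}

val scored = candidatePairs
  .join(encoded.select(col("record_id").as("id_a"), col("fingerprint").as("fp_a")), Seq("id_a"))
  .join(encoded.select(col("record_id").as("id_b"), col("fingerprint").as("fp_b")), Seq("id_b"))
  .withColumn("similarity", cosine(col("fp_a"), col("fp_b")))
```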
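
Continuing the same sketch, one common way to turn pairwise similarity scores into final clusters is to threshold the scores and take connected components of the resulting match graph with GraphFrames. The threshold value and checkpoint path below are placeholders, and the real system tunes the cutoff against the manually curated authorships rather than using a fixed number.

```scala
import org.graphframes.GraphFrame
import org.apache.spark.sql.functions.col

// Pairs scored above an illustrative threshold become edges; connected
// components of that graph are the final deduplicated profiles.
// `scored` is the similarity DataFrame from the previous sketch.
val matchEdges = scored
  .where(col("similarity") > 0.85)
  .select(col("id_a").as("src"), col("id_b").as("dst"))

val nodes = matchEdges.select(col("src").as("id"))
  .union(matchEdges.select(col("dst").as("id")))
  .distinct()

// GraphFrames' connectedComponents requires a checkpoint directory.
nodes.sparkSession.sparkContext.setCheckpointDir("/tmp/cc-checkpoints")

val clusters = GraphFrame(nodes, matchEdges).connectedComponents.run()
// clusters has (id, component); each component is one profile. New mini-batch
// records can then be scored against existing component representatives
// instead of re-clustering the whole corpus.
```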