Mentor and Mentee Relations Based on Authorship Graphs


Elsevier owns Scopus, one of the largest databases of scientific abstracts in the world. The corpus covers about 200 million authorships across 65 million abstracts, going back a few centuries. We convert these authorships into disambiguated authors, so that for each author we know all the corresponding publications, affiliations, and co-authors. Authorship data can yield highly valuable insights, such as identifying influential authors or research communities and trends. Moreover, relationships between authors, such as mentor/mentee or collaborator, can be detected from co-authorship patterns. Here, we present an ML pipeline in which Spark components such as GraphFrames and Spark ML are combined to detect mentor-mentee relationships. We present libraries that extend Spark's graph functionality and connect it seamlessly to D3 JavaScript for graph visualisation in web portals. We discuss replacing the ML component (trained on a crowd-sourced golden data set) with a manually constructed heuristic model, in order to evaluate the gains from sophisticated ML training. Moreover, we discuss how trillions of transient co-authorship data points can be converted into simplified aggregate features to be fed into an ML model for detecting relationships among authors. The input aggregate data is based on disambiguated authorship data, which is itself a partially faulty input, and we discuss how to build a model that is robust against this low-fidelity input. This talk covers a healthy balance of rapidly developing Spark components, such as ML and GraphFrames, and deep technical details of graph analysis and ML training. Listeners can benefit from a demonstration and be introduced to new libraries that extend visualisation and graph analysis. Our live demo can show instantaneous computation of the academic ancestry and progeny of volunteers. As such, we also provide an example in which Spark acts as the back end for a web application.
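To make the idea of aggregate features and a hand-built heuristic concrete, here is a minimal sketch in plain Python rather than Spark. It is not the pipeline from the talk. The toy records, the feature names (`joint`, `early_joint`, `seniority_gap`), the thresholds, and the function names are all hypothetical choices made for illustration. The sketch collapses per-paper co-authorship records into per-pair features, applies a simple rule to guess mentor-mentee edges, and then walks those edges to list an author's academic ancestry.

```python
from collections import defaultdict

# Hypothetical toy records: (paper_id, year, disambiguated author ids).
PAPERS = [
    ("p0", 1990, ["bob"]),
    ("p1", 2001, ["alice", "bob"]),
    ("p2", 2002, ["alice", "bob"]),
    ("p3", 2010, ["carol", "alice"]),
    ("p4", 2011, ["carol", "alice", "bob"]),
    ("p5", 2012, ["carol", "alice"]),
]


def first_publication_year(papers):
    """Return each author's earliest publication year (a proxy for career start)."""
    first = {}
    for _, year, authors in papers:
        for author in authors:
            first[author] = min(year, first.get(author, year))
    return first


def pair_features(papers, early_window=3):
    """Aggregate per-paper records into features keyed by (senior, junior).

    'Senior' here simply means the author with the earlier first publication.
    """
    first = first_publication_year(papers)
    feats = defaultdict(lambda: {"joint": 0, "early_joint": 0})
    for _, year, authors in papers:
        for senior in authors:
            for junior in authors:
                if first[senior] < first[junior]:
                    f = feats[(senior, junior)]
                    f["joint"] += 1
                    # Count papers written early in the junior author's career.
                    if year - first[junior] < early_window:
                        f["early_joint"] += 1
    for (senior, junior), f in feats.items():
        f["seniority_gap"] = first[junior] - first[senior]
    return dict(feats)


def is_mentor(f, min_gap=5, min_early=2):
    """Assumed heuristic: a clear seniority gap plus repeated early collaboration."""
    return f["seniority_gap"] >= min_gap and f["early_joint"] >= min_early


def mentor_edges(papers):
    """Return sorted (mentor, mentee) pairs that pass the heuristic."""
    return sorted(pair for pair, f in pair_features(papers).items() if is_mentor(f))


def ancestors(author, edges):
    """Breadth-first walk from an author up through mentors of mentors."""
    mentors_of = defaultdict(list)
    for mentor, mentee in edges:
        mentors_of[mentee].append(mentor)
    found, queue = [], [author]
    while queue:
        current = queue.pop(0)
        for mentor in mentors_of[current]:
            if mentor not in found:
                found.append(mentor)
                queue.append(mentor)
    return found


if __name__ == "__main__":
    edges = mentor_edges(PAPERS)
    print(edges)
    print(ancestors("carol", edges))
```

On this toy data, the single joint paper between bob and carol is not enough to pass the rule, so carol's ancestry is reached only through alice. A trained model would replace `is_mentor` while consuming the same aggregate features, which is what makes a comparison between the heuristic and the ML component possible.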

About Reza Karimi

Dr. Reza Karimi is currently a lead data scientist in Elsevier's Search and Data Science Division. His work focuses on content modeling with deep learning, entity resolution, author disambiguation, and network analysis of research communities. Formerly, he was a research scientist and project lead at Philips Research, where he worked on predictive maintenance of remote devices as well as healthcare productivity and quality analysis. He has a PhD in mechanical engineering from MIT, with extensive experience in parallel processing of multi-dimensional images as well as statistical analysis and data mining of molecular trajectories during transport into the nucleolus.