Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs to our Scientists - Databricks

It is widely known that discovering, developing, and commercializing new classes of drugs can take 10-15 years and more than $5 billion in R&D investment, only for fewer than 5% of the drugs to make it to market.

AstraZeneca is a global, innovation-driven biopharmaceutical business that focuses on the discovery, development, and commercialization of prescription medicines for some of the world’s most serious diseases. Our scientists have been able to improve our success rate over the past 5 years by moving to a data-driven approach (the “5R”) to help develop better drugs faster, choose the right treatment for a patient and run safer clinical trials.

However, our scientists are still unable to make these decisions with all of the available scientific information at their fingertips. Data is sparse across our company as well as external public databases, every new technology requires a different data processing pipeline, and new data arrives at an increasing pace. It is often repeated that a new scientific paper appears every 30 seconds, which makes it impossible for any individual expert to keep up to date with the pace of scientific discovery.

To help our scientists integrate all of this information and make targeted decisions, we have used Spark on Azure Databricks to build a knowledge graph of biological insights and facts. The graph powers a recommendation system which enables any AZ scientist to generate novel target hypotheses, for any disease, leveraging all of our data.

In this talk, I will describe the applications of our knowledge graph and focus on the Spark pipelines we built to quickly assemble the graph from hundreds of sources and create projections of it. I will also describe the NLP pipelines we have built, leveraging spaCy, BioBERT, or Snorkel, to reliably extract meaningful relations between entities and add them to our knowledge graph.
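
As a rough illustration of what assembling a graph from several sources and then projecting it can involve, here is a minimal sketch in plain Python rather than Spark. It is not the pipeline described in the talk: the `Edge` type, the `assemble` and `project` helpers, the source names, and the placeholder entities are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class Edge(NamedTuple):
    """One graph edge plus the (hypothetical) source it came from."""
    subject: str
    relation: str
    obj: str
    source: str


def assemble(sources: Dict[str, Iterable[Tuple[str, str, str]]]) -> List[Edge]:
    """Merge (subject, relation, object) triples from every source into one
    edge list, tagging each edge with its source and dropping exact repeats."""
    seen: Set[Edge] = set()
    edges: List[Edge] = []
    for source_name, triples in sources.items():
        for subject, relation, obj in triples:
            edge = Edge(subject, relation, obj, source_name)
            if edge not in seen:
                seen.add(edge)
                edges.append(edge)
    return edges


def project(
    edges: Iterable[Edge], relations: Set[str]
) -> Dict[Tuple[str, str, str], List[str]]:
    """Keep only the requested relation types and list, for each remaining
    triple, the sources that support it."""
    support: Dict[Tuple[str, str, str], List[str]] = defaultdict(list)
    for edge in edges:
        if edge.relation in relations:
            support[(edge.subject, edge.relation, edge.obj)].append(edge.source)
    return dict(support)


# Toy inputs: entity names and source names are placeholders, not real data.
sources = {
    "literature_nlp": [
        ("GENE_A", "associated_with", "DISEASE_X"),
        ("DRUG_1", "inhibits", "GENE_A"),
    ],
    "curated_db": [
        ("GENE_A", "associated_with", "DISEASE_X"),
        ("GENE_B", "interacts_with", "GENE_A"),
    ],
}

edges = assemble(sources)
gene_disease = project(edges, {"associated_with"})
```

In this sketch, keeping the source on every edge is what lets a projection report how many independent sources support a given relation. At scale, the same two steps would more plausibly be expressed as a union of per-source tables followed by a filter and a group-by.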

About Eliseo Papa

Eliseo is an MD and a computational biologist who earned his Ph.D. in biomedical engineering at the Harvard/MIT HST institute. At MIT, he developed new single-cell diagnostic tools and pioneered the application of machine learning algorithms to microbiome sequencing data. He has contributed to the founding of a number of life science startups (Seres, Enumeral, Finch Therapeutics), working primarily as a data scientist in ML and NLP. He previously worked at Open Targets, a public-private initiative to generate evidence on the validity of therapeutic targets based on genome-scale experiments and analysis. He currently leads a data science team at AstraZeneca focussed on building large-scale knowledge graphs and recommendation systems to identify promising new drug targets for AstraZeneca's drug discovery pipeline and to influence the design of ongoing oncology trials.