customer story
How AI is changing drug discovery

AstraZeneca leverages data and NLP to help scientists research novel drugs

INDUSTRY: Life sciences

SOLUTION: Recommendation engine

TECHNICAL USE CASE: Data ingest and ETL, machine learning, deep learning

AstraZeneca discovers, develops, and commercializes groundbreaking drugs for some of the world’s most serious diseases. The biggest obstacle to innovation is the inability to tap into all of the available scientific information as fast as new data comes in. They needed a platform that enabled them to build scalable, performant data pipelines that feed machine learning models designed to help their scientists make targeted decisions. With Databricks, they are able to leverage data and machine learning to build a recommendation engine that empowers scientists to uncover novel drugs more quickly, cheaply, and effectively.

Too much data slows decision making

It is widely known that the discovery, development, and commercialization of new classes of drugs can take 10-15 years and more than $5 billion in R&D investment, only to see fewer than 5% of the drugs make it to market. Recognizing that this pace of innovation is not sufficient, AstraZeneca moved to a data-driven approach to increase their success rate in drug discovery and to manage clinical trials more safely.

However, their scientists were still unable to quickly make informed decisions using all of the available scientific information. They struggled with data residing in disjointed sources, both within the company and in external public databases. Furthermore, as new scientific research continues to be released at a rapid pace, it became virtually impossible to keep up to date with scientific discovery.

  • Infrastructure complexity: Finding the infrastructure that allowed for flexibility but didn’t require constant maintenance.
  • Massive volumes of disjointed data: Tasked with ingesting, parsing, and analyzing millions of data points across hundreds of data sources, both internal and public, such as technical literature and public databases.
  • Limited scalability: Struggled to scale operations to support data science efforts with open source Python notebooks.

Faster data pipelines fuel ML innovation

AstraZeneca leverages the Databricks Unified Data Analytics Platform on Azure to help build a knowledge graph of biological insights and facts. The graph powers a recommendation system that enables any AstraZeneca scientist to generate novel target hypotheses for any disease, leveraging all of the data available to them.

  • Fully managed platform: Simplified cluster management and maintenance of analytic resources at scale on Azure.
  • Built scalable, performant data pipelines: Able to leverage NLP across a huge library of scientific literature and data sources for downstream analysis.
  • Accelerating machine learning innovation: Data scientists are empowered to build and train models that provide ranking predictions that will help them make smarter decisions.
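
To make the idea of a knowledge graph feeding ranked recommendations more concrete, here is a minimal Python sketch. It is an illustration only: the story does not describe how AstraZeneca's pipeline or models are implemented, and the entity names, the triples, and the helpers `build_graph` and `rank_targets` are all hypothetical. The scoring rule (counting graph neighbors shared with a disease's known associations) is a simple stand-in for a trained ranking model.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) facts, standing in for what an NLP
# pipeline might extract from literature. All names are invented.
TRIPLES = [
    ("GENE_A", "associated_with", "DISEASE_X"),
    ("GENE_A", "interacts_with", "GENE_B"),
    ("GENE_B", "associated_with", "DISEASE_X"),
    ("GENE_B", "interacts_with", "GENE_C"),
    ("GENE_C", "associated_with", "DISEASE_Y"),
    ("GENE_D", "interacts_with", "GENE_A"),
    ("GENE_D", "interacts_with", "GENE_B"),
]


def build_graph(triples):
    """Build an undirected adjacency map; relation types are ignored here."""
    graph = defaultdict(set)
    for subject, _relation, obj in triples:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return graph


def rank_targets(graph, disease):
    """Rank genes not yet linked to the disease.

    Score = number of neighbors the gene shares with the set of entities
    already linked to the disease. Ties are broken alphabetically.
    """
    known = graph[disease]  # entities directly linked to the disease
    scores = {}
    for node in list(graph):
        if node == disease or node in known or not node.startswith("GENE_"):
            continue
        scores[node] = len(graph[node] & known)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    graph = build_graph(TRIPLES)
    for gene, score in rank_targets(graph, "DISEASE_X"):
        print(gene, score)
```

In this toy data, GENE_D ranks above GENE_C for DISEASE_X because it interacts with both genes already associated with the disease. A production system would replace the counting rule with learned ranking predictions over a far larger graph.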

Transforming how drugs are discovered with AI

Since moving to Databricks, AstraZeneca is now able to more easily process millions of data points from thousands of sources. Removing the barriers of scale has allowed them to more reliably extract meaningful insights that can result in novel drugs designed to help people live healthier lives.

  • Improved operational efficiency: Features such as cluster management and auto-scaling of clusters have improved operations from data ingest to managing the entire machine learning lifecycle.
  • Better data science productivity: Shared notebook environment with support for multiple languages has improved team productivity.
  • Faster time-to-insight: The recommendation engine powered by Databricks has improved their ability to form better-informed hypotheses, allowing them to accelerate time-to-market for novel drugs and medicines.
  • Millions of data points processed from thousands of sources

“By moving to Databricks, we have seen an order of magnitude improvement in performance.”

– Eliseo Papa, Computational Biologist, AstraZeneca

Related Content


Technical Talk at Spark + AI Summit EU 2019