Burning Through Electronic Health Records in Real Time With Smolder
In previous blogs, we looked at two separate workflows for working with patient data coming out of an electronic health record (EHR). In those workflows, we focused on a historical batch extract of EHR data. However, in the real world, data is continuously entered into an EHR. For many of the important predictive healthcare analytics...
Detecting At-risk Patients with Real World Data
With the rise of low-cost genome sequencing and AI-enabled medical imaging, there has been substantial interest in precision medicine. In precision medicine, we aim to use data and AI to come up with the best treatment for a disease. While precision medicine has improved outcomes for patients diagnosed with rare diseases and cancers, precision...
Introducing GlowGR: An industrial-scale, ultra-fast and sensitive method for genetic association studies
Today, we announce that we are making a new whole genome regression method available to the open source bioinformatics community as part of Project Glow. Large cohorts of individuals with paired clinical and genome sequence data enable unprecedented insight into human disease biology. Population studies such as the UK Biobank, Genomics England, or Genome Asia...
Building a Modern Clinical Health Data Lake with Delta Lake
The healthcare industry is one of the biggest producers of data. In fact, the average healthcare organization is sitting on nearly 9 petabytes of medical data. The rise of electronic health records (EHR), digital medical imagery, and wearables is contributing to this data explosion. For example, an EHR system at a large provider can catalogue...
Automating Digital Pathology Image Analysis with Machine Learning on Databricks
Join our webinar Automating the Analysis of Digital Pathology Images with Deep Learning to learn more and see a live demo. With technological advancements in imaging and the availability of new efficient computational tools, digital pathology has taken center stage in both research and diagnostic settings. Whole Slide Imaging (WSI) has been at the center...
Introducing Glow: An Open-Source Toolkit for Large-Scale Genomic Analysis
The key to solving some of today’s most challenging medical problems lies in the analysis of genomics data. Understanding the impact of the minor changes in an individual’s genome on their overall health is fundamentally a data-driven challenge that requires integration across hundreds of thousands of individuals. By analyzing genomes across large cohorts, researchers...
Parallelizing SAIGE Across Hundreds of Cores
As population genetics datasets grow exponentially, it is becoming impractical to work with genetic data without leveraging Apache Spark™. There are many ways to use Spark to derive novel insights into the role of genetic variation in disease processes. For example, Regeneron works directly on Spark SQL DataFrames, and the open-source Hail package can be...
Engineering population scale Genome-Wide Association Studies with Apache Spark™, Delta Lake, and MLflow
The advent of genome-wide association studies (GWAS) in the late 2000s enabled scientists to begin to understand the causes of complex diseases such as diabetes and Crohn’s disease at their most fundamental level. However, academic bioinformatics tools to perform GWAS have not kept pace with the growth of genomic data, which has been doubling globally...
Monitor Medical Device Data with Machine Learning using Delta Lake, Keras and MLflow: On-Demand Webinar and FAQs now available!
On August 20th, our team hosted a live webinar—Automated Monitoring of Medical Device Data with Data Science—with Frank Austin Nothaft, PhD, Technical Director of Healthcare and Life Sciences, and Michael Ortega, Senior Industry and Solutions Marketing Manager. By applying machine learning to medical device data, healthcare organizations can automate patient monitoring, reduce repair costs with...
Accurately Building Genomic Cohorts at Scale with Delta Lake and Spark SQL
This is the second post in our “Genomic Analysis at Scale” series. In our first post, we explored a simple problem: how to provide real-time aggregates when sequencing large volumes of genomes. We solved this problem by using Delta Lake and a streaming pipeline built using Spark SQL. In this blog, we focus on the more advanced...