On August 20th, our team hosted a live webinar—Automated Monitoring of Medical Device Data with Data Science—with Frank Austin Nothaft, PhD, Technical Director of Healthcare and Life Sciences, and Michael Ortega, Senior Industry and Solutions Marketing Manager.
By applying machine learning to medical device data, healthcare organizations can automate patient monitoring, reduce repair costs through preventive maintenance, and gather new insights on patient health outside of a clinical setting. However, most healthcare organizations attempting to build ML pipelines on these large datasets face numerous challenges, such as scaling legacy infrastructure, building reliable streaming pipelines, and developing models efficiently. In this webinar, we shared how to overcome these challenges with Databricks and popular open-source technologies, including a live demo of a deep learning model running on streaming medical device data.
Watch the replay to learn how to:
- Build a streaming pipeline for EKG data using Structured Streaming and Delta Lake
- Improve data consistency guarantees while eliminating data engineering bottlenecks
- Interactively query streaming EKG data in real time
- Rapidly train a deep learning model over terabytes of waveforms
- Track and manage the entire model lifecycle in MLflow, enabling traceability of your analysis
We demonstrated these concepts using these notebooks and tutorials:
- Notebook: Download and preprocess data
- Notebook: Train and tune a neural network
- Notebook: Create a streaming dataset
- Notebook: Run inference on continuously arriving data
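The heart of the inference notebook can be sketched in plain Python. In this minimal sketch, a generator stands in for the continuously arriving stream, and `predict` is a hypothetical stub in place of the trained neural network from the demo:

```python
import numpy as np

WINDOW = 2048  # samples per window, matching the webinar demo

def arriving_windows(n_batches, rng):
    """Simulate continuously arriving EKG windows (a stand-in for the stream)."""
    for _ in range(n_batches):
        yield rng.standard_normal(WINDOW)

def predict(window):
    """Hypothetical stub standing in for the trained neural network."""
    return float(np.mean(np.abs(window)))  # placeholder score, not a real model

rng = np.random.default_rng(0)
scores = [predict(w) for w in arriving_windows(5, rng)]
print(len(scores))  # one score per arriving window
```

In the actual demo, the stream is a Structured Streaming source and the model is applied to each micro-batch as it arrives.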
Toward the end, we held a Q&A session; the questions and answers are below.
Q: How exactly does WFDB help in this use case? Is the WFDB data stored within Databricks or on a different server?
WFDB is a standard file format for exchanging biomedical waveform data. In this example, WFDB is the interchange format in which the EKG data arrived, and the data is stored in the Databricks File System (DBFS). DBFS is a thin layer that manages metadata about data stored in the customer's Azure Blob Storage (on Azure Databricks) or Amazon S3 (on Databricks on AWS). In the workflow we demonstrated, we start by transforming the data from WFDB into a Delta Lake table.
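To illustrate the reshaping involved in that transformation, here is a small sketch that flattens a multi-lead waveform array (the shape a WFDB record's signal takes once loaded) into tabular rows, roughly the form rows take when written to a table. The column names and tiny synthetic signal are illustrative assumptions, not the webinar's exact schema:

```python
import numpy as np

def waveform_to_rows(record_id, signal, lead_names):
    """Flatten an (n_samples, n_leads) waveform array into tabular rows.
    Column layout here is illustrative, not the webinar's exact schema."""
    rows = []
    n_samples, n_leads = signal.shape
    for lead in range(n_leads):
        for i in range(n_samples):
            rows.append({
                "record_id": record_id,
                "lead": lead_names[lead],
                "sample_index": i,
                "value": float(signal[i, lead]),
            })
    return rows

# Tiny synthetic stand-in for a loaded waveform (4 samples, 2 leads)
signal = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])
rows = waveform_to_rows("patient001", signal, ["i", "ii"])
print(len(rows))  # 8 rows: one per (lead, sample) pair
```

In the demo, the resulting rows are written out as a Delta Lake table so that downstream streaming and training jobs can query them efficiently.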
Q: How do you determine the window size? And does the size affect performance?
We based our choice of a 2,048-sample window on a recent blog post analyzing this dataset. Intuitively, 2,048 samples is approximately two heartbeats at the sampling rate used in this dataset.
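Splitting a signal into fixed-size windows like this is straightforward with NumPy. A minimal sketch (the function name is ours; the demo's actual implementation may differ):

```python
import numpy as np

WINDOW = 2048  # roughly two heartbeats at this dataset's sampling rate

def split_into_windows(signal, window=WINDOW):
    """Split a 1-D signal into non-overlapping fixed-size windows,
    dropping any trailing partial window."""
    n = len(signal) // window
    return signal[: n * window].reshape(n, window)

# 10,000 samples -> 4 full windows of 2,048 samples (remainder dropped)
signal = np.arange(10_000, dtype=float)
windows = split_into_windows(signal)
print(windows.shape)  # (4, 2048)
```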
Q: Was any signal processing done on the data before ingestion?
In this example, we used data collected from the open-access PTB Diagnostic ECG Database. Limited signal processing was performed when the data was acquired. We did not perform additional signal processing after downloading the data.
Q: Does Databricks provide auto-keras support?
auto-keras is a Python library for automating neural network architecture optimization with the Keras deep learning library, which is preinstalled in the Databricks ML Runtime (AWS | Azure). auto-keras can be installed using Databricks library management features (AWS | Azure) and used on a Databricks cluster. Beyond auto-keras, we support a wide range of AutoML capabilities, which we covered in a recent blog post.
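For example, one way to make the library available for a single notebook session is a notebook-scoped install; this assumes the cluster has outbound internet access and uses the package's PyPI name:

```
%pip install autokeras
```

Alternatively, attach the library to the cluster through the library management UI so it is available to all notebooks on that cluster.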
Q: How do you monitor the clusters and where can I see the metrics for each job?
The Spark UI is displayed inline within notebooks while a Spark job is running, and it can also be accessed from the Databricks cluster UI (AWS | Azure). Additionally, we expose a wide range of metrics, such as those output by Ganglia (AWS | Azure).
- Watch the webinar replay to learn more
- Start exploring our deep learning pipeline for medical device data with the notebooks listed above
- Get started with a free trial of Databricks