Dynamic Healthcare Dataset Generation, Curation, and Quality with PySpark – Databricks

Population health research relies on carefully curated datasets for specific patient populations of interest. These datasets share many common elements, such as demographic and geographic information, but can also contain highly domain-specific data points that depend on the application of interest. Creating these datasets at scale poses additional problems, largely related to metadata management and quality control.

The data engineering team at Modernizing Medicine has addressed these challenges by creating an object-oriented framework for dynamic dataset generation using Python and Spark. The scalability and performance of Spark allow terabytes of data to be extracted from numerous application servers and combined into Parquet files for analysis in clinical research. While Spark and Parquet solve the scalability challenge, they leave much to be desired in the area of metadata management. The Python ecosystem helps in this regard by allowing object-oriented data manipulation and curation code to be developed fairly rapidly. Spark DataFrame transformations and actions are inherited from parent classes to minimize duplicate code and logic errors. JSON files define the schema of each dataset, along with the quality-control measures that each column must pass.
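The inheritance pattern described above might look roughly like the sketch below. This is an illustration only, not the framework presented in the talk: the class names, the JSON field names (`dataset`, `columns`, `checks`), and the `specialty` column are all hypothetical. The only Spark calls used are `DataFrame.filter` with a SQL expression string and `DataFrame.select`, both applied to a DataFrame passed in by the caller.

```python
import json

# Hypothetical JSON definition. The layout and field names are assumptions
# made for illustration, not the framework's actual format.
DEFINITION_JSON = """
{
  "dataset": "dermatology_visits",
  "columns": [
    {"name": "patient_id", "type": "string", "checks": ["not_null"]},
    {"name": "visit_date", "type": "date", "checks": ["not_null"]},
    {"name": "age", "type": "int", "checks": ["not_null", "non_negative"]}
  ]
}
"""


class BaseDataset:
    """Parent class holding logic shared by every dataset."""

    def __init__(self, definition_json):
        self.definition = json.loads(definition_json)

    def column_names(self):
        """Column names in the order given by the JSON definition."""
        return [col["name"] for col in self.definition["columns"]]

    def checks_by_column(self):
        """Map each column name to the list of check names it must pass."""
        return {
            col["name"]: col.get("checks", [])
            for col in self.definition["columns"]
        }

    def transform(self, df):
        """Hook for dataset-specific logic; the default is a no-op."""
        return df

    def build(self, df):
        """Shared pipeline: apply the subclass transform, then keep only
        the columns listed in the JSON definition."""
        return self.transform(df).select(*self.column_names())


class DermatologyDataset(BaseDataset):
    """Child class: overrides only the domain-specific step."""

    def transform(self, df):
        # DataFrame.filter accepts a SQL expression string.
        return df.filter("specialty = 'dermatology'")
```

With a Spark DataFrame `raw_df` in hand (also hypothetical), `DermatologyDataset(DEFINITION_JSON).build(raw_df)` would filter the rows and then project the defined columns. The column selection lives once in the parent class, so each new dataset only supplies its own `transform`.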

New quality checks can be added and maintained with familiar Python code through PySpark UDFs and Column operations. Data dictionaries are also generated from these JSON definitions to educate consumers of the data and track changes. Aaron Richter, data scientist, will present an overview of this framework and give a demo of how it is used in practice at Modernizing Medicine. He will also share lessons learned from building this project in Spark and Python.
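To make the data dictionary idea concrete, here is a minimal sketch that renders a dictionary from a JSON definition using only the Python standard library. The definition layout, the field names (`description`, `checks`), and the pipe-delimited output format are assumptions for illustration; the abstract does not specify how the real framework formats its dictionaries.

```python
import json

# Hypothetical dataset definition; field names are illustrative assumptions.
DEFINITION_JSON = """
{
  "dataset": "patient_demographics",
  "columns": [
    {"name": "patient_id", "type": "string",
     "description": "De-identified patient key", "checks": ["not_null"]},
    {"name": "age", "type": "int",
     "description": "Age in years at time of visit",
     "checks": ["not_null", "non_negative"]},
    {"name": "state", "type": "string",
     "description": "US state of residence"}
  ]
}
"""


def data_dictionary(definition):
    """Render one pipe-delimited row per column in the definition."""
    lines = [
        f"Data dictionary: {definition['dataset']}",
        "",
        "| Column | Type | Description | Quality checks |",
        "| --- | --- | --- | --- |",
    ]
    for col in definition["columns"]:
        # Columns without a "checks" entry are reported as "none".
        checks = ", ".join(col.get("checks", [])) or "none"
        lines.append(
            f"| {col['name']} | {col['type']} | "
            f"{col.get('description', '')} | {checks} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(data_dictionary(json.loads(DEFINITION_JSON)))
```

Because the dictionary is derived from the same file that drives the schema and the checks, committing the JSON definition to version control gives a change history for the documentation as well.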

Session hashtag: #Py3SAIS

About Aaron Richter

Aaron Richter is a data scientist at Modernizing Medicine and a PhD candidate at Florida Atlantic University. He has pioneered the use of Apache Spark and big data technologies at Modernizing Medicine over the past three years for ETL, de-identification, and population health research. Aaron's PhD research focuses on data mining and machine learning for clinical decision support applications.