Shannon graduated from the joint Carnegie Mellon-University of Pittsburgh Ph.D. program in Computational Biology in November 2014 and is now an assistant professor at the University of Georgia, with joint appointments in Computer Science and Cellular Biology. His research focuses on “big imaging,” data science, and distributed computing to make sense of large biomedical image datasets. He is a member of the Apache Software Foundation (ASF), specifically a Mahout committer, and his research group practices “open science,” open-sourcing code and data alongside publications.
Cilia are microscopic hairs that line the exterior of cells throughout the body, including the lungs, brain, and kidneys, beating synchronously to clear mucus and foreign matter. Proper beating is essential for pulmonary health; numerous serious health complications arise when cilia beat abnormally or not at all. Diagnosing these ciliary disorders involves visually examining live biopsies, but such manual analyses are highly subjective and error-prone. Furthermore, subtle differences in ciliary motion could indicate distinct underlying pathologies. An unbiased, computational method for analyzing ciliary motion is therefore clinically compelling. Here, we present a data-driven, unsupervised pipeline for creating a digital "library" of ciliary motion phenotypes and identifying distinct modes of motion. We describe the end-to-end PySpark pipeline: analyzing video data, decomposing motion into feature vectors for clustering, and identifying and validating discrete motion phenotypes. PySpark and its ecosystem play an essential role in addressing the challenges of isolating distinct patterns of ciliary motion from hundreds of gigabytes of raw data. We conclude with our future work in biomedical imaging and PySpark's dual role as a research tool and a potential framework for clinical and diagnostic assessment.
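To make the "decomposing motion into feature vectors" step concrete, here is a minimal sketch in plain Python. It is not the pipeline described in the talk: the choice of dominant beat frequency as a feature, the function names `dominant_frequency` and `video_features`, and the input layout (one intensity time series per pixel) are all assumptions made for illustration.

```python
import cmath
import math


def dominant_frequency(signal, fps):
    """Return the frequency (Hz) with the largest DFT magnitude.

    Hypothetical helper: `signal` is one pixel's intensity over time,
    `fps` is the video frame rate. The mean is subtracted first so the
    constant (DC) component does not dominate.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]
    best_k, best_mag = 0, 0.0
    # Naive O(n^2) DFT over positive frequencies up to Nyquist.
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * fps / n


def video_features(pixel_series, fps):
    """Summarize one video as [mean, std] of per-pixel dominant frequency.

    Assumed feature design, chosen only to illustrate turning a video
    into a fixed-length vector that a clustering step could consume.
    """
    freqs = [dominant_frequency(s, fps) for s in pixel_series]
    mean = sum(freqs) / len(freqs)
    var = sum((f - mean) ** 2 for f in freqs) / len(freqs)
    return [mean, math.sqrt(var)]


if __name__ == "__main__":
    fps, n = 200, 200
    # Synthetic "pixels" oscillating at 10 Hz and 12 Hz.
    px_a = [math.sin(2 * math.pi * 10 * t / fps) for t in range(n)]
    px_b = [math.sin(2 * math.pi * 12 * t / fps) for t in range(n)]
    print(video_features([px_a, px_b], fps))
```

In a PySpark setting, a function of this shape could be applied to each video independently with `RDD.map`, and the resulting vectors passed to whatever clustering method the analysis calls for. A real implementation would likely use an FFT library rather than the naive transform shown here.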