Josh Snyder is global technical and product lead for high-content screening data management and analytics at the Novartis Institutes for BioMedical Research. He has thirteen years of experience designing software to advance healthcare research, particularly in the clinical imaging domain.
Drug discovery research projects typically start with an assay: a model system developed for testing hypotheses about the biology of a disease. The assay is used to identify chemical or biological samples that could have a positive therapeutic effect. Promising samples, or "hits," are selected for additional research, with the ultimate goal of discovering safe and effective new therapies for the clinic. One common approach to finding hits is to screen an entire library of samples against the assay. This is bench science at scale, requiring laboratory automation to process up to millions of samples for a single assay. Advances in bioscience and technology are rapidly yielding increasingly sophisticated assay models with high-content readouts. For example, every sample tested might influence an independent culture of thousands of cells, and every cell might be imaged and measured to derive a feature vector with thousands of dimensions. A single screening assay used today might thus produce trillions of data points and require complex analytics for hit finding: normalization against controls, reduction of highly correlated features, multi-parametric classification, and more. We present a generalizable platform that leverages Spark and complementary technologies for distributed analytics and interactive visualization of large, high-dimensional screening data. We will provide an overview of the system, a sense of the data scale, a discussion of our design for both batch and interactive use cases, and detail on the computational methods we've implemented using Spark.
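To give a rough sense of the data scale and of what "normalization against controls" can mean, here is a minimal Python sketch. It is not the platform's implementation. The counts (one million samples, one thousand cells per sample, one thousand features per cell) are illustrative stand-ins for "millions" and "thousands," and the z-score against control wells is one common normalization choice assumed here for illustration. The function names are hypothetical.

```python
from statistics import mean, stdev


def data_points(samples, cells_per_sample, features_per_cell):
    """Total measurements produced by one screen (simple product)."""
    return samples * cells_per_sample * features_per_cell


def normalize_against_controls(values, control_values):
    """Express each value as a z-score relative to the control wells.

    Assumption: controls are summarized by their mean and sample
    standard deviation. Other schemes (for example median-based ones)
    are equally plausible.
    """
    mu = mean(control_values)
    sigma = stdev(control_values)
    if sigma == 0:
        raise ValueError("control values have zero spread")
    return [(v - mu) / sigma for v in values]


# Illustrative order-of-magnitude counts, not figures from a real screen.
total = data_points(1_000_000, 1_000, 1_000)
print(f"{total:.0e} data points")

# One feature measured on three control wells and two sample wells.
controls = [1.0, 2.0, 3.0]
print(normalize_against_controls([3.0, 2.0], controls))
```

At this scale the same per-feature calculation would be applied in a distributed fashion, which is the role Spark plays in the platform described above.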