Victor Hong is a senior application architect on the data engineering team at the Novartis Institutes for BioMedical Research, where he contributes to multiple big data projects in various roles, including architecture, data modeling, and application development. Victor previously worked at Nokia as a principal software engineer, building data analytics systems for mobile phone usage. Victor holds master's degrees in computer science and geophysics from the University of Illinois at Chicago, and a bachelor's degree from Fudan University.
Drug discovery research projects typically start with an assay: a model system developed for testing hypotheses about the biology of a disease. The assay is used to identify chemical or biological samples that could potentially have a positive therapeutic effect. Promising samples, or "hits," are selected for additional research activities with the ultimate goal of discovering safe and effective new therapies for the clinic. One common approach to finding hits is screening an entire library of samples against the assay. This is bench science at scale, requiring laboratory automation to process up to millions of samples for a single assay. Advances in the biosciences and in laboratory technology are rapidly providing increasingly sophisticated assay models with high-content readouts. For example, every sample tested might influence an independent culture of thousands of cells, and every cell might be imaged and measured to derive feature vectors with thousands of dimensions. A single screening assay used today might thus produce trillions of data points and require complex analytics for hit finding: normalization against controls, reduction of highly correlated features, multi-parametric classification, and more.

We present a generalizable platform that leverages Spark and complementary technologies for distributed analytics and interactive visualization of large, high-dimensional screening data. We will provide an overview of the system, a sense of the data scale, a discussion of our design for both batch and interactive use cases, and detail on the computational methods we've implemented using Spark.
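To make the first analytics step concrete, here is a minimal sketch of one widely used form of normalization against on-plate controls (percent-of-control scaling). This is an illustrative assumption, not the platform's actual implementation; the function name and the control values are hypothetical, and a production pipeline would apply the same arithmetic per plate across a distributed dataset.

```python
from statistics import mean

def normalize_percent_effect(sample_values, neg_controls, pos_controls):
    """Rescale raw assay readouts so the negative-control mean maps to 0%
    effect and the positive-control mean maps to 100%. A common first
    step before hit selection; names and data here are illustrative."""
    neg = mean(neg_controls)
    pos = mean(pos_controls)
    return [100.0 * (v - neg) / (pos - neg) for v in sample_values]

# Hypothetical raw readouts from three wells on one plate
raw = [10.0, 55.0, 100.0]
normalized = normalize_percent_effect(
    raw, neg_controls=[8.0, 12.0], pos_controls=[98.0, 102.0]
)
print(normalized)  # → [0.0, 50.0, 100.0]
```

In a Spark pipeline the same transformation would typically be expressed as a grouped aggregation (control means per plate) joined back to the sample rows, so that each plate is normalized independently.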