This presentation reports our experience using machine learning techniques in the Apache Spark ecosystem to understand user behavior in a number of applications. In this context, Spark makes the vast computing power of a large high-performance computing system available to behavioral economists without requiring these application scientists to learn parallel computing. To illustrate the effectiveness of this approach, we focus on the compute-intensive task of establishing a baseline for studying the impact of policies on consumer behavior. The gold standard for this type of baseline is a randomized control group; however, a control group can only provide a group-level reference, not one for individual consumers. In many cases, self-selection bias along with other factors can make it extremely difficult to generate an unbiased control group. By harnessing the computing power of Spark, we are able to learn the behavior pattern of each individual user and therefore create a much more precise baseline for behavioral analysis. We will use two use cases to illustrate the approach: a residential electricity usage study and a traffic pattern prediction study.
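As a minimal sketch of the individual-baseline idea, the following plain-Python example (hypothetical data; a stand-in for the actual Spark pipeline, which the abstract does not detail) contrasts a single group-level baseline with a per-user baseline learned from each user's own history:

```python
from statistics import mean

# Hypothetical pre-policy electricity readings (kWh) per user.
history = {
    "user_a": [1.2, 1.4, 1.1, 1.3],
    "user_b": [3.0, 2.8, 3.2, 3.1],
}
# Hypothetical post-policy readings for the same users.
post = {"user_a": 1.0, "user_b": 2.5}

# Group-level baseline: one average for everyone,
# analogous to what a control group provides.
group_baseline = mean(v for readings in history.values() for v in readings)

# Individual baselines: each user's own usage pattern.
# Here this is just a per-user mean; the actual study would
# fit a behavioral model per user at scale on Spark.
individual_baseline = {u: mean(r) for u, r in history.items()}

# Estimated savings relative to each kind of baseline.
savings_group = {u: group_baseline - x for u, x in post.items()}
savings_individual = {u: individual_baseline[u] - x for u, x in post.items()}
```

Because usage levels differ widely across users, the group-level estimate misstates the change for both users, while the per-user baseline attributes each change to that user's own history.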
Dr. Wu works actively on a number of topics in data management, data analysis, and high-performance computing. His algorithmic research includes bitmap indexing techniques for searching large datasets, statistical methods for extracting features from a variety of data, and restarting strategies for computing extreme eigenvalues. He is the developer of a number of software packages, including IDEALEM, SDS, FastBit, and TRLan. Among them, the FastBit software for indexing large datasets has earned an R&D 100 Award and is used by many organizations. For example, a German bioinformatics company uses FastBit to accelerate its molecular docking software by hundreds of times.