This presentation reports our experience using machine learning techniques in the Apache Spark ecosystem to understand user behavior in a number of applications. In this context, Spark makes the vast computing power of a large high-performance computing system available to behavioral economists without requiring these application scientists to learn parallel computing. To illustrate the effectiveness of this approach, we focus on a compute-intensive task: establishing a baseline for studying the impact of policies on consumer behavior.
For behavioral analytics, the gold standard for this type of baseline is a randomized control group. However, a control group can only provide a group-level reference, not a reference for individual consumers. In many cases, self-selection bias and other factors make it extremely difficult to generate an unbiased control group. By harnessing the computing power of Spark, we are able to learn the behavior pattern of each individual user and therefore create a much more precise baseline for behavioral analysis. We illustrate the approach with two use cases: a residential electricity usage study and a traffic pattern prediction study.
Dr. Wu works actively on a number of topics in data management, data analysis, and high-performance computing. His algorithmic research includes bitmap indexing techniques for searching large datasets, statistical methods for extracting features from a variety of data, and restarting strategies for computing extreme eigenvalues. He is the developer of a number of software packages, including IDEALEM, SDS, FastBit, and TRLan. Among them, the FastBit software for indexing large datasets has earned an R&D 100 Award and is used by many organizations. For example, a German bioinformatics company uses FastBit to accelerate its molecular docking software by hundreds of times.