Bootstrapping of PySpark Models for Factorial A/B Tests

May 27, 2021 11:00 AM (PT)


A/B testing, i.e., measuring the impact of proposed variants of, for example, an e-commerce website, is fundamental for increasing conversion rates and other key business metrics.

We have developed a solution that makes it possible to run dozens of simultaneous A/B tests, obtain conclusive results sooner, and get results that are more interpretable than bare statistical significance, such as the probability that a change has a positive effect and how much revenue is at risk.
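As a rough illustration of these more interpretable outputs, the sketch below summarizes a set of draws of the effect (variant minus control). The function name `summarize_effect`, the toy numbers, and the reading of "revenue at risk" as an expected loss are assumptions made for this example, not details taken from the talk.

```python
from statistics import mean


def summarize_effect(effect_samples):
    """Summarize draws of the effect (variant minus control), e.g. revenue per visitor."""
    if not effect_samples:
        raise ValueError("need at least one sample")
    # Share of draws in which the variant beats the control.
    prob_positive = mean(1.0 if s > 0 else 0.0 for s in effect_samples)
    # Average shortfall if the variant is shipped; draws where the
    # variant wins contribute zero loss.
    expected_loss = mean(max(-s, 0.0) for s in effect_samples)
    return {"prob_positive": prob_positive, "expected_loss": expected_loss}


# Toy draws, invented for illustration only.
samples = [0.12, -0.03, 0.08, 0.05, -0.01, 0.10, 0.02, 0.07]
summary = summarize_effect(samples)
```

In practice the draws would come from the estimated distribution of the metric rather than from a hand-written list.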

To compute those metrics, we need to estimate the posterior distributions of the metrics, which are computed using Generalized Linear Models (GLMs). Because we process gigabytes of data, we use a PySpark implementation, which, however, does not provide standard errors of the coefficients. We therefore use bootstrapping to estimate the distributions.
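The bootstrapping idea can be sketched on a single machine with the standard library. This is a minimal sketch under stated assumptions, not the PySpark pipeline from the talk: the model is a logistic GLM with one 0/1 treatment indicator, whose coefficient has a closed form (the log odds ratio), and the helper names, the 0.5 continuity correction, and the toy data are all invented for illustration.

```python
import math
import random
import statistics


def log_odds_ratio(rows):
    """Treatment coefficient of a logistic GLM with a single 0/1 indicator.

    For this simple model the fitted coefficient equals the log odds ratio.
    The 0.5 added to each cell is an assumed continuity correction that
    avoids division by zero in sparse resamples; it slightly smooths the
    estimate.
    """
    counts = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.5}
    for variant, converted in rows:
        counts[(variant, converted)] += 1
    control_odds = counts[(0, 1)] / counts[(0, 0)]
    variant_odds = counts[(1, 1)] / counts[(1, 0)]
    return math.log(variant_odds / control_odds)


def bootstrap_coefficient(rows, n_replicates=500, seed=0):
    """Refit the coefficient on resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(n_replicates):
        resample = [rows[rng.randrange(n)] for _ in range(n)]
        estimates.append(log_odds_ratio(resample))
    return estimates


# Toy data as (variant, converted) pairs: 10% vs. 13% conversion.
rows = [(0, 1)] * 100 + [(0, 0)] * 900 + [(1, 1)] * 130 + [(1, 0)] * 870
point_estimate = log_odds_ratio(rows)
estimates = bootstrap_coefficient(rows, n_replicates=300, seed=42)
# The spread of the replicates stands in for the missing standard error.
std_error = statistics.stdev(estimates)
```

The list of replicate estimates approximates the distribution of the coefficient, so it can be used both for a standard error and for summaries such as the share of replicates above zero.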

In this talk, I'll describe how we parallelized an already parallelized GLM computation so that it scales horizontally over a large cluster in Databricks, and I'll go through the various tweaks we made and how they improved performance.
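One generic way to run many independent bootstrap fits at once is a driver-side thread pool, where each thread launches one fit and waits for it. The abstract does not say which mechanism the talk uses, so the sketch below is only an assumption. `fit_replicate` is a hypothetical stand-in that simulates a fit locally instead of calling Spark.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fit_replicate(seed):
    """Hypothetical stand-in for one bootstrap fit.

    In a cluster setting this would resample the data and start a
    distributed model fit; here it only averages seeded random draws.
    """
    rng = random.Random(seed)
    draws = [rng.gauss(0.3, 1.0) for _ in range(1000)]
    return sum(draws) / len(draws)


def run_replicates(n_replicates, max_workers=4):
    """Run the replicates concurrently, one seed per replicate."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, so result i belongs to seed i.
        return list(pool.map(fit_replicate, range(n_replicates)))


results = run_replicates(20)
```

Threads do not speed up pure-Python computation like this stand-in; the pattern pays off only when each call mostly waits on work done elsewhere, such as a job running on cluster executors.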

In this session watch:
Ondrej Havlicek, Data Scientist, DataSentics

 

Ondrej Havlicek

Ondrej has a background in computer science and cognitive science. During his academic career, he conducted behavioral and neuroimaging experiments and analyzed their data. Since his moving to indust...