Ondrej has a background in computer science and cognitive science. During his academic career, he conducted behavioral and neuroimaging experiments and analyzed their data. Since moving to industry, he has been developing ML solutions, such as A/B testing and recommendation engines, for various clients, e.g. in e-commerce and social media marketing. Ondrej likes fiddling with statistical and ML models and building and optimizing (Py)Spark pipelines.
May 27, 2021 11:00 AM PT
A/B testing, i.e., measuring the impact of proposed variants of a product such as an e-commerce website, is fundamental for increasing conversion rates and other key business metrics.
We have developed a solution that makes it possible to run dozens of simultaneous A/B tests, obtain conclusive results sooner, and get results that are more interpretable than bare statistical significance: the probability that a change has a positive effect, how much revenue is at risk, etc.
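As a rough illustration of what such metrics can look like, here is a minimal sketch that summarizes a list of sampled effects (variant minus control) for one metric. The function name `summarize_effect` and the exact definitions are assumptions for illustration, not necessarily those used in the talk: the probability of a positive effect is taken as the share of samples above zero, and the risk as the average shortfall over all samples.

```python
from statistics import fmean


def summarize_effect(effect_samples):
    """Summarize sampled effects (variant - control) for one metric.

    Hypothetical helper: the definitions below are one common choice,
    not necessarily the ones used in the talk.
    """
    samples = list(effect_samples)
    if not samples:
        raise ValueError("need at least one sample")
    # Share of samples in which the variant beats the control.
    prob_positive = sum(1 for e in samples if e > 0) / len(samples)
    # Average shortfall if the variant is shipped: samples where the
    # variant is better contribute zero, the others contribute their loss.
    expected_loss = fmean([max(0.0, -e) for e in samples])
    return {"prob_positive": prob_positive, "expected_loss": expected_loss}
```

If the samples are differences in revenue per visitor, `expected_loss` is expressed in the same currency units, which is what makes it easier to communicate than a p-value.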
To compute those metrics, we need to estimate the posterior distributions of the metrics, which are computed using Generalized Linear Models (GLMs). Since we process gigabytes of data, we use a PySpark implementation, which, however, does not provide standard errors of the coefficients. We therefore use bootstrapping to estimate the distributions.
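The bootstrapping idea can be sketched in a few lines of plain Python. This is a simplified stand-in, not the PySpark implementation from the talk: it assumes the simplest possible GLM (Gaussian family, identity link, a single binary treatment indicator), for which the fitted treatment coefficient equals the difference in group means. The names `fit_treatment_coefficient` and `bootstrap_coefficient` are hypothetical, and resampling within each group is a choice made here so that neither group can end up empty.

```python
import random
from statistics import fmean


def fit_treatment_coefficient(control, variant):
    """Stand-in for a GLM fit (assumption: Gaussian family, identity link,
    one binary regressor), where the treatment coefficient is simply the
    difference in group means."""
    return fmean(variant) - fmean(control)


def bootstrap_coefficient(control, variant, n_resamples=1000, seed=0):
    """Approximate the distribution of the coefficient by refitting
    on resampled data."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping group sizes fixed.
        c = rng.choices(control, k=len(control))
        v = rng.choices(variant, k=len(variant))
        draws.append(fit_treatment_coefficient(c, v))
    return draws
```

The spread of the returned draws plays the role of the missing standard error, and the draws themselves can be used directly as samples of the effect.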
In this talk, I'll describe how we parallelized an already parallelized GLM computation so that it scales horizontally over a large cluster in Databricks, and I'll go through various tweaks and how they improved performance.
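The abstract does not say how the outer level of parallelism is implemented. One common pattern, shown here only as an assumption, is to submit several independent fits from separate driver-side threads so that the cluster scheduler can run them concurrently. In the sketch below, `run_fits_concurrently` is a hypothetical helper and `fit_one` stands in for a function that would launch one bootstrap GLM fit.

```python
from concurrent.futures import ThreadPoolExecutor


def run_fits_concurrently(fit_one, seeds, max_workers=4):
    """Run fit_one(seed) for every seed using a pool of threads.

    Hypothetical helper: fit_one is a placeholder for a call that fits
    one bootstrap replicate (e.g. by launching a distributed job).
    Results are returned in the same order as the seeds.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fit_one, seeds))
```

Threads are sufficient in this sketch because each call is assumed to spend its time waiting on remote work rather than computing locally; `max_workers` then controls how many fits are in flight at once.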