Mike Dias - Databricks

Mike Dias

Data Engineer, Atlassian


Building Understanding Out of Incomplete and Biased Datasets using Machine Learning and Databricks

Summit 2020

At Atlassian, product analytics exists to help our teams build better products by capturing and describing in-product behaviour. Within our on-premise products, only a subset of customers choose to send us anonymised event data, so the dataset we work with is incomplete and biased. In this setting, even a question as simple as 'what percentage of customers use feature X' becomes a non-trivial estimation task. It becomes more complex still when a metric is subadditive, such as monthly active users: one user active on multiple (and possibly unknown) instances should be counted only once, and our estimation needs to account for this. In this talk, we'll dive into our estimation methods and the adjustments we make for various metrics, providing an accessible guide to operating in this environment. We'll also discuss how we democratised these estimation methods, allowing any stakeholder who can write a query to immediately access our models and create accurate and consistent estimates.
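
To make the 'what percentage of customers use feature X' problem concrete, here is a minimal sketch of one common correction for a biased opt-in sample, post-stratification. It is an illustration only and is not taken from the talk: the strata (`small`, `large`), the counts, and the function name `poststratified_share` are all hypothetical, and the sketch assumes the total number of customers in each stratum is known from another source such as licence records.

```python
from collections import defaultdict


def poststratified_share(population_counts, sample):
    """Estimate the share of ALL customers using a feature.

    population_counts: {stratum: total customers in that stratum} (assumed known)
    sample: iterable of (stratum, uses_feature) for customers who send data
    """
    reporting = defaultdict(int)  # reporting customers per stratum
    users = defaultdict(int)      # reporting customers using the feature

    for stratum, uses_feature in sample:
        reporting[stratum] += 1
        users[stratum] += int(uses_feature)

    total = sum(population_counts.values())
    estimate = 0.0
    for stratum, size in population_counts.items():
        if reporting[stratum] == 0:
            raise ValueError(f"no reporting customers in stratum {stratum!r}")
        # Weight each stratum's observed rate by its share of the population,
        # not by its share of the opt-in sample.
        estimate += (size / total) * (users[stratum] / reporting[stratum])
    return estimate


# Hypothetical data: large customers are over-represented among reporters.
population = {"small": 800, "large": 200}
sample = (
    [("small", True)] * 10 + [("small", False)] * 30
    + [("large", True)] * 45 + [("large", False)] * 15
)

naive = sum(uses for _, uses in sample) / len(sample)
adjusted = poststratified_share(population, sample)
print(f"naive: {naive:.2f}, post-stratified: {adjusted:.2f}")
```

With these made-up numbers the naive sample rate is 0.55, while the reweighted estimate is about 0.35, because the heavily reporting `large` stratum makes up only a fifth of the customer base. The sketch does not address subadditive metrics such as monthly active users, which need a further adjustment for users who appear on several instances.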