I am the Product Manager for open source efforts at Databricks. Prior experience includes Spark and Data Science Architect at Hortonworks, Principal Research Scientist at Yahoo focused on large scale data mining and machine learning for search and display advertising. I am an Apache Spark PMC Member and Committer.
Next generation sequencing is becoming cheaper and more accessible. The volume of data sequenced is increasing faster than Moore’s Law. However, it is still expensive and slow to go from raw reads to variant calls, and to produce annotated variants that can then be analyzed downstream. In this talk, we will discuss the first state of the art, scalable and simple DNA sequencing workflow that is built on top of Apache Spark and the Databricks APIs. The pipeline is simple to set up, is easy to scale out, and can sequence a 30x coverage genome cost efficiently on the cloud. We will introduce the problem of alignment and variant calling on whole genomes, discuss the challenges of building a simple yet scalable pipeline and demonstrate our solution. This talk should be of interest to developers wishing to build ETL pipelines on top of Apache Spark, as well as biochemists and molecular biologists who wish to learn how to develop cheap and fast DNA sequencing pipelines.
Geospatial data is pervasive, and spatial context is a very rich signal of user intent and relevance in search and targeted advertising and an important variable in many predictive analytics applications. For example when a user searches for "canyon hotels", without location awareness the top result or sponsored ads might be for hotels in the town "Canyon, TX". However, if they are are near the Grand Canyon, the top results or ads should be for nearby hotels. Thus a search term combined with location context allows for much more relevant results and ads. Similarly a variety of other predictive analytics problems can leverage location as a context. To leverage spatial context in a predictive analytics application requires us to be able to parse these datasets at scale, join them with target datasets that contain point in space information, and answer geometrical queries efficiently. In this talk, we describe the motivation and the internals of an open source library that we are building for Geospatial Analytics using Spark SQL, DataFrames and Catalyst as the underlying engine. We outline how we leverage Catalyst's pluggable optimizer to efficiently execute spatial joins, how SparkSQL's powerful operators allow us to express geometric queries in a natural DSL, and discuss some of the geometric algorithms that we implemented in the library. We also describe the Python bindings that we expose, leveraging Pyspark's Python integration.
Suppose you have a large volume of point in space data (think mobile GPS coordinates). You want to join this dataset with shapes (be it neighborhoods in New York boroughs, the road system in NYC, the canal systems in Amsterdam, what have you). How do you do this join at scale? The lack of spatial join implementations in open source geospatial analytics libraries is one of the biggest impediments to leveraging geospatial context for rich predictive analytics, and our goal in this talk is to show how we are solving this problem using Magellan and Spark. Magellan is a newly open sourced geospatial analytics engine written on top of Spark and is the first such engine to deeply leverage Spark SQL, Dataframes and Catalyst to provide very efficient spatial analytics on top of Spark. In this talk we will focus on one specific aspect of Magellan, which is, how does Magellan implement Spatial Joins, and where does it leverage Spark SQL to do this efficiently and transparently? The talk should be of interest to developers who wish to understand how to leverage Spark SQL in richer ways than before, those interested in writing specialized analytics engines on top of Spark SQL, and Data Scientists and Data Engineers who wish to perform spatial analytics processing or predictive analytics on geospatial datasets at scale.
Most traditional applications of Spark involve massive data-sets that already exist. A less-commonly encountered use-case, but nevertheless extremely useful, is in Simulations, where massive amounts of data are generated based on model parameters. In this talk we explore some of the challenges that arise in setting up scalable simulations in a specific application, and share some of our solutions and lessons learned along the way, in the realms of mathematics and programming. The application scenario we explore is to quantify the impact of cookie-contamination in randomized experiments aimed at measuring digital advertisement lift/effectiveness. Cookies are randomly assigned to test or control, and those in test are exposed to ads while those in control are not. The goal is to measure the lift in conversion-rate due to ad-exposure. One important factor that taints such measurements is cookie-contamination: a real-world user may have multiple cookies (but the system is unaware of this linkage), and if their cookies are in both test and control groups, then the cookie in control may show a higher conversion rate than that of a clean control cookie that has no "siblings" in the test group. Analytically quantifying the impact of this contamination is difficult without making overly simplistic assumptions, and one idea we pursued is to simulate the impact of cookie-contamination, with millions of trials over 10s of millions of users. The goals are: (a) understand/quantify the impact of cookie distribution and contamination, on the expected value of the computed lift as well as the 90% confidence interval, and (b) derive approximate analytical formulas for the observed lift. Scaling up the simulations to a large of trials and users is challenging, and we share some of our solutions, and also describe the analysis of error and expectation.
Structured Streaming is a new API in Spark 2.0 that simplifies the end to end development of continuous applications. One such continuous application is online model updates: Online models are incrementally updated with new data and can be continuously queried while being updated. As a result, they can be fast to train and leverage new data faster than offline algorithms. In this talk, we give a brief introduction the area of online learning and describe how online model updates can be built using structured streaming APIs. The end result is a robust pipeline for updating models that is scalable, fast and fault tolerant.