How do we encourage data scientists and software engineers to use Spark for interactive data analysis? How do we then streamline the process of deploying existing code as scheduled or streaming jobs in production? What training and implementation steps are required? What tradeoffs are necessary? At Uber, we've used Mesos, Docker, Elasticsearch, Jupyter/IPython Notebooks, Leaflet.js, and PySpark to maintain a multi-user platform for interactive data analysis and deployment. More details:

1. The old approach: dumping data from relational tables, sampling, and running ad hoc scripts with Python and related libraries. Spark provides unified APIs that replace this approach.
2. How we organize our data, including Kafka streams, Parquet tables in Hive/HDFS, and traditional relational databases.
3. How our solution is implemented: how users collaborate with Jupyter Notebooks, share resources in a Mesos cluster, deploy and monitor their production jobs, and install third-party Python libraries. Use cases include geospatial queries and machine learning with MLlib and spark-sklearn (see the sketch after this list).
4. Discussion of key operational issues, e.g., Mesos fine-grained mode, Parquet partition sizes, number of partitions, executor memory, and debugging failures. We reiterate that Spark is not a database.
5. Alternatives, e.g., YARN, Scala, and Zeppelin.
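To make item 3 more concrete, below is a minimal sketch of the kind of interactive PySpark session a notebook user might run against a Parquet table in HDFS: read the table, apply a simple bounding-box (geospatial-style) filter, and aggregate by hour. The table path, column names, and coordinates are hypothetical, and the snippet assumes Spark 2.x's SparkSession API rather than any specific setup at Uber.

```python
# Sketch of an interactive PySpark analysis in a Jupyter notebook.
# Paths, column names, and bounding-box values below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("interactive-analysis-sketch")
         .getOrCreate())

# Read a Parquet table from HDFS instead of dumping and sampling a relational table.
trips = spark.read.parquet("hdfs:///warehouse/trips.parquet")  # hypothetical path

# Simple geospatial-style filter: keep pickups inside a bounding box around San Francisco.
sf_trips = trips.filter(
    F.col("pickup_lat").between(37.70, 37.82) &
    F.col("pickup_lng").between(-122.52, -122.35)
)

# Aggregate trip counts by hour of day; the result is small enough to plot in the notebook.
hourly = (sf_trips
          .groupBy(F.hour("pickup_time").alias("hour"))
          .agg(F.count("*").alias("trips"))
          .orderBy("hour"))

hourly.show()
```

The same DataFrame code can later be packaged and submitted as a scheduled job, which is the deployment path the talk covers.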
Dara is a software engineer at Uber, where he works on backend systems responsible for the real-time execution and online optimization of Uber's marketplace. He previously worked at Synthicity, a developer of urban planning and real estate software that was acquired by Autodesk, and received his bachelor's degree in civil engineering from UC Berkeley.