Spark: Interactive To Production

How do we encourage data scientists and software engineers to use Spark for interactive data analysis? How do we then streamline the process of deploying existing code as scheduled or streaming jobs in production? What training and implementation steps are required? What tradeoffs are necessary? At Uber, we’ve used Mesos, Docker, ElasticSearch, Jupyter/IPython Notebooks, Leaflet.js, and PySpark to maintain a multi-user platform for interactive data analysis and deployment. More details:

1. The old approach: dumping data from relational tables, sampling, and running ad-hoc scripts with Python and related libraries. Spark provides unified APIs that replace this approach (sketched in the first example after this list).
2. How we organize our data, including Kafka streams, Parquet tables in Hive/HDFS, and traditional relational databases (a streaming read is sketched below).
3. How our solution is implemented: how users collaborate with Jupyter Notebooks, share resources in a Mesos cluster, deploy and monitor their production jobs, and install third-party Python libraries. Use cases include geospatial queries and machine learning with MLlib and spark-sklearn (see the grid-search sketch below).
4. Discussion of key operational issues, e.g., Mesos fine-grained mode, Parquet partition sizes, number of partitions, executor memory, and debugging failures (see the configuration sketch below). We reiterate that Spark is not a database.
5. Alternatives, e.g., YARN, Scala, and Zeppelin.
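
For illustration, here is a minimal PySpark sketch of the "unified APIs" point in item 1: the same DataFrame interface reads a relational table over JDBC and a Parquet table from the Hive metastore. The JDBC URL, credentials, and table names are hypothetical, not Uber's actual schema.

```python
# A minimal sketch of the "unified APIs" idea (item 1), assuming a hypothetical
# Postgres database and Hive table; none of these names come from the talk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("interactive-analysis").getOrCreate()

# Read a traditional relational table over JDBC instead of dumping it to files.
trips_rdbms = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/analytics")  # hypothetical
    .option("dbtable", "trips")                                  # hypothetical
    .option("user", "reader")
    .option("password", "secret")
    .load())

# Read a Parquet table registered in the Hive metastore through the same API.
trips_history = spark.read.table("warehouse.trips_history")     # hypothetical

# The same DataFrame operations work on either source, with no sampling step.
daily_counts = (trips_history
    .groupBy("city_id", "trip_date")
    .count()
    .orderBy("trip_date"))
daily_counts.show(10)
```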
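Item 2 also mentions Kafka streams; a hedged sketch of consuming one with the DStream API available in the Spark 1.x/2.x era (pyspark.streaming.kafka) might look like the following. The topic name and broker address are placeholders.

```python
# A sketch of consuming a Kafka topic with Spark Streaming's DStream API;
# the topic and broker are placeholders, not Uber's actual infrastructure.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="trip-events-stream")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["trip_events"],                                     # hypothetical topic
    kafkaParams={"metadata.broker.list": "kafka-broker:9092"})  # hypothetical broker

# Each record arrives as a (key, value) pair; count events per batch.
stream.map(lambda kv: kv[1]).count().pprint()

ssc.start()
ssc.awaitTermination()
```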
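For item 3, spark-sklearn distributes a scikit-learn grid search across a Spark cluster while keeping scikit-learn's familiar interface. A minimal sketch, using a toy dataset rather than anything from the talk:

```python
# A sketch of distributing a scikit-learn grid search with spark-sklearn;
# the dataset and parameter grid are illustrative only.
from sklearn import datasets, svm
from pyspark import SparkContext
from spark_sklearn import GridSearchCV

sc = SparkContext.getOrCreate()
digits = datasets.load_digits()

param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01]}

# spark-sklearn's GridSearchCV mirrors scikit-learn's interface but fans the
# cross-validation fits out across the Spark cluster.
search = GridSearchCV(sc, svm.SVC(), param_grid)
search.fit(digits.data, digits.target)
print(search.best_params_)
```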
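For item 4, these operational knobs are typically set through Spark configuration. The values below are placeholders showing where each setting lives, not recommendations from the talk.

```python
# A sketch of the configuration surface behind the operational issues in item 4;
# all values are placeholders.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
    .set("spark.mesos.coarse", "true")            # avoid fine-grained Mesos mode
    .set("spark.executor.memory", "8g")           # size executors for the workload
    .set("spark.sql.shuffle.partitions", "400"))  # control shuffle partition count

spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Repartition before writing so Parquet output lands in reasonably sized files.
df = spark.read.table("warehouse.trips_history")  # hypothetical table
df.repartition(200).write.mode("overwrite").parquet("/tmp/trips_history_copy")
```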

About Dara Adib

Dara is a software engineer at Uber, where he works on backend systems responsible for the real-time execution and online optimization of Uber's marketplace. He previously worked for Synthicity, a developer of urban planning and real estate software acquired by Autodesk, and received his Bachelor's in Civil Engineering from UC Berkeley.