Bill Chambers is a product manager at Databricks, where he works on Structured Streaming and data science products. He is the lead author of Spark: The Definitive Guide, coauthored with Matei Zaharia. Bill holds a master's degree in Information Management and Systems from UC Berkeley's School of Information. During his time at school, Bill was also the creator of the Data Analysis in Python with pandas course for Udemy and the co-creator of, and first instructor for, Python for Data Science, part of UC Berkeley's Master of Information and Data Science.
Running Spark and Python data science workloads can be challenging given the complexity of the various data science tools in the ecosystem, such as scikit-learn, TensorFlow, Spark, pandas, and MLlib. Each of these tools and architectures involves important trade-offs when moving from proof of concept to production. While a proof of concept may be relatively straightforward, going to production is harder because it requires understanding not just the short-term effort to develop a solution, but the long-term cost of supporting it. This talk will discuss important tactical patterns for evaluating projects, running proofs of concept that inform the move to production, and, finally, the key tactics we use internally at Databricks to take data and machine learning projects into production. The session will cover architectural choices involving Spark, PySpark, pandas, notebooks, and various machine learning toolkits, as well as the frameworks and technologies necessary to support them. Key takeaways will include: 1. how best to organize projects given a variety of tools, 2. how to better understand the trade-offs between single-node and distributed training of machine learning models, and 3. how we organize and execute data science projects internally at Databricks.
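The single-node versus distributed training trade-off mentioned above can be sketched as a simple sizing heuristic. This is an illustrative assumption, not Databricks guidance: if the training data fits comfortably in one machine's memory, a single-node stack (pandas + scikit-learn) is usually simpler and faster to iterate on; only when the data outgrows one machine does distributed training with Spark MLlib pay for its coordination overhead. The `headroom` parameter is a hypothetical tuning knob.

```python
def choose_training_mode(dataset_gb: float, node_ram_gb: float,
                         headroom: float = 0.5) -> str:
    """Suggest a training mode based on dataset size vs. node memory.

    `headroom` is the fraction of a node's RAM we assume is safely
    usable for the working set (a hypothetical rule of thumb).
    """
    if dataset_gb <= node_ram_gb * headroom:
        # Fits on one machine: prefer the simpler single-node stack.
        return "single-node (e.g. pandas + scikit-learn)"
    # Exceeds one machine: distributed training becomes worthwhile.
    return "distributed (e.g. Spark MLlib)"

print(choose_training_mode(10, 64))   # small data on a 64 GB node
print(choose_training_mode(500, 64))  # data far exceeding one node
```

In practice the decision also involves iteration speed, team skills, and operational cost, but a memory-based rule of thumb like this is often the starting point.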
Running data science workloads is challenging regardless of whether you run them on your laptop, on an on-premises cluster, or in the cloud. While buying a fully managed service is an option, such tools can be expensive and lack extensibility. Therefore, many companies opt for open source data science tools like scikit-learn and Apache Spark's MLlib in order to balance functionality and cost. However, even if a project succeeds at a point in time with a given set of tools, it becomes harder and harder to maintain as data volumes increase and the desire for real time pushes the technology to its limits. New projects also struggle as new challenges of scale invalidate previous assumptions. This talk will discuss some patterns that we see at Databricks that companies leverage to succeed with their data science projects. Key takeaways will be:
- Striving for simplicity
- Removing cognitive load for you and your team
- Working with data, big and small
- Effectively leveraging the ecosystem of tools to be successful

Session hashtag: #SAISDS1
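One concrete way the ecosystem reduces cognitive load is through shared conventions, such as the scikit-learn-style fit/predict estimator interface. The sketch below is a hypothetical toy model (`MeanRegressor`), written in pure Python so it runs without any libraries installed; it illustrates that any model following the convention can be swapped into the same pipeline code unchanged.

```python
class MeanRegressor:
    """Toy model: predicts the mean of the training targets for every input."""

    def fit(self, X, y):
        # Learn a single statistic from the training targets.
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        # Return the learned mean for each input row.
        return [self.mean_ for _ in X]


def evaluate(model, X_train, y_train, X_test):
    # This pipeline relies only on the fit/predict convention, so a
    # scikit-learn estimator could be dropped in without changes.
    return model.fit(X_train, y_train).predict(X_test)


preds = evaluate(MeanRegressor(), [[1], [2], [3]], [10, 20, 30], [[4], [5]])
print(preds)  # [20.0, 20.0]
```

Standardizing on such interfaces is one practical form of "striving for simplicity": the pipeline code stays stable while the underlying tool evolves.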