Democratizing Distributed Compute and ML: A Tour of Three Frameworks - Databricks

Democratizing has become a bit of a buzzword, and why not? Institutions of all types and sizes are discovering that almost every role touches a bit of large-scale data analysis or data science, and sometimes more than just a bit! In this talk we’ll look at the patterns, strengths, and weaknesses of three different open-source tools, which all claim to make large-scale computation simpler, easier, and more accessible to more people. Our exploration will reveal not only major differences at the technical level, but also differences in culture, documentation, usability, open-source governance, and other areas. How easy are they to use, for real people in real organizations?

We’ll look at:

  • Apache Spark, a very well-established cluster computing tool suited to many kinds of work. Among its several language interfaces, Apache Spark offers Spark SQL, which lets the huge number of people who already know SQL work on big data.
  • Ray, a newer, multi-language framework from UC Berkeley’s RISELab. Ray focuses on simplifying the operational scaffolding beneath distributed task graphs and actor sets, and offers Python and Java interfaces (with more languages planned).
  • Dask, a Python-native library and part of the SciPy ecosystem dedicated to scaling popular tools like Pandas and NumPy to lots of cores, nodes, and even GPUs. Dask lets users apply their existing Python knowledge by, e.g., supporting scalable machine learning based on the scikit-learn API, and extends to arbitrary task graphs.
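For readers new to the term, here is a minimal sketch of what a "task graph" means in this context. It uses only the Python standard library and is not the API of Spark, Ray, or Dask. The dictionary layout (keys naming results, tuples of a function followed by its arguments) follows the style of Dask's low-level graph format, while `run_graph` is a hypothetical helper written for this illustration. It evaluates tasks sequentially in a single process, whereas the frameworks above schedule them across cores and nodes.

```python
from operator import add, mul


def run_graph(graph, key, cache=None):
    """Evaluate `key` in a dict-based task graph, computing dependencies first.

    Hypothetical helper for illustration only: it runs sequentially in one
    process, whereas a real scheduler would spread tasks over cores or nodes.
    """
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # A string argument that names another key in the graph is a dependency.
        resolved = [
            run_graph(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        result = func(*resolved)
    else:
        result = task  # plain data, not a task
    cache[key] = result
    return result


# Keys name results; tuples are (function, *arguments).
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),      # depends on "x" and "y"
    "scaled": (mul, "sum", 10),  # depends on "sum"
}

total = run_graph(graph, "scaled")
print(total)
```

Because each task declares its inputs by name, a scheduler can see which tasks are independent of one another and could run at the same time. That property is what distributed frameworks exploit.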

All of these projects focus in some way on ease of use, and all have expanded the abilities of normal humans to work with data at scale. But they are also each quite different. This talk will help you think about what’s easy, what’s hard, what life is like with these tools, and which ones may be right for your organization.



About Adam Breindel

Independent

Adam Breindel has over 15 years of success working with cutting-edge technology for small startups, as well as major players in the travel, media/entertainment, and financial industries. He has been teaching front- and back-end technology for more than 8 years. In addition to websites, GUI applications, and mobile device software, Adam has built streaming analytics for one of the world's largest banks, and produced a modern integration with a 1960s-vintage mainframe app for one of the world's largest airlines. His big data work has also included fraud modeling and scoring for debit card transactions. Adam focuses on designing and coding systems in a way that yields predictable results, leverages best practices and high-productivity tools, minimizes excess code, and is fun to do.