Spark has supported Python as a first-class language for a long time, which is valuable for data scientists who already work with Python on a single machine using tools such as pandas and scikit-learn: they can use the same language for both medium-data and big-data analysis. One of the fundamental issues with Python and its scientific libraries, however, is package management. Foundational tools such as NumPy, SciPy, and pandas are built on C/C++/Fortran libraries and extensions, which makes them harder to install than a pure Python library, and these packaging problems tend to scale with the number of nodes in a cluster. Anaconda is a free, open-source Python distribution that solves this packaging problem for scientific libraries and has become the de facto Python installation for data scientists and analysts.

In this talk, we will discuss different ways to use Anaconda with Spark, including an Anaconda parcel for Cloudera CDH clusters and other cluster management functionality for deploying Anaconda packages and their dependencies across multiple nodes.

Once a cluster is provisioned with the necessary Python libraries, a second issue remains: Spark's RDD abstraction does not integrate fully with the NumPy-based PyData ecosystem. We will discuss techniques that combine the best of both worlds: Spark's RDD abstraction, which gives us big data, and the Python scientific libraries that make the work faster and easier. The examples will attempt to cover multiple areas of data analysis: the canonical word count in pure Python, natural language processing with NLTK, machine learning with scikit-learn and TensorFlow, image analysis with SciPy and Numba, and deep learning on GPUs with Caffe.
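One technique in this "best of both worlds" spirit is per-partition vectorization: instead of applying NumPy record by record, materialize each RDD partition as an array and let NumPy process it in one vectorized call. The sketch below illustrates the idea without a Spark cluster, using plain Python lists to stand in for partitions; the function name is illustrative, but it is the kind of callable you would pass to PySpark's `rdd.mapPartitions`.

```python
import numpy as np

def standardize_partition(records):
    """Standardize one partition of numeric records.

    Written in the mapPartitions style: it receives an iterator over
    the records of a single partition and yields transformed records.
    """
    arr = np.fromiter(records, dtype=np.float64)  # materialize the partition
    yield from (arr - arr.mean()) / arr.std()     # vectorized NumPy, not a Python loop

# Plain lists stand in for the partitions of an RDD in this sketch.
partitions = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
result = [list(standardize_partition(p)) for p in partitions]
```

On a real cluster the same function would be used as `rdd.mapPartitions(standardize_partition)`, paying the Python-to-C crossing cost once per partition rather than once per record.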
Daniel Rodriguez is a Data Scientist at Continuum Analytics who works on the Cluster and DevOps teams to help customers in various industries deploy analytics engines. Daniel has 3+ years of experience working with data analytics tools in the Python and Hadoop ecosystems, ranging from optimizing computations on a single machine to deploying large production cluster environments.