A few months ago, we held a live webinar – Apache Spark MLlib: From Quick Start to Scikit-Learn – that gave a quick primer on machine learning and Spark MLlib, along with an overview of some Spark machine learning use cases. It also walked through several MLlib quick-start demos and showed how common data science tools such as Python pandas, scikit-learn, and R integrate with MLlib.
The webinar is available on-demand, and its slides and sample notebooks can be downloaded as attachments. Join the Databricks Community Edition beta for free access to Spark and to try out the notebooks yourself.
Below, we answer the most common questions raised by webinar viewers. If you have additional questions, please check out the Databricks Forum.
Common webinar questions and answers
Click on a question to see its answer:
- Since Spark is distributed, how does MLlib combine the results from each node? Does it give the same result as a single node?
- With small datasets (around 1K rows), is it possible to get a response for a regression or decision-tree learning task in 1–2 seconds? What is the overhead of running Spark on a laptop (single node)?
- When cross-validation is used for model parameter selection, are both the folds and the model + parameter combinations distributed across the cluster?
- For the demo in the MLlib webinar, how are features derived from the words before linear regression?
- If one uses MLlib, does one still need Python or R?
- Which is best for ML: Python or R?
- In the MLlib webinar demo, were you using spark.ml-based regression? Are you planning to freeze the RDD-based spark.mllib algorithms?
- MLlib's SGD-based algorithms (linear models) can diverge unless the features are scaled; why is feature scaling not enabled by default, and why is there such sensitivity in the first place?
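The divergence mentioned in the last question is easy to demonstrate without Spark. The NumPy sketch below (an illustration of the general phenomenon, not the webinar's code or MLlib internals) runs plain batch gradient descent on a linear model with two features on very different scales: with raw features, a fixed step size overshoots along the large-scale direction and the weights blow up, while the same step size converges once the features are standardized — which is why feature scaling matters so much for gradient-based linear models.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two features on wildly different scales (think "age" vs. "income in cents").
X = np.column_stack([rng.uniform(0, 1, n), rng.uniform(0, 10_000, n)])
y = X @ np.array([2.0, 0.001]) + rng.normal(0, 0.1, n)

def gradient_descent(X, y, lr=0.1, steps=100):
    """Plain batch gradient descent on mean squared error."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = (2.0 / len(y)) * X.T @ (X @ w - y)
        w -= lr * grad
    return w

with np.errstate(over="ignore", invalid="ignore"):
    # Unscaled: the step size that suits the small feature is far too
    # large for the big one, so the weights overflow to inf/nan.
    w_raw = gradient_descent(X, y)

# Standardize each feature to zero mean and unit variance first;
# now all directions have comparable curvature and the same lr converges.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
w_scaled = gradient_descent(Xs, y)

print("unscaled weights finite?", np.isfinite(w_raw).all())
print("scaled weights finite?  ", np.isfinite(w_scaled).all())
```

In Spark, the analogous preprocessing step is a standardization transformer applied to the feature vectors before fitting the linear model, so that one learning rate works across all dimensions.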