Last week, we held a live webinar, Apache Spark MLlib 2.x: Migrating ML Workloads to DataFrames, to demonstrate how easily you can migrate your MLlib RDD-based workloads to the Spark 2.x DataFrame-based MLlib API, gaining the benefits of simpler APIs, better performance, and model persistence.
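To make the shift concrete, here is a minimal sketch of the same logistic-regression workload written against both APIs. It assumes an active notebook or shell session providing `sc` (SparkContext) and `spark` (SparkSession); the toy data and column names are illustrative, not taken from the webinar notebooks.

```python
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

# Old RDD-based API (spark.mllib): train directly on an RDD[LabeledPoint].
training_rdd = sc.parallelize([
    LabeledPoint(1.0, [0.0, 1.1, 0.1]),
    LabeledPoint(0.0, [2.0, 1.0, -1.0]),
])
old_model = LogisticRegressionWithLBFGS.train(training_rdd, iterations=10)

# New DataFrame-based API (spark.ml): fit an Estimator on a DataFrame with
# "label" and "features" columns, producing a reusable Transformer.
training_df = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
], ["label", "features"])
lr = LogisticRegression(maxIter=10)
new_model = lr.fit(training_df)
predictions = new_model.transform(training_df)  # appends prediction columns
```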

Mixing presentation and demonstration, we covered migrating workloads from RDDs to DataFrames, showed how ML persistence works across languages for saving and loading models, and shared the roadmap ahead.
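As a rough sketch of that cross-language persistence, the snippet below fits and saves a small pipeline from Python; the saved directory can then be reloaded from any Spark language. The pipeline stages, toy data, and path are illustrative assumptions, and `spark` is assumed to be an active SparkSession.

```python
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# A tiny illustrative training set with "text" and "label" columns.
training_df = spark.createDataFrame([
    ("spark is great", 1.0),
    ("hadoop map reduce", 0.0),
], ["text", "label"])

# Assemble and fit a small text-classification pipeline.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[tokenizer, hashing_tf, lr]).fit(training_df)

# Save the fitted pipeline; the on-disk format is language-independent.
model.write().overwrite().save("/tmp/spark-lr-model")

# Reload it later -- here from Python, but the same path also loads in
# Scala via org.apache.spark.ml.PipelineModel.load("/tmp/spark-lr-model").
same_model = PipelineModel.load("/tmp/spark-lr-model")
```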

The webinar is now accessible on demand: you can view it at will, download the presentation slides via the attachments tab, and access the two notebooks that demonstrate how to migrate your ML workloads from RDDs to DataFrames.

We also answered many of the questions raised by webinar attendees, listed below. If you have additional or related questions, check out the Databricks Forum or the new documentation resource.

Common Webinar Questions and Answers


  1. Can you please explain the distinction between (a) mllib vs. ml, and (b) DataFrames vs. Datasets?
  2. Most of the pipeline/DataFrame-based features (e.g., Tokenizer) were already available in 1.6. What are the key new things in MLlib 2.0?
  3. What is the difference between the old MLlib feature vectors and the new ML feature vectors? What does the conversion do? (See the conversion sketch after this list.)
  4. Do we need to consider data normalization in the new API, e.g., z-scores?
  5. Can we expect a k-Nearest-Neighbors implementation in Spark ML/MLlib any time soon?
  6. Could you show a few examples of using a parameter grid during model validation?
  7. Can I save a model from PySpark MLlib and load it in Scala?
  8. Will the UC Berkeley / edX / Databricks classes be modified to reflect these spark.ml changes?
  9. So spark.mllib is RDD-based and the new spark.ml is DataFrame-based?
  10. What kinds of ML algorithms will benefit from Tungsten/Catalyst once ML is ported onto Spark SQL: just memory-heavy ones, or communication-heavy ones too? E.g., we see Random Forest being up to 100 times slower than its local scikit-learn counterpart.
  11. ML persistence is nice, but do you run into naming collisions, i.e., variables/DataFrames in the current environment having the same name as those from a loaded pipeline?
  12. Can you share a quick example of sharing a pipeline among different languages? (A sketch appears earlier in this post.)
  13. Do the Databricks notebooks allow us to experiment with the streaming modeling algorithms? Did those change much from 1.6 to 2.x?
  14. Is a Random Forest package available in the SparkR API?
  15. What is the upper limit on the length of a feature vector, say for k-means? Is there a limit?
  16. Is the demo source code above available to the public?
  17. Will UDFs transforming Datasets be optimized by Catalyst or Tungsten?
  18. Is the Spark Dataset superseding DataFrames? Will ML run on Datasets?
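On the vector-conversion question (item 3 above): spark.mllib and spark.ml use distinct Vector classes, and Spark 2.x provides a utility to convert DataFrame columns from the old type to the new one. A minimal sketch, with an illustrative DataFrame and an assumed active SparkSession `spark`:

```python
from pyspark.mllib.linalg import Vectors as MLlibVectors
from pyspark.mllib.util import MLUtils

# A DataFrame whose "features" column holds old pyspark.mllib.linalg vectors.
df = spark.createDataFrame(
    [(0.0, MLlibVectors.dense([1.0, 2.0, 3.0]))],
    ["label", "features"])

# Convert the vector column to the new pyspark.ml.linalg type so the
# DataFrame can feed spark.ml estimators; only the column type changes,
# not the underlying values.
converted = MLUtils.convertVectorColumnsToML(df, "features")
```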

Read More

If you’d like to try Databricks, you can sign up for a free 14-day trial today.