MLLib Word2Vec is an unsupervised learning technique that can generate vectors of features that can then be clustered. But the weakness of unsupervised learning is that although it can say an apple is close to a banana, it can’t put the label of “fruit” on that group. We show how MLLib Word2Vec can be combined with the human-created data of YAGO2 (which is derived from the crowd-sourced Wikipedia metadata), along with the NLP metrics Levenshtein and Jaccard, to properly label categories. As an alternative to GraphX even though YAGO2 is a graph, we make use of Ankur Dave’s powerful IndexedRDD, which is slated for inclusion in Spark 1.3 or 1.4. IndexedRDD is also used in a second way: to further parallelize MLLib Word2Vec. The use case is labeling columns of unlabeled data uploaded to the Oracle Data Enrichment Cloud Service (ODECS) cloud app, which processes big data in the cloud.
Michael Malak is the lead author of Spark GraphX In Action and has been developing Spark solutions at two Fortune 200 companies since early 2013. He has been programming computers since before they could be bought pre-assembled in stores.