Michael Malak, Oracle

Michael Malak is the lead author of Spark GraphX In Action and has been developing Spark solutions at two Fortune 200 companies since early 2013. He has been programming computers since before they could be bought pre-assembled in stores.

SESSIONS

Neuro-Symbolic AI for Sentiment Analysis

Learn to supercharge sentiment analysis with neural networks and graphs. Neural networks are great at automated black-box pattern recognition, graphs at encoding human-readable logic. Neuro-symbolic computing promises to leverage the best of both. In this session, you will see how to combine an off-the-shelf neuro-symbolic algorithm, word2vec, with a neural network (Convolutional Neural Network, or CNN) and a symbolic graph, both added to the neuro-symbolic pipeline. The result is an all-Apache Spark text sentiment analysis more accurate than either neural alone or symbolic alone. Although the presentation will be highly technical, high-level concepts and data flows will be highlighted and visually explained for the more casual attendees. Technologies used: MLlib, GraphX, and mCNN (from spark-packages.org). Session hashtag: #SFr12

Extending Word2Vec for Performance and Semi-Supervised Learning

MLlib Word2Vec is an unsupervised learning technique that generates feature vectors that can then be clustered. The weakness of unsupervised learning is that although it can say an apple is close to a banana, it can’t put the label of “fruit” on that group. We show how MLlib Word2Vec can be combined with the human-created data of YAGO2 (which is derived from the crowd-sourced Wikipedia metadata), along with the NLP metrics Levenshtein and Jaccard, to properly label categories. Although YAGO2 is a graph, instead of GraphX we make use of Ankur Dave’s powerful IndexedRDD, which is slated for inclusion in Spark 1.3 or 1.4. IndexedRDD is also used in a second way: to further parallelize MLlib Word2Vec. The use case is labeling columns of unlabeled data uploaded to the Oracle Data Enrichment Cloud Service (ODECS) cloud app, which processes big data in the cloud.
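
For readers unfamiliar with the two NLP metrics named in this abstract, here is a minimal plain-Python sketch of Levenshtein distance and Jaccard similarity. It is only an illustration of the metrics themselves, not the session's Spark code; the function names `levenshtein` and `jaccard` and the sample strings are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical sample values, chosen only to exercise the functions.
    print(levenshtein("kitten", "sitting"))
    print(jaccard({"apple", "banana"}, {"banana", "cherry"}))
```

Levenshtein compares strings character by character, so it suits near-duplicate spellings; Jaccard compares sets, so it suits overlap between, say, the tokens of a column and the members of a candidate category.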

Finding Graph Isomorphisms In GraphX And GraphFrames

Identifying graph isomorphisms is one of the most powerful graph techniques and has a wide variety of applications. In this presentation, you'll see how to find simple graph isomorphisms in GraphX, and how the exciting new GraphFrames from AMPLab, intended for inclusion in Spark 2.x, allows the use of SQL and a subset of Cypher (the query language from Neo4j) to find more complex graph isomorphisms. Applications covered include fraud detection and finding missing data from Wikipedia (using the YAGO3 data set), which is a form of graph mining. Because GraphFrames is so new, the session also gives a brief overview of it, its performance gains over GraphX due to Catalyst and Tungsten, and how to use it to query graphs using SQL and the Cypher subset.
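
As a rough illustration of the idea behind finding missing data with a graph pattern, the plain-Python sketch below searches a small directed graph for one fixed pattern: edges a→b and b→c exist, but a→c does not. It is not the GraphX or GraphFrames API and not the session's method; the function name `find_open_triads` and the sample edges are assumptions made for this example.

```python
from collections import defaultdict


def find_open_triads(edges):
    """Return sorted (a, b, c) triples where a->b and b->c are present
    but a->c is absent, i.e. candidates for a 'missing' edge."""
    edge_set = set(edges)
    successors = defaultdict(set)
    for src, dst in edge_set:
        successors[src].add(dst)

    matches = []
    for a, b in edge_set:
        if a == b:
            continue  # ignore self-loops
        for c in successors.get(b, ()):
            # Require three distinct vertices and no direct a->c edge.
            if c != a and c != b and (a, c) not in edge_set:
                matches.append((a, b, c))
    return sorted(matches)


if __name__ == "__main__":
    # Hypothetical toy graph.
    sample = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
    for triad in find_open_triads(sample):
        print(triad)
```

Matching one small fixed pattern like this is the simplest case of subgraph matching; a query language is what makes larger patterns practical to express.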