Last week, we held a live webinar, "GraphFrames: DataFrame-based graphs for Apache Spark," to give an overview, a live demo, and a discussion of the design decisions and future plans for the new GraphFrames library. The webinar included content for people just getting started with Apache Spark, as well as for seasoned experts. It started with a recap of the major improvements over GraphX and pointers to resources for getting started. A running example of analyzing flight delays illustrated the range of GraphFrame functionality: simple SQL and graph queries, motif finding, and powerful graph algorithms. For the experts, the talk included a few technical details on design decisions, the current implementation, and ongoing work on speed and performance optimizations.
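For readers new to motif finding, the idea can be sketched in plain Python. This is only an illustration of what a round-trip pattern matches, not how GraphFrames implements it: the flight list, the delay values, and the `round_trips` helper are all made up for this example. In GraphFrames itself the same pattern is written as a motif string passed to `find()`, as noted in the comment.

```python
# Plain-Python sketch of a round-trip motif; airports and delays are hypothetical.
# In GraphFrames the same pattern is expressed as a motif string:
#   g.find("(a)-[e]->(b); (b)-[e2]->(a)")

# Hypothetical edge list: (src, dst, delay_minutes)
flights = [
    ("SFO", "SEA", 12),
    ("SEA", "SFO", -3),
    ("SFO", "JFK", 45),
    ("JFK", "BOS", 5),
]


def round_trips(edges):
    """Return every (e, e2) pair where e goes a -> b and e2 goes b -> a."""
    # Index edges by source vertex so return legs can be looked up directly.
    by_src = {}
    for edge in edges:
        by_src.setdefault(edge[0], []).append(edge)

    matches = []
    for e in edges:
        a, b = e[0], e[1]
        for e2 in by_src.get(b, []):
            if e2[1] == a:
                matches.append((e, e2))
    return matches


matches = round_trips(flights)
for e, e2 in matches:
    print(e, e2)
```

Each round trip appears twice, once per choice of starting airport, because the pattern binds `a` and `b` in both orders.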
The webinar is now available on demand, and its slides and sample notebooks can be downloaded as attachments to the webinar. Join the Databricks Community Edition beta to get free access to Spark and try out the notebooks.
We have answered the common questions raised by webinar viewers below. If you have additional questions, please check out the Databricks Forum.
Common webinar questions and answers
Click on a question to see its answer:
- Can GraphFrames handle multiple types of relationships (or edges), each with its own set of properties? Would they all go into a single DataFrame as input?
- Are there integration plans for GraphFrames with the MLlib pipeline API so that we can leverage existing cross-validation/hyperparameter optimization for graph algorithms?
- With GraphFrames, is there a way to incrementally build graphs, either with an API, e.g. addVertex(), addEdge(), or by loading data from multiple files one by one?
- I tried GraphFrames for connected components in a graph with 3.7M vertices and 2.1M edges. However, I ran into performance/scalability issues. Could you give some details about the underlying algorithm and its algorithmic complexity?
- With GraphFrames, can you create graphs from an adjacency matrix or from crosstab(col1, col2), which computes a pair-wise frequency table of the given columns?
- With GraphFrames, are there ways of dealing with multiple types of vertices in the same data set? For example, edges that span two different DataFrames with different metadata, with each underlying DataFrame treated as a typed vertex?