Matching Patterns and Constructing Graphs with Cypher for Apache Spark

Graph pattern matching is one of the most interesting and challenging operations in analytics. Uncovering patterns of relationships in real-world networks helps us reveal their inner structures and infer or predict their dynamic behavior. The Cypher graph query language was originally designed for transactional graph databases like Neo4j.

Spark developers and analysts can benefit from having this straightforward language available for analytic and data wrangling workloads. Cypher targets the property graph data model, making it easy to analyze highly connected datasets in a natural, uncomplicated way. Using a composable, declarative language for graphs reduces program complexity and allows complex data transformations. Under the umbrella of the openCypher project, Cypher is the first industrial language to provide composable property graph querying with multiple named graphs.

Graph construction is new in Cypher and a critical feature for the Spark world of immutable datasets and function chains. Neo4j initiated the Apache-licensed OSS project, Cypher for Apache Spark (CAPS), joining other Cypher language implementations like Neo4j, SAP HANA Graph, RedisGraph, Agens Graph and the OSS Cypher for Gremlin project. The language allows the intuitive definition of graph patterns, including structural and semantic predicates. Cypher for Apache Spark is a graph mirror of SparkSQL, with a graph catalog, graph data sources, graph schemas, graph operation functions, and textual Cypher queries.

Graph querying and SQL querying can be interwoven at will, as Cypher can project both graphs and tables, and can take tables as driving inputs to a query. We’ll explain the importance of graphs and Cypher within Big Data applications and the main challenges of implementing a schema-flexible data model and graph-specific operators, e.g. for path computation, using DataFrames.
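
To give a feel for how a graph pattern with a structural part and a semantic predicate can be evaluated over tables, here is a minimal Python sketch. It is not CAPS code and does not use Spark: the node and relationship "tables" are plain lists of dicts standing in for DataFrames, and the names `nodes`, `rels`, and `match_knows` are hypothetical. The Cypher query in the docstring is the pattern being approximated.

```python
# Hypothetical toy data: one node table and one relationship table,
# standing in for the DataFrames a Spark-based engine might use.
nodes = [
    {"id": 1, "label": "Person", "name": "Alice", "age": 42},
    {"id": 2, "label": "Person", "name": "Bob", "age": 27},
    {"id": 3, "label": "Person", "name": "Carol", "age": 35},
]
rels = [
    {"source": 1, "target": 2, "type": "KNOWS"},
    {"source": 2, "target": 3, "type": "KNOWS"},
    {"source": 3, "target": 1, "type": "KNOWS"},
]


def match_knows(nodes, rels, min_age):
    """Approximates the Cypher query:

        MATCH (a:Person)-[:KNOWS]->(b:Person)
        WHERE a.age > $min_age
        RETURN a.name, b.name
    """
    # Index Person nodes by id (the label check is part of the structural pattern).
    persons = {n["id"]: n for n in nodes if n["label"] == "Person"}
    rows = []
    for r in rels:
        if r["type"] != "KNOWS":
            continue
        # Join the relationship row with its source and target node rows.
        a = persons.get(r["source"])
        b = persons.get(r["target"])
        if a is None or b is None:
            continue
        # Semantic predicate on a node property.
        if a["age"] > min_age:
            rows.append((a["name"], b["name"]))
    return rows


result = match_knows(nodes, rels, 30)
print(result)
```

The result is itself a table of rows, which is why graph queries of this kind can be chained with ordinary tabular (SQL-style) processing.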

Takeaways:

Intro to the Cypher graph query language

Understand the benefits of graph-based data integration and analytics

Insights into Cypher for Apache Spark and how it parallels SparkSQL

Session hashtag: #SAISDev8



About Martin Junghanns

Martin Junghanns is part of the Cypher for Apache Spark Engineering team at Neo4j. He is also the main developer of Gradoop, a system for graph analytics on distributed data flow systems. Martin holds an MSc in Computer Science from the University of Leipzig.