Graph pattern matching is one of the most interesting and challenging operations in analytics. Uncovering patterns of relationships in real-world networks helps us reveal their inner structure and infer or predict their dynamic behavior. Query languages like Cypher, implemented in systems like Neo4j, SAP HANA Graph and RedisGraph, allow the intuitive definition of graph patterns, including structural and semantic predicates. At the moment, graph query languages are most prominent in graph database systems such as Neo4j. However, we think that distributed processing systems like Apache Spark can also benefit from having such a straightforward language in their toolbox. The Cypher query language is designed for the property graph data model, making it easy to analyze highly connected, semi-structured datasets in a natural, uncomplicated way. To bring the benefits of Cypher from the graph database realm into the world of Big Data, we at Neo4j are developing Cypher for Apache Spark (CAPS). The power of CAPS lies in combining distributed graph-based data integration and graph analytical query workloads in Spark with flexible operations that lift, integrate and store graphs from many different sources, such as Neo4j, SQL databases or HDFS.
In our talk, we will present our approach to translating a semi-structured, schema-optional graph data model to Spark DataFrames, and show how we compute graph pattern matches in a distributed dataflow system using relational operations. We will also give an overview of CAPS’ querying capabilities, which we’ll demonstrate using notebook examples.
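To give a flavor of what computing a pattern match with relational operations can look like, here is a minimal sketch in plain Python rather than Spark. It is not the CAPS implementation: the table layout (one node table, one relationship table), the column names and the helper functions `select`, `prefix` and `join` are all assumptions made for illustration. The sketch evaluates the Cypher pattern `MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name, b.name` as two selections followed by two equi-joins.

```python
# Hypothetical toy tables; this layout is an assumption, not CAPS' actual schema.
nodes = [
    {"id": 1, "label": "Person", "name": "Alice"},
    {"id": 2, "label": "Person", "name": "Bob"},
    {"id": 3, "label": "City", "name": "Leipzig"},
]
rels = [
    {"source": 1, "target": 2, "type": "KNOWS"},
    {"source": 2, "target": 1, "type": "KNOWS"},
    {"source": 1, "target": 3, "type": "LIVES_IN"},
]


def select(rows, **predicates):
    """Relational selection: keep rows whose columns equal the given values."""
    return [r for r in rows if all(r.get(k) == v for k, v in predicates.items())]


def prefix(rows, name):
    """Rename columns to 'name.column' so joined rows do not collide."""
    return [{f"{name}.{k}": v for k, v in r.items()} for r in rows]


def join(left, right, left_key, right_key):
    """Hash equi-join: index the right side, then probe it with each left row."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


def match_knows(nodes, rels):
    """Evaluate (a:Person)-[:KNOWS]->(b:Person) and return (a.name, b.name) pairs."""
    a = prefix(select(nodes, label="Person"), "a")
    b = prefix(select(nodes, label="Person"), "b")
    r = prefix(select(rels, type="KNOWS"), "r")
    a_r = join(a, r, "a.id", "r.source")        # bind the start node
    a_r_b = join(a_r, b, "r.target", "b.id")    # bind the end node
    return [(row["a.name"], row["b.name"]) for row in a_r_b]


matches = match_knows(nodes, rels)
```

In a distributed dataflow system the same plan shape would map to filter and join operators on DataFrames; the selections and joins above only illustrate that shape on in-memory lists.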
Martin Junghanns is part of the Cypher for Apache Spark Engineering team at Neo4j. Apart from that, he is finishing his PhD in Computer Science at the University of Leipzig. Martin is working on the Gradoop project with a focus on distributed graph analytics, graph data models and analytical DSLs.
Max Kießling is part of the Cypher for Apache Spark Engineering team at Neo4j. He recently finished his Master’s thesis at the University of Leipzig, in which he researched distributed pattern matching as part of the Gradoop project.