The analysis and integration of very large network datasets is becoming increasingly valuable, for example for gaining insights from logistics, business processes, and social networks. Representing network data as a graph makes it possible to analyze complex relationships between heterogeneous data objects. In this talk, we give a technical overview of Gradoop, an open-source system based on Apache Flink that provides scalable distributed algorithms for integrating and analyzing graph data. The Gradoop framework allows data scientists and analysts to express complex graph analysis tasks as simple and intuitive analytical workflows.
In the second part of the talk, we introduce the basics of entity resolution (ER) in graphs. We present FAMER, a system for data integration, and the clustering approaches it applies to link records in a graph and identify duplicates. FAMER is also implemented on top of Gradoop and supports many kinds of multi-source ER tasks.