Linking and Duplicate Detection in Big Graphs – Similarity Measures

Type of thesis: Bachelor's thesis / Location: Leipzig / Status of thesis: Finished theses

The graph-based storage and processing of large amounts of data is becoming increasingly important. In our work we encounter large networks of interactions between genes, proteins and processes in the life sciences, chemical compounds and their reactions in chemistry, or information graphs in the business domain. A particularly prominent example is Facebook, which offers its users access to the information in its social network through a graph search.

In a current project at the University of Leipzig, a novel graph-processing platform (GRADOOP) is being developed that simplifies the entire process of creating, processing and analyzing a graph with the help of standardized operators and workflows. These workflows are then executed efficiently and in a distributed manner using Apache Flink.

An important initial step is the creation of graphs by linking various data sources and improving data quality through duplicate detection and data cleansing. A prototypical version of Gradoop already contains operators for calculating object similarities as well as load-balancing features. Existing similarity measures are mainly based on the individual properties of the objects being compared. However, their neighbors and the neighbors of their neighbors could be a good indicator of the similarity of two objects.
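
To make the idea concrete, the following is a minimal sketch of one possible neighborhood-based measure: the Jaccard overlap of the vertices reachable within a given number of hops. It is plain Python rather than Gradoop or Apache Flink code, and the function names, the adjacency-set graph representation and the small publication graph are assumptions chosen purely for illustration; the thesis may well arrive at a different measure.

```python
from typing import Dict, Set

# Hypothetical representation: undirected graph as vertex id -> set of neighbor ids.
Graph = Dict[str, Set[str]]


def neighbors_within(graph: Graph, vertex: str, hops: int) -> Set[str]:
    """Return all vertices reachable from `vertex` in at most `hops` steps,
    excluding `vertex` itself."""
    seen = {vertex}
    frontier = {vertex}
    for _ in range(hops):
        # Expand the frontier by one hop and drop vertices already visited.
        frontier = {n for v in frontier for n in graph.get(v, set())} - seen
        seen |= frontier
    return seen - {vertex}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient |a & b| / |a | b|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def neighborhood_similarity(graph: Graph, u: str, v: str, hops: int = 1) -> float:
    """Similarity of u and v based only on their neighborhoods.
    The two compared vertices are removed from each other's neighborhood."""
    return jaccard(
        neighbors_within(graph, u, hops) - {v},
        neighbors_within(graph, v, hops) - {u},
    )


# Illustrative data (made up): two author vertices a1 and a2, three
# publications p1..p3 and one conference c1.
graph: Graph = {
    "a1": {"p1", "p2"},
    "a2": {"p2", "p3"},
    "p1": {"a1", "c1"},
    "p2": {"a1", "a2", "c1"},
    "p3": {"a2", "c1"},
    "c1": {"p1", "p2", "p3"},
}

one_hop = neighborhood_similarity(graph, "a1", "a2", hops=1)  # 1/3: only p2 is shared
two_hop = neighborhood_similarity(graph, "a1", "a2", hops=2)  # 0.5: p2 and c1 are shared
```

In this toy graph the two author vertices look more similar once neighbors of neighbors are included, because both publish at the same conference. In practice such a score would probably be combined with the existing property-based similarity rather than replace it.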

In this Master's thesis we would like to investigate graph-based similarity measures and how they could be efficiently implemented on top of Gradoop and Apache Flink.

The work includes the following subtasks:

  • Overview of related work on graph-based similarity measures
  • Concept for new graph-based similarity measures that take neighborhood information into account
  • Prototypical implementation on top of Gradoop and Apache Flink
  • Evaluation of the developed concepts on several datasets, such as a dataset of publications, conferences and authors


We promise close supervision by members of the Big Data Center ScaDS. In some cases we can offer student positions before or after the thesis to dive deeper into the topic.



Eric Peukert

Administration Director

Department of computer science

Universität Leipzig