Duplicate Detection and Linking in Big Data Analytics Workflows

Type of thesis: Master's thesis / Location: Leipzig / Status of thesis: Finished

Improving data quality through duplicate detection and data cleansing is an important pre-processing step before meaningful data analysis can be performed. The Dedoop system for large-scale duplicate detection was developed at the University of Leipzig and can efficiently identify large numbers of matching objects. Dedoop helps configure entity matching workflows through a GWT UI and transforms these workflows into MapReduce jobs (recently also Apache Flink jobs). Various load-balancing techniques for the similarity calculations are applied and implemented as several interlinked MapReduce jobs.
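
To make the idea concrete, the following is a minimal, single-machine sketch of blocking-based duplicate detection written in a map/reduce style. It is not Dedoop's implementation: the blocking key (first three letters of a name), the `difflib` similarity measure, the 0.8 threshold, and all function names are assumptions chosen purely for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three letters of the name, lowercased.
    return record["name"][:3].lower()


def map_phase(records):
    # "Map" step: group records by blocking key so that only records
    # sharing a key are compared with each other later.
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    return blocks


def reduce_phase(blocks, threshold=0.8):
    # "Reduce" step: compare all record pairs inside each block and keep
    # the pairs whose string similarity reaches the (assumed) threshold.
    matches = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
            if score >= threshold:
                matches.append((a["id"], b["id"]))
    return matches


records = [
    {"id": 1, "name": "Leipzig University"},
    {"id": 2, "name": "Leipzig Universty"},
    {"id": 3, "name": "Dresden University"},
]
matches = reduce_phase(map_phase(records))
print(matches)
```

In this sketch a very large block would dominate the runtime of the reduce step, which is the kind of skew that load-balancing techniques for similarity calculations are meant to address.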

One difficulty is integrating the duplicate detection tool into existing analytics and ETL workflows.

A number of workflow-based data integration and analytics platforms exist, such as SAP Data Services, RapidMiner, and the KNIME Analytics Platform. Recently, these platforms have begun to shift to Big Data execution environments for efficient processing.

In this thesis we would like to investigate how the deduplication and linking features of Dedoop could be integrated into the KNIME Analytics Platform. KNIME provides a number of extension points through which new nodes can be added to its visual workflow modeler. The goal is to integrate well with KNIME's Big Data extensions so that data can be read directly from Impala or Apache Hive and written to downstream Big Data stores.

Within the thesis the student should tackle the following tasks:

  • Provide an overview of existing ETL tools and their ability to perform efficient duplicate detection
  • Design and implement a prototypical deduplication operator in the KNIME Analytics Platform
  • Implement and evaluate a small use case that showcases the resulting KNIME extension

We offer close supervision by members of the Big Data Center ScaDS. In some cases we can offer student positions before or after the thesis for a deeper dive into the topic.

Contact

Dr. Eric Peukert

Administration Director

Department of Computer Science

Leipzig University
