Duplicate Detection and Linking in Big Data Analytics Workflows
Type of thesis: Master's thesis / Location: Leipzig / Status of thesis: Finished theses
Improving data quality through duplicate detection and data cleansing is an important pre-processing step before meaningful data analysis can be performed. The Dedoop system for large-scale duplicate detection was developed at the University of Leipzig and can efficiently identify large numbers of matching objects. Dedoop helps users configure entity matching workflows through a GWT-based user interface and transforms these workflows into MapReduce jobs (recently also Apache Flink jobs). Various load-balancing techniques for the similarity calculations are applied and mapped onto several interlinked MapReduce jobs.
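To make the idea of an entity matching workflow expressed as map and reduce steps more concrete, here is a minimal single-process sketch in Python. It is not Dedoop's implementation: the blocking key (first three letters of the name), the `difflib`-based similarity, the 0.85 threshold, and the sample records are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three letters of the lower-cased name.
    return record["name"].strip().lower()[:3]


def map_phase(records):
    # "Map": emit (blocking key, record) pairs.
    for record in records:
        yield blocking_key(record), record


def shuffle(pairs):
    # Group records by key, as a MapReduce framework does between map and reduce.
    blocks = defaultdict(list)
    for key, record in pairs:
        blocks[key].append(record)
    return blocks


def reduce_phase(blocks, threshold=0.85):
    # "Reduce": compare every pair of records inside each block.
    matches = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
            if sim >= threshold:
                matches.append((a["id"], b["id"], round(sim, 2)))
    return matches


def find_duplicates(records, threshold=0.85):
    return reduce_phase(shuffle(map_phase(records)), threshold)


# Illustrative sample data (assumed, not from any real dataset).
records = [
    {"id": 1, "name": "Leipzig University"},
    {"id": 2, "name": "Leipzig Universty"},
    {"id": 3, "name": "TU Dresden"},
    {"id": 4, "name": "Leibniz Institute"},
]

if __name__ == "__main__":
    # Records 1 and 2 share a block and are reported as a likely duplicate pair.
    print(find_duplicates(records))
```

Because every pair inside a block is compared, the cost of a block grows quadratically with its size. A few very large blocks can therefore dominate the runtime of the reduce step, which is the kind of skew that load-balancing techniques aim to even out.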
One difficulty is integrating the duplicate detection tool into existing analytics and ETL workflows.
A number of workflow-based data integration and analytics platforms exist, such as SAP Data Services, RapidMiner, and the KNIME Analytics Platform. Recently, these platforms have begun to shift to Big Data execution environments for efficient processing.
In this thesis, we would like to investigate how the deduplication and linking features of Dedoop can be integrated into the KNIME Analytics Platform. KNIME provides a number of extension points through which new nodes can be added to its visual workflow modeler. The goal is to integrate well with KNIME's Big Data extensions so that data can be read directly from Impala or Apache Hive and written to downstream Big Data stores.
Within the thesis the student should tackle the following tasks:
We promise close supervision by members of the Big Data Center ScaDS. In some cases we could offer student positions before or after the thesis to dive into the topic.