Data integration is of key importance for Big Data analysis because it semantically combines data from multiple sources. Data integration approaches have to scale to high data volumes and many sources and have to provide high data quality. For sensitive, person-related data, a high degree of privacy should also be preserved. At ScaDS Dresden/Leipzig we address these challenges, in particular with new approaches for graph-based data analytics supported by the open-source system GRADOOP.
After an introduction, we discuss new approaches for graph-based and holistic data integration. In particular, we present FAMER (FAst Multi-source Entity Resolution), a parallel system to link and cluster entities from many sources, and give a small demo of our Graph-Analytics Tool. We further present scalable approaches for privacy-preserving record linkage (PPRL).
FAMER (FAst Multi-source Entity Resolution) is a new scalable framework for distributed multi-source entity resolution. While existing link discovery methods focus on finding binary links between pairs of sources, FAMER supports more holistic data integration by clustering equivalent entities from many sources. Such an approach is especially useful for constructing large knowledge graphs from many sources. FAMER constructs a so-called similarity graph for the entities of interest as the basis for clustering. It supports parallel versions of several clustering schemes, including a new approach called CLIP that favors so-called strong entity links. CLIP can also be used to repair clusters determined by other methods such as connected components or correlation clustering. FAMER is based on Apache Flink, and its parallel execution supports scalability to large data volumes.
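The idea of clustering over a similarity graph can be illustrated with a small single-machine sketch. It is not FAMER's implementation and not the CLIP algorithm itself: the toy links, the `source:id` naming scheme, and the definition of a "strong" link used here (each endpoint is the other's best match within the respective source) are assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical similarity graph: (entity_a, entity_b, similarity).
# The prefix before ":" names the source an entity comes from (assumed convention).
LINKS = [
    ("s1:a", "s2:a", 0.95),
    ("s2:a", "s3:a", 0.90),
    ("s1:a", "s3:a", 0.88),
    ("s1:a", "s3:b", 0.60),  # weaker link that bridges two real-world entities
    ("s3:b", "s2:b", 0.92),
]


def source_of(entity):
    """Return the source name encoded in the entity id."""
    return entity.split(":", 1)[0]


def connected_components(links, threshold):
    """Cluster entities by connected components over links with sim >= threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, sim in links:
        root_a, root_b = find(a), find(b)  # also registers both entities
        if sim >= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for entity in list(parent):
        clusters[find(entity)].add(entity)
    return sorted(sorted(members) for members in clusters.values())


def strong_links(links):
    """Keep links whose endpoints are each other's best match per source (assumed definition)."""
    best = {}  # (entity, other_source) -> (similarity, partner)
    for a, b, sim in links:
        for x, y in ((a, b), (b, a)):
            key = (x, source_of(y))
            if key not in best or sim > best[key][0]:
                best[key] = (sim, y)
    return [
        (a, b, sim)
        for a, b, sim in links
        if best[(a, source_of(b))][1] == b and best[(b, source_of(a))][1] == a
    ]


all_links_clusters = connected_components(LINKS, threshold=0.5)
strong_only_clusters = connected_components(strong_links(LINKS), threshold=0.5)
```

With the toy data, connected components over all links merges everything into one cluster that holds two entities from `s2` and two from `s3`, whereas restricting the graph to strong links yields two clusters with at most one entity per source. This is the kind of over-merged cluster that a repair step is meant to address.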
Privacy-Preserving Record Linkage (PPRL) addresses the problem of matching personal records across different databases without revealing any sensitive information. It allows the combination of data from different sources for improved data analysis and research while not sharing uncoded identifying information. The linkage of person-related records (e.g., patients in hospitals) is based on encoded values of quasi-identifiers (e.g., name, address). The data needed for analysis (e.g., health data) is separated from these quasi-identifiers and can be linked with the ID pairs resulting from the PPRL process.
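The separation of encoded quasi-identifiers from analysis data can be sketched as follows. The text above does not say which encoding is used; this sketch assumes Bloom filters over character bigrams with keyed hashing, a technique commonly described in the PPRL literature. All names, parameters (`size`, `num_hashes`, the 0.5 threshold), the shared secret, and the sample records are hypothetical.

```python
import hashlib
import hmac


def qgrams(value, q=2):
    """Split a padded, lower-cased string into its set of character q-grams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def encode(value, secret, size=256, num_hashes=4):
    """Encode a quasi-identifier as the set of bit positions of a Bloom filter."""
    bits = set()
    for gram in qgrams(value):
        for k in range(num_hashes):
            digest = hmac.new(secret, f"{k}:{gram}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return frozenset(bits)


def dice(a, b):
    """Dice coefficient of two bit-position sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def link(encoded_a, encoded_b, threshold=0.5):
    """Linkage-unit step: compares encodings only and returns matching ID pairs.

    Comparing every pair is quadratic in the number of records.
    """
    return [
        (id_a, id_b)
        for id_a, bits_a in encoded_a.items()
        for id_b, bits_b in encoded_b.items()
        if dice(bits_a, bits_b) >= threshold
    ]


# Assumption: the data owners share SECRET; the linkage unit never sees it.
SECRET = b"hypothetical-shared-key"

# Each data owner keeps quasi-identifiers and analysis data in separate tables.
names_a = {"a1": "Meier"}
names_b = {"b7": "Meyer", "b9": "Schmidt"}
health_a = {"a1": "diagnosis X"}
health_b = {"b7": "lab result Y", "b9": "lab result Z"}

encoded_a = {rid: encode(name, SECRET) for rid, name in names_a.items()}
encoded_b = {rid: encode(name, SECRET) for rid, name in names_b.items()}

id_pairs = link(encoded_a, encoded_b)

# The analysis data is joined via the ID pairs, without any names involved.
joined = [(health_a[id_a], health_b[id_b]) for id_a, id_b in id_pairs]
```

In this sketch the linkage step only ever receives the bit-position sets, and the health data is combined afterwards through the returned ID pairs. The spelling variants "Meier" and "Meyer" share most of their bigrams and therefore most of their bit positions, which is why an approximate comparison such as the Dice coefficient can tolerate typos that an exact hash comparison would miss.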
PPRL faces several challenges that must be solved to ensure its practical applicability. In particular, a high degree of privacy has to be ensured by suitable encoding of sensitive data and by organizational structures, such as the use of a trusted linkage unit. PPRL must achieve a high linkage quality by avoiding false or missing matches. Furthermore, high efficiency with fast linkage times and scalability to large data volumes is needed despite the inherent quadratic complexity of the problem. The talk will give an overview of our research results and plans for future work in this area.
Erhard Rahm is a full professor of databases at the computer science institute of the University of Leipzig, Germany. His current research focuses on Big Data and data integration. He has authored several books and more than 200 peer-reviewed journal and conference publications. His research on data integration and schema matching has received several awards, in particular the renowned 10-year best-paper award of the conference series VLDB (Very Large Databases) and the Influential Paper Award of the conference series ICDE (Int. Conf. on Data Engineering). Prof. Rahm is one of the two scientific coordinators of the new German center of excellence on Big Data, ScaDS (competence center for SCAlable Data services and Solutions) Dresden/Leipzig.
Dr. Eric Peukert coordinates the Service Center for Big Data at the University of Leipzig as part of ScaDS Dresden/Leipzig. He studied Computer Science and Media at the Dresden University of Technology and worked at SAP Research in the field of data integration and schema mapping within various BMBF and EU research projects. After completing his doctorate at the University of Leipzig and two more years with SAP, Dr. Peukert joined ScaDS. He coordinates the activities of the center in Leipzig with a special focus on industry contacts and collaborations. His research includes big data technologies, data integration and learning-based duplicate detection methods.
Alieh Saeedi is a PhD student in computer science at the University of Leipzig. She works in the database group under the supervision of Dr. Eric Peukert and Prof. Erhard Rahm. Her research focuses on Entity Resolution (ER) in big data from multiple sources.
Marcel Gladbach is a researcher and PhD student in the database group at the University of Leipzig. He received his M.Sc. from the Leipzig University of Applied Sciences (HTWK Leipzig) in 2017. His research focuses on Privacy-Preserving Record Linkage and its practical application within the Medical Informatics Initiative Germany.