Fabian Panse and Peter Christen (27.08.2025)

ScaDS.AI Dresden/Leipzig welcomes you to join our public colloquium session on Wednesday, 27.08.2025, 2:00-4:00 p.m. The speakers are Prof. Fabian Panse and Prof. Peter Christen. The colloquium takes place in the seminar room “Zwenkauer See” at ScaDS.AI Dresden/Leipzig (details below) and in parallel online (link to Zoom session).

Prof. Dr. Fabian Panse


Evaluating Matching Algorithms in Data Integration:
Current Methods and Open Challenges

Evaluation is a crucial part of research, especially when developing new matching algorithms in the context of data integration. Whether it involves string matching, schema matching, or entity matching, meaningful evaluation requires applying these algorithms to suitable matching scenarios, using accurately and appropriately modeled ground truths, and selecting relevant evaluation metrics. In this talk, we will examine current evaluation practices in the research community, highlight common sources of error, pitfalls, and weaknesses, identify open challenges, and discuss current research approaches in this area.

About Fabian Panse

Born in North Hesse, Fabian Panse studied computer science in Clausthal-Zellerfeld and Braunschweig. He then completed his PhD at the University of Hamburg. His doctoral thesis focused on the detection of duplicates in probabilistic relational databases. After completing his PhD, Mr. Panse remained in Hamburg for several years, first as a postdoctoral researcher and then as a substitute professor, before moving to the Hasso Plattner Institute in Potsdam in April 2023 as a postdoctoral researcher. Since October 2024, he has been a Professor of Data Engineering at the University of Augsburg. His research interests include data quality, entity matching, schema matching, data integration, benchmarking, and the synthesis of tabular data.


Prof. Dr. Peter Christen


In my presentation I will cover two topics, both discussing new and ongoing research in the context of record linkage (also known as entity resolution and data matching), the challenging task of identifying records that refer to the same entity within and across databases. The first talk discusses aspects of how to measure record linkage quality. The second talk covers data privacy and shows that even when privacy-preserving record linkage techniques are used some sensitive information can be leaked during record linkage protocols.

Consistently Evaluating Record Linkage Classification Results

Record linkage is commonly viewed as the problem of classifying record pairs into matches and non-matches. In situations where ground truth data are available, performance measures such as precision, recall, the F-measure, sensitivity, and specificity are commonly used to evaluate the quality of matches obtained with a trained record linkage classifier. Comparing multiple classifiers using such measures can, however, lead to inconsistent evaluation, because for some measures the same result can be obtained from different classification outcomes.
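The ambiguity described here can be made concrete with a small sketch (this illustrates the general problem, not the CRL-measure itself, and the confusion-matrix counts are invented for the example): on the same ground truth of 100 true matches, two classifiers with quite different error trade-offs can receive an identical F-measure.

```python
def f_measure(tp, fp, fn):
    """F-measure (F1): harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Two hypothetical classifiers evaluated on the same 100 true matches:
# Classifier A makes balanced errors (precision 0.80, recall 0.80).
a = f_measure(tp=80, fp=20, fn=20)
# Classifier B trades precision for recall (precision 0.72, recall 0.90).
b = f_measure(tp=90, fp=35, fn=10)

print(a, b)  # both evaluate to 0.8 despite different classification outcomes
```

Selecting between A and B by F-measure alone is therefore arbitrary, even though the two classifiers would produce noticeably different linked data sets.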

This can cause a suboptimal classifier to be selected and potentially result in linked data sets of poor quality. To overcome this problem, we propose the Consistent Record Linkage (CRL) measure, an application-focused evaluation method that ensures record linkage classifiers are assessed in a fair and transparent way. In this presentation we introduce the CRL-measure and, using both synthetic and real data sets, show how it can provide more detailed information about the performance of record linkage classification results compared to traditional performance measures.

Information Leakage in Record Linkage Protocols

Linking databases that contain sensitive personal data across organisations is an increasingly important requirement in the health and social sciences, as well as for governments and businesses. To protect personal data, protocols have been developed to limit the leakage of sensitive information, while privacy-preserving record linkage (PPRL) techniques allow record linkage to be conducted on encoded data without the need to share sensitive plaintext data across organisations. While PPRL techniques are now being employed in real-world applications, the focus of PPRL research has been on the technical aspects of linking sensitive data (such as encoding methods and cryptanalysis attacks), but not on the organisational challenges of employing such techniques in practice.
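The abstract mentions encoding methods without naming one; a widely used choice in the PPRL literature is Bloom-filter encoding of character q-grams, which the following sketch illustrates (the hash construction, filter size, and names are illustrative assumptions, not part of any specific protocol discussed in the talk). Similar names produce similar bit patterns, so linkage can be performed on encodings without exchanging plaintext:

```python
import hashlib

def bigrams(name):
    """Split a padded, lower-cased string into overlapping character 2-grams."""
    s = f"_{name.lower()}_"
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bloom_encode(name, size=64, num_hashes=3):
    """Map a name's bigrams into a fixed-size Bloom filter bit array."""
    bits = [0] * size
    for gram in bigrams(name):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode()).digest()
            bits[int.from_bytes(digest, "big") % size] = 1
    return bits

def dice_similarity(a, b):
    """Dice coefficient of two bit arrays, a common PPRL similarity measure."""
    common = sum(x & y for x, y in zip(a, b))
    return 2 * common / (sum(a) + sum(b))

# Spelling variants of the same name score higher than unrelated names:
print(dice_similarity(bloom_encode("christen"), bloom_encode("kristen")))
print(dice_similarity(bloom_encode("christen"), bloom_encode("panse")))
```

At the same time, the sketch hints at the leakage problem the talk addresses: the bit patterns preserve similarity structure, and frequency analysis of such encodings is exactly what cryptanalysis attacks on PPRL exploit.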

We analyse what sensitive information can possibly leak, either unintentionally or intentionally, in traditional record linkage as well as PPRL protocols, and what a party that participates in such a protocol can learn from the data it obtains legitimately within the protocol. We also show that PPRL protocols can still result in the unintentional leakage of sensitive information.

About Peter Christen

Peter Christen is a world-leading expert in record linkage with over 20 years of experience working with administrative data. He has over 200 publications in the area of data science, including the two books “Data Matching” (2012) and “Linking Sensitive Data” (co-authored with Thilina Ranbaduge and Rainer Schnell, 2020). Peter is an award-winning university lecturer who has been teaching in the area of data science since 2002. He has developed multiple large courses on topics such as data mining and data wrangling, and has given workshops and tutorials on record linkage and data quality since 2008.


Location

ScaDS.AI Dresden/Leipzig
Löhrs Carré, Humboldtstrasse 25, 04105 Leipzig
3rd floor, large seminar room (A 03.07 “Zwenkauer See”)

funded by:
Funded by the Federal Ministry of Education and Research (Bundesministerium für Bildung und Forschung).
Funded by the Free State of Saxony (Freistaat Sachsen).