Title: Privacy-Preserving Record Linkage
Duration: 2014 – today
Research Area: Responsible AI
Record linkage is an essential component of many data integration tasks that involve multiple data sources. It aims to detect records that refer to the same real-world entity, such as a person. Global identifiers are typically lacking, so linkage can only be achieved by comparing available quasi-identifiers, such as name, address, or date of birth. However, data owners are often only willing or allowed to provide their data for such integration if sensitive information is sufficiently protected to ensure the privacy of individuals, such as patients or customers. For example, in medical research, data from several sources (e.g., hospitals) has to be matched to investigate possible correlations between diseases of the same patients without revealing the patients' identities.
Privacy-Preserving Record Linkage (PPRL) addresses this problem and thereby enables the combination of sensitive data from different sources for improved data analysis and research.
The aim of this project is to study existing methods and develop new methods for PPRL that allow records to be matched while preserving their privacy. For this purpose, the linkage of person-related records is based on encoded values of the quasi-identifiers, and the data needed for analysis (e.g., health data) is kept separate from these quasi-identifiers. The relevant data can then be provided to a researcher without the identifying data.
PPRL faces several challenges that must be solved to ensure its practical applicability. In particular, a high degree of privacy has to be ensured through suitable encoding of sensitive data and through organizational structures, such as the use of a trusted linkage unit. PPRL must also achieve high linkage quality by avoiding false matches and missed matches. Furthermore, high efficiency is needed, with fast linkage times and scalability to large data volumes.
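Linkage quality in terms of false and missed matches is commonly quantified with precision, recall, and F1 over matched record pairs. The following is a minimal sketch of that calculation, assuming a ground truth of true matches is available; the function name `linkage_quality` and the record ids are hypothetical and are not taken from PRIMAT.

```python
def linkage_quality(predicted_pairs, true_pairs):
    """Return (precision, recall, f1) for two collections of matched id pairs.

    Pairs are treated as unordered, so ("a1", "b1") equals ("b1", "a1").
    """
    predicted = {frozenset(pair) for pair in predicted_pairs}
    true = {frozenset(pair) for pair in true_pairs}

    true_positives = len(predicted & true)
    # Low precision means many false matches; low recall means many missed matches.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true) if true else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: three predicted matches, four true matches.
predicted_matches = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]
true_matches = [("a1", "b1"), ("b2", "a2"), ("a4", "b4"), ("a5", "b5")]

precision, recall, f1 = linkage_quality(predicted_matches, true_matches)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy example, one predicted pair is a false match and two true pairs are missed, which lowers precision and recall respectively.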
PPRL can be applied in many areas, such as public health, demographic studies, and marketing analysis. We therefore developed PRIMAT, an open-source toolbox for the flexible definition and execution of PPRL workflows. It offers modules for data owners and the linkage unit that provide state-of-the-art PPRL methods, including various encoding and hardening techniques, LSH-based blocking (locality-sensitive hashing), post-processing (clustering), and more.
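To illustrate the idea behind LSH-based blocking, the sketch below shows one common variant, Hamming LSH, applied to records that are assumed to be already encoded as bit vectors (represented here as sets of 1-bit positions). It is an illustration under these assumptions, not PRIMAT's implementation; the function name `hamming_lsh_candidates` and all parameter values are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def hamming_lsh_candidates(records, size, num_tables=8, key_length=6, seed=42):
    """Return candidate pairs of record ids that share at least one LSH bucket.

    records: dict mapping record id -> set of positions whose bit is 1.
    size: length of the bit vectors.
    """
    rng = random.Random(seed)
    candidates = set()
    for _ in range(num_tables):
        # Each table samples a few bit positions; their values form the blocking key.
        positions = rng.sample(range(size), key_length)
        buckets = defaultdict(list)
        for record_id, bits in records.items():
            key = tuple(position in bits for position in positions)
            buckets[key].append(record_id)
        # Only records that fall into the same bucket are compared later.
        for ids in buckets.values():
            for left, right in combinations(sorted(ids), 2):
                candidates.add((left, right))
    return candidates


# Hypothetical toy data: r1 and r2 differ in a single bit, r3 is very different.
toy_records = {
    "r1": set(range(0, 32)),
    "r2": set(range(0, 31)),
    "r3": set(range(32, 64)),
}
print(sorted(hamming_lsh_candidates(toy_records, size=64)))
```

Similar bit vectors agree on most sampled positions and therefore tend to collide in at least one table, while dissimilar ones rarely do. More tables raise the chance of finding true matches, and longer keys reduce the number of candidate pairs to compare.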
We mainly focus on Bloom-filter-based encodings, which have been shown to allow very efficient linkage of large databases while providing sufficient privacy protection. PRIMAT is implemented in Java and can also be used via Dockerized, Spring Boot-based web services.
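As a rough illustration of how a Bloom-filter-based encoding can work, the sketch below splits a quasi-identifier into character bigrams, maps each bigram to several bit positions using a keyed hash (HMAC-SHA256), and compares two encodings with the Dice coefficient. This is a minimal sketch under assumed parameters, not PRIMAT's API; the function names, the filter size of 1024 bits, the 10 hash functions, and the secret key are all hypothetical.

```python
import hashlib
import hmac


def qgrams(value, q=2):
    """Split a string into overlapping, padded character q-grams."""
    padded = "_" * (q - 1) + value.lower() + "_" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def bloom_encode(value, secret_key, size=1024, num_hashes=10):
    """Encode a value as the set of Bloom filter positions that are set to 1."""
    bits = set()
    for gram in qgrams(value):
        for i in range(num_hashes):
            # A keyed hash means the mapping cannot be recomputed without the key.
            message = f"{i}:{gram}".encode("utf-8")
            digest = hmac.new(secret_key, message, hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice_similarity(bits_a, bits_b):
    """Dice coefficient of two bit sets; 1.0 means identical encodings."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


# Hypothetical key that the data owners share but the linkage unit does not know.
key = b"shared-secret-of-data-owners"

smith = bloom_encode("Smith", key)
smyth = bloom_encode("Smyth", key)
jones = bloom_encode("Jones", key)

print(dice_similarity(smith, smyth))  # similar names share most bigrams
print(dice_similarity(smith, jones))  # unrelated names share almost no bits
```

Because similar strings share most of their bigrams, their encodings overlap in most bit positions, so the linkage unit can compute approximate similarities on the encodings alone.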
In future work, we will focus on developing techniques that enable reliably high-quality linkage results across varying datasets and that provide data custodians with performance indicators. We expect these to be essential for further real-world applications of PPRL.