
June 23, 2025
How do we know whether two data records from different sources refer to the same real-world item – such as a laptop listed under multiple names on several e-commerce sites? This question lies at the core of the PhD thesis Entity Resolution on Heterogeneous Knowledge Graphs by Daniel Obraczka. The recently completed thesis investigates how graph structure and machine learning can be combined to address one of the most fundamental problems in data integration.
In today’s interconnected digital landscape, integrating data from multiple, disparate sources is a challenging task. Knowledge graphs have proven to be among the most powerful tools of recent years for this purpose: by encoding entities and their relationships in a semantically rich way, they can represent complex structured information. However, a difficult challenge arises when multiple knowledge graphs are combined, each potentially using different structures, vocabularies, and assumptions: how can we quickly and reliably determine which entities refer to the same real-world object? This task is called entity resolution.
A concrete example is a price comparison platform for laptops that aggregates product data from multiple online retailers. One retailer may offer a model as Acer Aspire E1 Series, while another may list it as Acer Laptop E1-572-6459, each with different formats, features, or descriptions. Accurately recognizing that these entries refer to the same model is essential, but far from trivial: the information in each retailer’s knowledge graph may be structured so differently that the entries appear to describe entirely distinct entities. This is the core problem of entity resolution, and its complexity increases when the data is organized as a graph, where entities are linked to rich contextual information such as brands, components, or reviews.
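To make this concrete, the following sketch shows one very simple attribute-based signal for the two listings above: the token overlap of the normalized product titles. It is purely illustrative; the tokenization and the idea of feeding the score into a threshold or classifier are assumptions made for this example, not methods taken from the thesis.

```python
# Purely illustrative sketch (not code from the thesis): a token-based
# attribute similarity between two product titles.

def tokens(title: str) -> set[str]:
    """Lowercase a product title and split it into comparable tokens."""
    return set(title.lower().replace("-", " ").split())

def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the token sets of two titles, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

score = jaccard("Acer Aspire E1 Series", "Acer Laptop E1-572-6459")
print(f"token overlap: {score:.2f}")  # ≈ 0.29 for this pair
# In practice such a score would feed a threshold or a trained classifier,
# together with further signals from the graph (brand, components, reviews).
```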
Daniel Obraczka’s thesis presents several approaches that improve entity resolution in heterogeneous knowledge graphs by combining structural and attribute-based signals. To this end, he developed a number of methods and frameworks.
Every framework and method is validated with rigorous Bayesian statistical testing and released as an open-source library.
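To give an intuition of what combining structural and attribute-based signals can look like, here is a deliberately simplified sketch: a graph-based similarity (the cosine similarity of hypothetical neighborhood embeddings) is averaged with an attribute score such as the token overlap above. The embeddings, weights, and scores are invented for illustration and do not reproduce any specific method from the thesis.

```python
# Purely illustrative sketch of combining a structural and an attribute signal;
# the embeddings, weights, and scores below are made up for this example.
import math

def cosine(u, v):
    """Cosine similarity between two (hypothetical) graph embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings of the two laptop entities, learned from their
# graph neighborhoods (brand, components, reviews, ...).
emb_retailer_a = [0.9, 0.1, 0.4]
emb_retailer_b = [0.8, 0.2, 0.5]

structural = cosine(emb_retailer_a, emb_retailer_b)  # graph-based signal
attribute = 0.29                                      # e.g. the token overlap above
combined = 0.5 * structural + 0.5 * attribute         # hypothetical equal weighting

print(f"structural={structural:.2f}, attribute={attribute:.2f}, combined={combined:.2f}")
```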
Daniel Obraczka has been researching semantic data representation for almost ten years. From 2018 to 2024, he was a researcher at ScaDS.AI Dresden/Leipzig, and before that a student researcher at Agile Knowledge Engineering and Semantic Web (AKSW). During his PhD at ScaDS.AI Dresden/Leipzig, he worked in Prof. Erhard Rahm’s team in the topic area Data Quality and Data Integration. Before his PhD, he received a B.A. in Sociology as well as a B.Sc. and M.Sc. in Computer Science.
On April 24, 2025, he successfully defended his thesis. Unfortunately, he will no longer be a part of ScaDS.AI Dresden/Leipzig. Although Daniel Obraczka is moving from academia to industry, he’s staying in touch with the latest breakthroughs in data science. As a research engineer at Damedic, he is helping to simplify hospital billing.