
June 23, 2025
How do we know whether two data records from different sources refer to the same real-world item – such as a laptop listed under multiple names on several e-commerce sites? This question lies at the core of the PhD thesis Entity Resolution on Heterogeneous Knowledge Graphs by Daniel Obraczka. The recently completed thesis investigates how graph structure and machine learning can be combined to address one of the most fundamental problems in data integration.
In today’s interconnected digital landscape, integrating data from multiple, disparate sources is a challenging task. Knowledge graphs have proven to be among the most powerful tools of recent years for this purpose: by encoding entities and their relationships in a semantically rich way, they can represent complex structured information. However, a difficult challenge arises when multiple knowledge graphs are combined, each potentially using different structures, vocabularies, and assumptions: how can we quickly and reliably determine which entities refer to the same real-world object? This task is called entity resolution.
A concrete example is a price comparison platform for laptops that aggregates product data from multiple online retailers. One retailer may offer a model as Acer Aspire E1 Series, while another may list it as Acer Laptop E1-572-6459, each with different formats, features, or descriptions. Accurately recognizing that these entries refer to the same model is essential, but far from trivial: the information in each retailer’s knowledge graph may be structured so differently that the entries appear to describe entirely distinct entities. This is the core problem of entity resolution, and its complexity increases when the data is organized as a graph, where entities are linked to rich contextual information such as brands, components, or reviews.
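To make this concrete, the following sketch shows one very simple attribute-based signal for the two listings above: the token overlap of the normalized product titles. It is purely illustrative; the tokenization and the idea of feeding the score into a threshold or classifier are assumptions made for this example, not methods taken from the thesis.

```python
# Purely illustrative sketch (not code from the thesis): a token-based
# attribute similarity between two product titles.

def tokens(title: str) -> set[str]:
    """Lowercase a product title and split it into comparable tokens."""
    return set(title.lower().replace("-", " ").split())

def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the token sets of two titles, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

score = jaccard("Acer Aspire E1 Series", "Acer Laptop E1-572-6459")
print(f"token overlap: {score:.2f}")  # ≈ 0.29 for this pair
# In practice such a score would feed a threshold or a trained classifier,
# together with further signals from the graph (brand, components, reviews).
```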
Daniel Obraczka’s thesis presents several approaches that improve entity resolution in heterogeneous knowledge graphs by combining structural and attribute-based signals. To this end, he developed a number of methods and frameworks.
Every framework and method is validated with rigorous Bayesian statistical testing and released as an open-source library.
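To give an intuition of what combining structural and attribute-based signals can look like, here is a deliberately simplified sketch: a graph-based similarity (the cosine similarity of hypothetical neighborhood embeddings) is averaged with an attribute score such as the token overlap above. The embeddings, weights, and scores are invented for illustration and do not reproduce any specific method from the thesis.

```python
# Purely illustrative sketch of combining a structural and an attribute signal;
# the embeddings, weights, and scores below are made up for this example.
import math

def cosine(u, v):
    """Cosine similarity between two (hypothetical) graph embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings of the two laptop entities, learned from their
# graph neighborhoods (brand, components, reviews, ...).
emb_retailer_a = [0.9, 0.1, 0.4]
emb_retailer_b = [0.8, 0.2, 0.5]

structural = cosine(emb_retailer_a, emb_retailer_b)  # graph-based signal
attribute = 0.29                                      # e.g. the token overlap above
combined = 0.5 * structural + 0.5 * attribute         # hypothetical equal weighting

print(f"structural={structural:.2f}, attribute={attribute:.2f}, combined={combined:.2f}")
```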
Daniel Obraczka has been researching semantic data representation for almost ten years. From 2018 to 2024, he was a researcher at ScaDS.AI Dresden/Leipzig, and before that a student researcher at Agile Knowledge Engineering and Semantic Web (AKSW). During his PhD at ScaDS.AI Dresden/Leipzig, he worked in Prof. Erhard Rahm’s team in the topic area Data Quality and Data Integration. Before his PhD, he received a B.A. in Sociology as well as a B.Sc. and M.Sc. in Computer Science.
On April 24, 2025, he successfully defended his thesis. Unfortunately, he will no longer be a part of ScaDS.AI Dresden/Leipzig. Although Daniel Obraczka is moving from academia to industry, he’s staying in touch with the latest breakthroughs in data science. As a research engineer at Damedic, he is helping to simplify hospital billing.