Training Transformer-based Entity Matchers using the Web as Supervision
The adoption of schema.org annotations on the Web has increased sharply over the last decade, with hundreds of thousands of websites annotating data about products, events, local businesses, and job postings within their pages. Entity matching is a central challenge when integrating data from multiple data sources. In this talk, Christian Bizer will discuss how large quantities of training data for entity matchers can be derived from schema.org annotations on the Web. He will further demonstrate how Transformer-based matching methods exploit the rich training data that is available on the Web for head entities, and how contrastive pre-training and cross-language learning allow these methods to also excel at matching long-tail entities.
Christian Bizer explores technical and empirical questions concerning the development of global, decentralized information environments. His current research focus is the evolution of the World Wide Web from a medium for the publication of documents into a global dataspace. Christian Bizer initiated the W3C Linking Open Data community effort, which interlinks large numbers of data sources on the Web. He co-founded the DBpedia project, which derives a comprehensive knowledge base from Wikipedia. He also initiated the WebDataCommons project, which monitors the adoption of schema.org, RDFa, JSON-LD, and Microdata annotations on the Web by analyzing large web crawls. His technical research focuses on the integration of data from large numbers of data sources and includes topics such as information extraction, entity resolution, schema matching, data fusion, and data search.
Back to the Summer School 2022 overview