Blocking for Big Data Integration
Entity Resolution constitutes one of the cornerstone tasks for the integration of overlapping information sources. Due to its quadratic complexity, a large amount of research has focused on improving its efficiency so that it scales to Web Data collections, which are inherently voluminous and highly heterogeneous. The most common approach for this purpose is blocking, which clusters similar entities into blocks so that the pair-wise comparisons are restricted to the entities contained within each block. In this tutorial, we take a close look at blocking-based Entity Resolution, starting from the early, schema-based blocking methods that were crafted for database integration. We highlight the challenges posed by contemporary heterogeneous, noisy, voluminous Web Data and explain why they render these schema-based techniques inapplicable. We continue with the presentation of blocking methods that have been developed for large-scale and heterogeneous information and are suitable for Web Data collections. We also explain how their efficiency can be further improved by meta-blocking and parallelization techniques.
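To make the idea of blocking concrete, the following is a minimal sketch of one schema-agnostic variant, commonly called token blocking: every token found in any attribute value becomes a block key, and only entities that share a block are compared. The function names (`token_blocking`, `candidate_pairs`) and the four sample entities are hypothetical and chosen only for illustration; this is not presented as the specific method covered in the tutorial.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(entities):
    """Group entity ids by every token that appears in any attribute value.

    Attribute names are ignored, so the sources need not share a schema.
    """
    blocks = defaultdict(set)
    for entity_id, attributes in entities.items():
        for value in attributes.values():
            for token in str(value).lower().split():
                blocks[token].add(entity_id)
    # A block holding a single entity yields no comparisons, so drop it.
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Collect the distinct pairs of entities that co-occur in some block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical sample data: two sources describing the same people
# with different attribute names.
entities = {
    "e1": {"name": "Jack Miller", "city": "Boston"},
    "e2": {"fullName": "Miller Jack", "location": "Boston MA"},
    "e3": {"name": "Erick Green", "job": "teacher"},
    "e4": {"title": "Green Erick", "profession": "Teacher"},
}

blocks = token_blocking(entities)
pairs = candidate_pairs(blocks)
all_pairs = len(entities) * (len(entities) - 1) // 2

print(f"comparisons after blocking: {len(pairs)} of {all_pairs}")
```

On this toy input, blocking keeps only the pairs that share at least one token and discards the rest of the quadratic comparison space. Redundant blocks (the same pair appearing under several tokens) are collapsed here with a set; handling that redundancy more cleverly is the kind of problem meta-blocking addresses.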
Themis Palpanas is a professor of computer science at the Paris Descartes University (France), where he is a director of the Data Intensive and Knowledge Oriented Systems (diNo) group. He received the BS degree from the National Technical University of Athens, Greece, and the MSc and PhD degrees from the University of Toronto, Canada. He has previously held positions at the University of Trento and the IBM T.J. Watson Research Center. He has also worked for the University of California, Riverside, and visited Microsoft Research and the IBM Almaden Research Center. His research solutions have been implemented in world-leading commercial data management products, and he is the author of nine US patents. He is the recipient of three Best Paper awards (including ICDE and PERCOM) and of the IBM Shared University Research (SUR) Award in 2012, which recognizes research excellence at a worldwide level. He has been a member of the IBM Academy of Technology Study on Event Processing, and is a founding member of the Event Processing Technical Society. He has served as General Chair for VLDB 2013, the top international conference on databases, and is now Editor in Chief for the BDR Journal and Associate Editor for the TKDE Journal.