Data integration is a key challenge for Big Data applications that analyze large sets of heterogeneous data, ranging from structured database records to semi-structured entities from web sources or social networks. In many cases, a very large number of data sources must also be handled, e.g., product offers from many e-commerce websites. We will cover proposed approaches to the key data integration tasks of (large-scale) entity resolution and schema or ontology matching. In particular, we discuss parallel blocking and entity resolution on Hadoop platforms, together with load balancing techniques to deal with data skew. We also discuss recent approaches and challenges for the holistic integration of many data sources, e.g., to create knowledge graphs or to make use of huge collections of web tables.
Erhard Rahm is a full professor of databases at the computer science institute of the University of Leipzig, Germany. His current research focuses on Big Data and data integration. He has authored several books and more than 200 peer-reviewed journal and conference publications. His research on data integration and schema matching has received several awards, in particular the renowned 10-year best-paper award of the VLDB (Very Large Data Bases) conference series and the Influential Paper Award of the ICDE (International Conference on Data Engineering) conference series. Prof. Rahm is one of the two scientific coordinators of the new German center of excellence on Big Data, ScaDS (competence center for SCAlable Data services and Solutions) Dresden/Leipzig.
