Building the Dresden Web Table Corpus and Beyond
The Web has developed into a comprehensive resource not only for unstructured or semi-structured data, but also for relational data. Millions of relational tables embedded in HTML pages or published in the course of Open Data initiatives provide extensive information on entities and their relationships from almost every domain. Whereas in the past only big search engine companies were able to crawl and make use of this large volume of data, the situation changed with the advent of the Common Crawl Foundation, a non-profit foundation that crawls the Web and regularly publishes the resulting Web corpora for public use. We exploited these new opportunities and developed the Dresden Web Table Corpus (DWTC), consisting of 125 million unique tables.
In this talk, we will present an extensive Web table layout classification that enables us to identify the main layout categories of Web tables with very high precision. On top of that, we will outline two post-processing approaches to integrate Web tables in a lightweight manner. Specifically, we will discuss a novel approach to identify and extract column-specific information from the context of Web tables, as well as a normalization approach to decompose multi-concept Web tables into smaller single-concept tables.
Dr.-Ing. Maik Thiele is a postdoctoral researcher at the Database Systems Group in Dresden, where he finished his dissertation on “Quality-Driven Data Production Controlling in Real-Time DW Systems” in May 2010 and received his doctorate with distinction. He was a visiting scientist at UBS Zurich, GfK Nuremberg, and HP Labs Palo Alto. His research interests include large-scale data processing, information extraction, and data integration. Since 2013 he has also been working in the collaborative research center HAEC (Highly Adaptive Energy-Efficient Computing), coordinating the software projects.