Successful ScaDS Big Data School in Leipzig – a Report

26.07.2016 // SCADS

From the 11th to the 15th of July, our center hosted the second ScaDS Big Data School in Leipzig. The program attracted many students and young graduates, as well as other academic and industrial practitioners and researchers working in the field of Big Data. We were overwhelmed by the number of registrations and by the many speakers who supported our summer school. While we initially planned for 50 attendees, we finally counted 120 people in total, including speakers and short-term attendees throughout the week. Remarkably, more than 50% of the attendees were international, coming from all continents.

As we had envisioned, we had an inspiring mix of motivating keynotes, online trainings, classes and excursions. The ScaDS summer school taught the basics of working with large and complex amounts of data and provided an overview of relevant approaches and solutions. In particular, the practical sessions gave participants a first glimpse into the basics of storing, processing and analyzing big (graph) data.


On July 11th, we started our 2nd International ScaDS Big Data School at Leipzig University. Prof. Dr. Erhard Rahm welcomed around 100 guests and started as the first speaker with a presentation on “Big Data Integration”, introducing recent approaches and challenges for the holistic data integration of many data sources and discussing, for example, parallel blocking and entity resolution on Hadoop platforms.

Prof. Dr. Peter Boncz from the VU University Amsterdam continued with a presentation on “Benchmarking Graph Data Analysis with LDBC”. He introduced LDBC, an EU project in which he is involved as scientific director, and focused his talk on “choke-point”-based benchmark development (the Social Network Benchmark).

Prof. Dr. Peter Boncz and Dr. Eric Peukert shaking hands at the ScaDS Big Data School

After a break, Dr. Sherif Sakr from the King Saud bin Abdulaziz University for Health Sciences gave an overview of recent developments in his talk “Big Data 2.0 Processing Engines: The Time After Hadoop” and discussed directions for future research as well as the latest challenges that exceed the limitations of Hadoop frameworks. For the first evening, we organised a sightseeing tour to introduce our guests to our 1000-year-old city, which is known for historical events like the Monday demonstrations and for famous inhabitants and guests such as Gottfried Wilhelm Leibniz and Johann Wolfgang von Goethe.


The second day started under the headline “Big Data Storage/NoSQL”. The introductory presentation “NoSQL: State of the Art & New Developments” by Prof. Dr. Stefan Edlich (professor at the Beuth University of Applied Sciences, Berlin) covered the different NoSQL applications in the database landscape and how NoSQL will affect future approaches. Prof. Dr. Andreas Thor from the Leipzig University of Telecommunications (HfTL) went on with his talk “NewSQL, SQL on Hadoop”, in which he compared the query languages, explained how they can be applied to the Hadoop infrastructure, and gave an overview of NewSQL systems (e.g. VoltDB, Google Spanner). Another local speaker, Dr. Anika Groß, a postdoc at Leipzig University, focused her talk “NoSQL – Datastores for Big Data” on the different data models and technical architectures of NoSQL datastores, using Dynamo, an AP key-value store, and MongoDB, a CP document store, as examples.

Dr. Anika Groß presenting at the ScaDS Big Data School

After lunch, the practical sessions started. In three different groups, participants could attend a course on Text Mining, Genome Alignment Processing, or Logistics, and received an introduction to the system MongoDB. At the end of the day, we took our guests on a boat/canoe tour and finished the trip with a barbecue.


After this sporty excursion, we started the next day with presentations on “Distributed Data Processing”. First, Prof. Dr. Kai-Uwe Sattler of the TU Ilmenau spoke about Big Data stream processing. He gave a survey of recent processing engines and discussed their different architectures, execution models and programming interfaces. The next speaker, Tilmann Rabl, a research director at the Database Systems and Information Management (DIMA) group and technical coordinator of the Berlin Big Data Center (BBDC), introduced the open-source system Apache Flink in his talk “Distributed Data Processing and Streaming in Flink”. Flink enables fast and efficient analysis of both batch and streaming data.

A research assistant from the Center for Information Services and High Performance Computing at the TU Dresden continued with the talk “Introduction to Big Data Analytics on HPC clusters”, which focused on yet another approach. We then continued our practical courses with the system Apache Flink. The day ended with a dinner at the “Bayerischer Bahnhof”, a restaurant offering a wide range of international specialties and the locally brewed beer “Gose”.


On Thursday we welcomed Prof. Dennis Shasha of New York University. He introduced the topic “Graph Analytics” with his talk “Fast Methods for Finding Colored Motifs in Graphs”, focusing on the problem of finding subgraph patterns within a network. Next, Vasia Kalavri of KTH Stockholm introduced the Gelly framework in her talk “Graph processing on Apache Flink with the Gelly framework”. She showed how graph analysis tasks can be expressed using Flink operators and different graph processing models. Another approach to graph analytics was presented by Martin Junghanns, a researcher at Leipzig University, who explained the functionalities of Gradoop in his talk “Graph Analytics with Gradoop”.

The last speaker of the day, Prof. Sören Auer of the University of Bonn, spoke about “(Big) Knowledge Graphs”. He introduced the concept of knowledge graphs based on the RDF and Linked Data paradigms and discussed recent and future Big Knowledge Graph applications as well as strategies for combining Linked Data paradigms with Big Data. For the last practical sessions of this summer school, we introduced the previously mentioned Flink Gelly framework.


Finally, the last three speakers discussed different aspects of the topic “Big Data Integration”. Prof. Peter Christen of the Australian National University made the start with his talk “Privacy-Preserving Record Linkage (PPRL)”. Using real-world scenarios, he illustrated the significance of PPRL and showed how to apply it to large databases in Big Data environments. In the talk “Blocking for Big Data Integration”, Prof. Themis Palpanas of the Paris Descartes University (France) focused on blocking-based entity resolution and on blocking methods especially suited for Web Data collections, also giving a brief outlook on future applications. Dr. Maik Thiele, a postdoc in the Database Systems Group, finished with his talk “Building the Dresden Web Table Corpus and Beyond”, which focused on relational web tables, how to classify them, and how identifying different categories can improve their usability.

We would like to thank the speakers and guests for making this summer school a success. The evaluation of the feedback gave us a good impression of the general reception and valuable insights into how we can improve next time. We hope our guests had a pleasant stay in Leipzig and at Leipzig University, and we hope to see you again next year.

Participants of ScaDS Big Data School in 2016

Check out more news about ScaDS.AI Dresden/Leipzig at our Blog.