Evaluation and Comparison of Big Data ETL Frameworks

Type of thesis: Master's thesis / Location: Leipzig / Status: Finished

The process of extracting, transforming, and loading (ETL) data is a crucial step in the construction of a data warehouse. In the era of Big Data, new ETL frameworks have emerged that can handle very large amounts of data and process them in a distributed fashion. Popular examples are Pentaho http://www.pentaho.com, Alteryx http://www.alteryx.com/analytics/big-data-etl, and KNIME. These frameworks allow users to construct comprehensive ETL jobs, often supported by visual workflow tools, which are then executed on a big data infrastructure.

In this thesis we would like to get an overview of the state of the art in ETL for Big Data. We would like to pursue a functional comparison of existing offerings and select two frameworks for implementing an exemplary ETL pipeline. The student could build the ETL pipeline for bibliographic data from DBLP, the Microsoft Academic Graph, and similar sources, but other application areas could be tackled as well.

The work includes the following subtasks:

  • Overview of related work on ETL (and also ELT) for Big Data
  • Functional comparison of existing tools and offerings
  • Selection of two candidates, implementation of an exemplary ETL pipeline, and its evaluation
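To illustrate the kind of pipeline the thesis would implement, here is a minimal sketch of the extract-transform-load steps over bibliographic records. It assumes a simplified semicolon-delimited record layout with hypothetical field names; real DBLP exports are XML with a much richer structure, and a production pipeline would run distributed within one of the compared frameworks rather than in plain Python.

```python
# Sketch of an ETL pipeline for bibliographic records.
# The input format (title; authors; year) is a simplification
# chosen for illustration, not the actual DBLP schema.

def extract(raw_records):
    """Extract: parse raw semicolon-delimited lines into dicts."""
    for line in raw_records:
        title, authors, year = line.split(";")
        yield {
            "title": title.strip(),
            "authors": [a.strip() for a in authors.split(",")],
            "year": year.strip(),
        }

def transform(records):
    """Transform: normalize the year to int, drop records without a title."""
    for rec in records:
        if not rec["title"]:
            continue  # data cleaning: skip incomplete records
        rec["year"] = int(rec["year"])
        yield rec

def load(records, warehouse):
    """Load: append records to the target store (here: an in-memory list)."""
    for rec in records:
        warehouse.append(rec)

raw = [
    "Scalable ETL; A. Author, B. Writer; 2016",
    "; C. Nobody; 2015",  # no title -> filtered out in transform
]
warehouse = []
load(transform(extract(raw)), warehouse)
print(len(warehouse))  # -> 1
```

The three stages are deliberately written as composable generator functions; in Pentaho or KNIME the same structure would appear as connected nodes in a visual workflow.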

We promise close supervision by members of the Big Data Center ScaDS. In some cases we can offer student positions before or after the thesis to dive deeper into the topic.




Dr. Eric Peukert

Deputy Managing Director

Faculty of Mathematics and Computer Science

Universität Leipzig