Status: finished / Type of Thesis: Master thesis / Location: Leipzig
The process of extracting, transforming, and loading (ETL) data into a data warehouse is a crucial step in its construction. In times of Big Data, new ETL frameworks have emerged that can handle very large amounts of data and process them in a distributed fashion. Popular examples are Pentaho (http://www.pentaho.com), Alteryx (http://www.alteryx.com/analytics/big-data-etl), and KNIME. These frameworks allow the construction of comprehensive ETL jobs, often supported by visual workflow tools, which are then executed on a big data infrastructure.
In this thesis we would like to obtain an overview of the state of the art in ETL for Big Data. We would like to pursue a functional comparison of existing offerings and select two frameworks for implementing an exemplary ETL pipeline. The student could build the ETL pipeline for bibliographic data from DBLP, Microsoft Academic Graph, and similar sources, but other application areas could also be tackled.
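To make the goal more concrete, the following is a minimal sketch of how such an ETL pipeline could look when implemented with PySpark, one possible framework choice among the candidates to be compared. All file paths, column names, and the target schema are illustrative assumptions, not part of the thesis specification.

```python
# Minimal ETL sketch with PySpark (illustrative only: paths, column names,
# and the unified schema are assumptions, not a prescribed design).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bibliographic-etl").getOrCreate()

# Extract: read the raw dumps (assumed here to be pre-converted to JSON / TSV).
dblp = spark.read.json("hdfs:///raw/dblp_publications.json")
mag = spark.read.option("sep", "\t").option("header", True).csv("hdfs:///raw/mag_papers.tsv")

# Transform: normalise both sources to a common, hypothetical publication schema.
dblp_norm = dblp.select(
    F.col("title").alias("title"),
    F.col("year").cast("int").alias("year"),
    F.lit("dblp").alias("source"),
)
mag_norm = mag.select(
    F.col("PaperTitle").alias("title"),
    F.col("Year").cast("int").alias("year"),
    F.lit("mag").alias("source"),
)
publications = dblp_norm.unionByName(mag_norm).dropDuplicates(["title", "year"])

# Load: write the integrated table into the warehouse layer as Parquet.
publications.write.mode("overwrite").parquet("hdfs:///warehouse/publications")
```

A comparable pipeline would then be rebuilt in the second selected framework, so that both can be evaluated against the same sources and target schema.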
The work includes the following subtasks:
We promise close supervision by members of the Big Data Center ScaDS. In some cases we can offer student positions before or after the thesis to dive deeper into the topic.
Contact: