Text Mining with MongoDB, Flink and Flink Gelly
Text Mining is the task of extracting information from texts and text collections. One of the many workflows in Text Mining is the extraction of text re-use. Re-used text passages can indicate citations or language similarities. In this summer school workshop, students will work with a text re-use data set that contains Bible citations found in texts from the German Text Archive (Deutsches Textarchiv).
On the first day, the data will be explained in detail and imported into MongoDB. Students will learn how to work with MongoDB, how to insert data manually, and how to import data from a given input file. Using the imported text re-use data, the students will then learn how to retrieve information with increasingly complex queries.
On the second day, texts from the German Text Archive are processed with Apache Flink to calculate various statistics that are useful for Text Mining. Finally, Apache Flink's streaming capabilities are used to calculate statistics on text delivered by a text streaming service.
On the third day, Gelly is used to analyze the citation graph that is represented in the text re-use data. Students will learn the basic steps that are important when working with Gelly and finally use Gelly’s PageRank algorithm to find the most and least cited passages in the Bible.
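To give a rough idea of what ranking passages in a citation graph involves, the following is a minimal plain-Python sketch of PageRank by power iteration. It does not use Gelly or its API. The function name `pagerank`, the damping factor of 0.85, the even redistribution of rank from nodes without outgoing edges, and the toy edge list are all assumptions made for illustration and are not taken from the workshop data.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Assumption: rank held by nodes without outgoing edges
        # is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        # Each node passes its rank on in equal shares along its outgoing edges.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical toy data: each edge points from a citing text to a cited passage.
edges = [
    ("text_a", "John 3:16"),
    ("text_b", "John 3:16"),
    ("text_c", "John 3:16"),
    ("text_a", "Psalm 23:1"),
]
ranks = pagerank(edges)
ordered = sorted(ranks, key=ranks.get, reverse=True)
print(ordered)
```

Sorting the nodes by their score puts the most strongly cited passage first. In the workshop the same idea is applied to the real citation graph with Gelly rather than with hand-written code.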