Below is a list of available MSc. thesis topics. To apply, please send an email with motivation and transcript to simon.razniewski@tu-dresden.de.
| ID | Level | Topic | Keywords |
| 1 | MSc. | Saxon (“Sächsisch”) LLM creation | Data identification, LLM training |
| 2 | MSc. | Text-based KB construction and querying in the domain of historical battles | Textual information extraction, Knowledge base construction, Information retrieval and question answering |
| 3 | MSc. | Reverse-engineering Knowledge Bases | Simulation, KBs (no LLMs!) |
.
1. Saxon LLM creation
LLMs speak many languages and dialects, but so far not convincingly Saxon (“Sächsisch”), the traditional dialect in and around Dresden. The goal of this project is to make an open-weight model such as Llama speak Sächsisch. A major challenge is that there are no large ready-to-use corpora in that dialect. The thesis could take one of several routes: (1) extract a training corpus from a large web crawl, e.g., OSCAR or C4, (2) generate a training corpus by using existing strong LLMs to translate German into Saxon, for example via few-shot prompts, or (3) use rule-based system prompts to translate directly without tuning the model.
OSCAR multilingual dataset / data filtering pipeline: https://oscar-project.org/
For example rules of Saxon, see https://praxistipps.focus.de/saechsisch-lernen-diese-basics-sollten-sie-wissen_157089
For news on related approaches in other German states, see https://www.heise.de/meinung/G-schmeidig-wie-a-Brezn-Das-plant-Bayern-mit-seiner-eigenen-KI-9626031.html
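As a starting point for route (1), a crawl could be pre-filtered with a simple lexical heuristic before any heavier classification. The sketch below is only an illustration: the marker words, the function names `saxon_score` and `filter_candidates`, and the threshold are assumptions made for this example, not a vetted Saxon lexicon or part of any existing pipeline.

```python
import re

# Placeholder marker words (assumption for illustration only);
# a real project would replace these with a curated dialect lexicon.
SAXON_MARKERS = {"ooch", "nu", "nee", "ni", "bissl", "gugge"}


def saxon_score(line: str) -> float:
    """Return the fraction of tokens in `line` that are marker words."""
    tokens = re.findall(r"\w+", line.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in SAXON_MARKERS)
    return hits / len(tokens)


def filter_candidates(lines, threshold=0.2):
    """Keep lines whose marker-word fraction reaches `threshold`."""
    return [line for line in lines if saxon_score(line) >= threshold]


if __name__ == "__main__":
    sample = ["Das ist ein Satz.", "nee, das ooch ni"]
    for line in filter_candidates(sample):
        print(line)
```

Such a filter would likely produce many false positives (several markers also occur in other dialects), so its output is better seen as candidate text for manual inspection or a trained classifier than as a finished corpus.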
.
2. NLP on historical battles
Representation learning, knowledge bases, and information retrieval are backbones of AI, yet they often struggle with use cases of a more narrative nature, such as movies or stories. While those have received decent attention, this project focuses on another use case: military battles.
Battles form a pivotal element of history and often take a significant place in collective memory; see, e.g., the memorial of the Normandy Landings, the Battle of Hastings, or the Battle of the Nations in Leipzig. Yet so far, they are poorly covered by existing repositories like Wikidata and by LLMs, in particular with respect to modelling their characteristics, course, and similarity.
Examples of poorly supported queries are:
Battles are an interesting use case because they combine a strong narrative element (backstory, motivations, unfolding of events, consequences) with a mix of (semi-)structured facets such as date, location, involved parties, numeric strength, losses, and outcome.
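The mix of narrative and structured facets could be represented as a single hybrid record. The following is a minimal sketch under stated assumptions: the `Battle` class, its field names, and the `search` helper are hypothetical and chosen for illustration, and the narrative strings are one-sentence stand-ins for real source text.

```python
from dataclasses import dataclass, field


@dataclass
class Battle:
    # Structured facets named in the text above.
    name: str
    year: int
    location: str
    parties: list[str]
    outcome: str
    # Free-text narrative (backstory, course of events, consequences).
    narrative: str = ""
    # Numeric facets are often unknown or disputed, so they default to empty.
    strength: dict[str, int] = field(default_factory=dict)
    losses: dict[str, int] = field(default_factory=dict)


BATTLES = [
    Battle("Battle of Hastings", 1066, "near Hastings, England",
           ["Normans", "English"], "Norman victory",
           "William of Normandy defeated King Harold II."),
    Battle("Battle of Leipzig", 1813, "Leipzig, Saxony",
           ["France", "Sixth Coalition"], "Coalition victory",
           "Napoleon was defeated by a coalition of Russia, Prussia, "
           "Austria and Sweden."),
]


def search(battles, keyword, year_from=None, year_to=None):
    """Hybrid query: structured year filter plus keyword match on narrative."""
    keyword = keyword.lower()
    results = []
    for battle in battles:
        if year_from is not None and battle.year < year_from:
            continue
        if year_to is not None and battle.year > year_to:
            continue
        if keyword in battle.narrative.lower():
            results.append(battle)
    return results


if __name__ == "__main__":
    for hit in search(BATTLES, "defeated", year_to=1500):
        print(hit.name, hit.year)
```

Plain keyword matching is used here only to keep the sketch self-contained; a thesis would presumably replace it with proper retrieval over the narrative text.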
The goal of this project is three-fold:
These steps should both advance a specific use case and allow general insights into hybrid knowledge representation and querying in 2025.
.
3. Reverse-engineering knowledge bases
Motivation
Knowledge bases like Wikidata, DBpedia, and YAGO provide structured data about the world but represent only a biased and incomplete subset of reality. Despite many studies on bias and completeness, little is known about how the selection of entities and facts happens.
This work aims to model the data sampling process through which information from reality enters a knowledge base, focusing on the role of notability.
Research Questions
How can the process of selecting data about reality for KBs be modeled probabilistically?
Can such a model reproduce key empirical patterns found in Wikidata?
Can this model be inverted to estimate unbiased real-world statistics?
Approach
A two-stage probabilistic model simulates:
Each fact (characteristic) is defined by three notability parameters:
Evaluation
The model will be tested against three statistical phenomena observed in Wikidata:
Simulation results will be compared to Wikidata statistics, and the model parameters will be optimized using stochastic methods to fit real-world patterns.
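To give a feel for what such a simulation could look like, here is a toy sketch. Everything specific in it is an assumption made for illustration and not the model proposed above: a single latent notability score per entity, the linear `inclusion_prob`, the per-fact probability, and the function names. It first samples entities into the KB, then samples facts for the included entities, and finally applies inverse-probability weighting (a standard survey-sampling technique, swapped in here as one possible way to "invert" the model) to estimate the real-world population size.

```python
import random


def inclusion_prob(notability):
    # Hypothetical link between notability and entity inclusion.
    return 0.1 + 0.8 * notability


def simulate(n_entities=10_000, n_facts=5, fact_prob=0.6, seed=0):
    """Return (world, kb): all notability scores, and the sampled KB entries."""
    rng = random.Random(seed)
    world, kb = [], []
    for _ in range(n_entities):
        notability = rng.random()  # assumed latent notability in [0, 1)
        world.append(notability)
        # Stage 1: does the entity enter the KB at all?
        if rng.random() < inclusion_prob(notability):
            # Stage 2: each fact is recorded independently,
            # more likely for more notable entities (assumption).
            p_fact = fact_prob * notability
            recorded = sum(rng.random() < p_fact for _ in range(n_facts))
            kb.append((notability, recorded))
    return world, kb


def mean(values):
    return sum(values) / len(values)


def estimate_population(kb):
    """Inverse-probability-weighted estimate of the real-world entity count."""
    return sum(1.0 / inclusion_prob(notability) for notability, _ in kb)


if __name__ == "__main__":
    world, kb = simulate()
    print("entities in world:", len(world))
    print("entities in KB:", len(kb))
    print("mean notability (world):", round(mean(world), 3))
    print("mean notability (KB):", round(mean([n for n, _ in kb]), 3))
    print("estimated world size:", round(estimate_population(kb)))
```

In this toy setting the inclusion probabilities are known by construction, which is what makes the inversion straightforward; with real Wikidata they would have to be estimated by fitting the model.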
Methodology
Data: Wikidata human entities and biographical facts.
Implementation: Python simulations and optimization.
Evaluation metrics: Correlation, distribution fit, completeness measures.