

Theses

Below is a list of available MSc. thesis topics. To apply, please send an email with motivation and transcript to simon.razniewski@tu-dresden.de.

ID | Level | Topic                                                                        | Keywords
1  | MSc.  | Saxon (“Sächsisch”) LLM creation                                             | Data identification, LLM training
2  | MSc.  | Text-based KB construction and querying in the domain of historical battles | Textual information extraction; knowledge base construction; information retrieval and question answering
3  | MSc.  | Reverse-engineering knowledge bases                                          | Simulation, KBs (no LLMs!)


1. Saxon LLM creation

LLMs speak many languages and dialects, but so far, not convincingly the traditional dialect in and around Dresden, Saxon (“Sächsisch”). The goal of this project is to make an open-weight model such as Llama speak Saxon. A major challenge is that there are no large ready-to-use corpora in that dialect. The thesis could take one of several routes: (1) extract a training corpus from a large web crawl, e.g., OSCAR or C4; (2) generate a training corpus by using existing strong LLMs to translate German to Saxon, for example via few-shot prompts; (3) use rule-based system prompts to translate directly, without tuning the model.
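A minimal sketch of route (1), assuming the Hugging Face datasets loader for OSCAR (linked below; the exact config name varies by release) and a crude keyword heuristic — an actual pipeline would more likely use a trained dialect classifier:

    # Route (1), sketched: stream a German web crawl and keep documents that
    # contain several Saxon dialect markers. The config name and the marker
    # list are illustrative assumptions, not a vetted setup.
    from datasets import load_dataset

    SAXON_MARKERS = ["nu", "ni", "gugge", "bemme", "nischel", "ooch", "dor", "mor"]

    def looks_saxon(text: str, min_hits: int = 3) -> bool:
        tokens = text.lower().split()
        return sum(tokens.count(m) for m in SAXON_MARKERS) >= min_hits

    stream = load_dataset("oscar", "unshuffled_deduplicated_de",
                          split="train", streaming=True)
    saxon_corpus = (doc["text"] for doc in stream if looks_saxon(doc["text"]))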

OSCAR multilingual dataset / data filtering pipeline: https://oscar-project.org/

For examples of rules about Saxon, see, e.g., https://praxistipps.focus.de/saechsisch-lernen-diese-basics-sollten-sie-wissen_157089
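A sketch of route (3), assuming a handful of rules like those in the link above (e.g., the commonly cited softening of hard consonants) are encoded in a system prompt; query_llm is a hypothetical wrapper around whichever chat API is used:

    # Route (3), sketched: no model tuning, just a rule-carrying system prompt.
    # The rules below are a small illustrative subset, not a complete rule set.
    SAXON_SYSTEM_PROMPT = (
        "You translate Standard German into the Saxon dialect ('Sächsisch'). "
        "Apply rules such as:\n"
        "- Soften hard consonants: k -> g, p -> b, t -> d "
        "(e.g., 'Kartoffel' -> 'Gardoffel').\n"
        "- Keep the meaning unchanged; adapt only spelling and vocabulary."
    )

    def translate_to_saxon(german_text: str, query_llm) -> str:
        # query_llm(system, user) -> str is a stand-in for the chosen LLM API.
        return query_llm(SAXON_SYSTEM_PROMPT, german_text)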

For news on related approaches in other German states, see https://www.heise.de/meinung/G-schmeidig-wie-a-Brezn-Das-plant-Bayern-mit-seiner-eigenen-KI-9626031.html


2. NLP on historical battles

Representation learning, knowledge bases and information retrieval are backbones of AI, yet they often struggle with use cases of a more narrative nature, such as movies or stories. While those have received decent attention, in this project we focus on another use case: military battles.

Battles form a pivotal element of history, and often take a significant place in collective memory, see, e.g., the memorials to the Normandy Landings, the Battle of Hastings, or the Battle of the Nations in Leipzig. Yet so far, they are poorly covered by existing repositories like Wikidata, and by LLMs, in particular with respect to modelling their characteristics, course, and similarity.

Examples of poorly supported queries are:

  • Battles similar to the Battle of the Nations
  • Roman-era battles that were most lopsided
  • Sieges unfolding similarly to the first siege of Vienna
  • List of battles where elephants were involved
  • Battles by narrative (e.g., encirclement, break-through, feigned retreat, successful rearguard action, …)

Battles are an interesting use case because they involve both a strong narrative element (backstory, motivations, unfolding of events, consequences) and a mix of (semi-)structured facets like date, location, involved parties, numeric strength, losses, and outcome.

The goal of this project is three-fold:

  1. To develop a representation model for military battles that captures both (semi-)structured facets (keywords, knowledge base) and latent ones (embeddings); see the sketch below.
  2. To populate this model with content at scale, using Wikipedia as the source and LLMs as extractors.
  3. To build and deploy a query interface, and to evaluate it against direct LLM querying.

These steps should both advance a specific use case and allow general insights into hybrid knowledge representation and querying in 2025.
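As a concrete starting point for goal 1, a minimal sketch of what a hybrid battle representation could look like; all field names and the example figures are illustrative assumptions, not a fixed schema:

    # Hybrid representation sketch: explicit (semi-)structured facets next to
    # a latent embedding. Fields and figures are illustrative only.
    from dataclasses import dataclass, field

    @dataclass
    class Battle:
        name: str
        date: str                      # ISO date or range, e.g. "1813-10-16/1813-10-19"
        location: str
        parties: list[str]             # involved belligerents
        strength: dict[str, int]       # party -> numeric strength
        losses: dict[str, int]         # party -> losses
        outcome: str
        narrative_tags: list[str]      # e.g. ["encirclement", "feigned retreat"]
        embedding: list[float] = field(default_factory=list)  # latent facet

    leipzig = Battle(
        name="Battle of the Nations",
        date="1813-10-16/1813-10-19",
        location="Leipzig",
        parties=["France", "Sixth Coalition"],
        strength={"France": 200_000, "Sixth Coalition": 380_000},  # rough figures
        losses={"France": 70_000, "Sixth Coalition": 54_000},
        outcome="Coalition victory",
        narrative_tags=["encirclement"],
    )

Structured queries (e.g., battles involving elephants, Roman-era battles) would run over the explicit facets, while similarity queries (e.g., "battles similar to the Battle of the Nations") would use the embedding.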


3. Reverse-engineering knowledge bases

Motivation

Knowledge bases like Wikidata, DBpedia, and YAGO provide structured data about the world, but represent only a biased and incomplete subset of reality. Despite many studies on bias and completeness, little is known about how the selection of entities and facts happens. This work aims to model the data sampling process through which information from reality enters a knowledge base, focusing on the role of notability.

Research Questions

  • How can the process of selecting data about reality for KBs be modeled probabilistically?
  • Can such a model reproduce key empirical patterns found in Wikidata?
  • Can this model be inverted to estimate unbiased real-world statistics?

Approach

A two-stage probabilistic model simulates:

  • Entity filtering: deciding if an entity is notable enough to be included.
  • Characteristic filtering: deciding which facts of that entity are recorded.

Each fact (characteristic) is defined by three notability parameters:

  • Entity notability (influences inclusion),
  • Characteristic self-notability (probability of being recorded),
  • Characteristic all-notability (influence on overall detail).
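A minimal simulation sketch of the two-stage model; the concrete functional forms (logistic-style inclusion, independent Bernoulli recording) and all parameter values are illustrative assumptions, not the thesis design:

    # Two-stage sampling sketch: entity filtering, then characteristic filtering.
    import random

    def simulate_entity(char_params, entity_notability):
        """char_params: list of (self_notability, all_notability) per characteristic.
        Returns indices of recorded facts, or None if the entity is filtered out."""
        # Stage 1: entity filtering -- more notable entities are more likely included.
        p_include = entity_notability / (1.0 + entity_notability)
        if random.random() > p_include:
            return None
        # Stage 2: characteristic filtering -- each fact is recorded with a
        # probability driven by its self-notability, boosted by the entity's
        # notability via the all-notability weight.
        recorded = []
        for i, (self_not, all_not) in enumerate(char_params):
            p_record = min(1.0, self_not + all_not * entity_notability)
            if random.random() < p_record:
                recorded.append(i)
        return recorded

    # Toy run: 10,000 entities with heavy-tailed notability, 5 characteristics.
    chars = [(0.2, 0.1), (0.5, 0.05), (0.1, 0.2), (0.8, 0.0), (0.05, 0.3)]
    population = [random.paretovariate(2.0) - 1.0 for _ in range(10_000)]
    kb = [f for e in population if (f := simulate_entity(chars, e)) is not None]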

Evaluation

The model will be tested against three statistical phenomena observed in Wikidata:

  • Notability–Wealth Correlation: more notable entities have more facts.
  • Exponential Fact Distribution: the number of facts per entity follows an exponential curve.
  • Weak Completeness–Notability Link: some notable properties are incomplete.

Simulation results will be compared to Wikidata statistics and optimized using stochastic methods to fit real-world patterns.
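A rough sketch of how the first two phenomena could be checked on the toy simulation above, assuming NumPy and SciPy; population, chars, and simulate_entity refer to the earlier sketch:

    # Check two target phenomena on the simulated KB (illustrative only).
    import numpy as np
    from scipy import stats

    # Pair each included entity's notability with its number of recorded facts.
    pairs = [(e, len(f)) for e in population
             if (f := simulate_entity(chars, e)) is not None]
    notability, n_facts = map(np.array, zip(*pairs))

    # Notability-wealth correlation: more notable entities should have more facts.
    r, p = stats.pearsonr(notability, n_facts)

    # Exponential fact distribution: fit an exponential, then a KS goodness-of-fit test.
    loc, scale = stats.expon.fit(n_facts)
    ks = stats.kstest(n_facts, "expon", args=(loc, scale))
    print(f"correlation r={r:.3f} (p={p:.2e}), KS statistic={ks.statistic:.3f}")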

Methodology

  • Data: Wikidata human entities and biographical facts.
  • Implementation: Python simulations and optimization.
  • Evaluation metrics: correlation, distribution fit, completeness measures.
