Supervisor

Prof. Dr. Simon Razniewski

Chair of Knowledge-Aware Artificial Intelligence

TUD Dresden University of Technology

simon.razniewski@tu-dresden.de

Estimating object set sizes with LLMs and species sampling techniques

Status: open / Type of thesis: Master's thesis / Location: Dresden

The size of many real-world sets, such as “Physicists”, “Pasta types”, or “Hard rock bands”, is not easy to determine.

A survey of approaches can be found in [1]. Most notably, species sampling is a technique from ecology, where one repeatedly samples the set, and computes size estimates based on the amount of overlap between samples.
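To make the overlap idea concrete, the following is a minimal sketch of two standard estimators from the species sampling literature: the Chapman (bias-corrected Lincoln-Petersen) estimator for two samples, and the bias-corrected Chao1 estimator for pooled observations. The function names and the toy data are illustrative assumptions, and these are not necessarily the estimators evaluated in [2].

```python
from collections import Counter


def chapman_estimate(sample_a, sample_b):
    """Chapman (bias-corrected Lincoln-Petersen) estimate from two samples.

    The smaller the overlap between the two samples, the larger the
    estimated size of the underlying set.
    """
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)
    return (len(a) + 1) * (len(b) + 1) / (overlap + 1) - 1


def chao1_estimate(samples):
    """Bias-corrected Chao1 estimate from several samples.

    Every mention of an item in a sample counts as one observation.
    f1 = items observed exactly once, f2 = items observed exactly twice.
    """
    counts = Counter(item for sample in samples for item in sample)
    observed = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    f2 = sum(1 for c in counts.values() if c == 2)
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))


# Toy data (illustrative only, not from any experiment).
two_samples = chapman_estimate({"a", "b", "c", "d"}, {"c", "d", "e", "f"})
many_samples = chao1_estimate([["a", "b", "c"], ["b", "c", "d"], ["c", "e"]])
```

In both cases, many items seen only once (little overlap) push the estimate well above the number of distinct items observed, while heavy overlap keeps it close to that number.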

A critical issue is how to obtain such samples. Luggen et al. [2] used edit logs from Wikidata for this purpose, with somewhat underwhelming results. Since then, LLMs have emerged as a prominent tool for knowledge extraction, and obtaining multiple samples from an LLM is easy: one prompts it repeatedly with the same prompt, using a higher sampling temperature so that the outputs vary.
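A minimal sketch of how such repeated prompting could be turned into samples is shown below. It assumes a hypothetical callable `ask` that sends a prompt to an LLM (at a non-zero temperature) and returns the raw text of a list-style answer; no real LLM API is used here, and `fake_llm` is only a random stand-in so that the sketch runs on its own. The normalization in `parse_items` is deliberately simple and would likely need entity matching in practice.

```python
import random


def parse_items(raw_text):
    """Turn a list-style answer into a set of normalized item names."""
    items = set()
    for line in raw_text.splitlines():
        # Drop bullet markers and surrounding whitespace, ignore case.
        name = line.strip().lstrip("-*• ").strip().lower()
        if name:
            items.add(name)
    return items


def collect_samples(ask, prompt, n_samples):
    """Query `ask` repeatedly with the same prompt; one item set per call."""
    return [parse_items(ask(prompt)) for _ in range(n_samples)]


# Hypothetical stand-in for an LLM call: returns 5 random items per prompt.
PASTA = ["Penne", "Fusilli", "Spaghetti", "Farfalle", "Rigatoni",
         "Linguine", "Orzo", "Tagliatelle", "Ravioli", "Gnocchi"]
_rng = random.Random(0)


def fake_llm(prompt):
    return "\n".join(f"- {name}" for name in _rng.sample(PASTA, 5))


samples = collect_samples(fake_llm, "List some pasta types.", 4)
```

The resulting list of item sets is the kind of input a species sampling estimator expects, with each LLM response playing the role of one sampling round.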

The goal of this thesis is to investigate whether species sampling on LLM outputs is a promising avenue towards set size estimation. The experimental setup can reuse major components from [2], but with LLM outputs instead of Wikidata edit logs as input.

References

[1] Razniewski, Simon, et al. “Completeness, Recall, and Negation in Open-World Knowledge Bases: A Survey.” ACM Computing Surveys (2024).

[2] Luggen, Michael, et al. “Non-parametric class completeness estimators for collaborative knowledge graphs: the case of Wikidata.” The Semantic Web–ISWC 2019 (2019).
