Automated Question Generation from Wikipedia and Wikidata

Type of thesis: Master's thesis / Location: Dresden / Status of thesis: Open theses

This thesis proposes an automated question generation (AQG) system that draws on the textual knowledge of Wikipedia and the structured data of Wikidata. The goal is a framework that produces a high-quality dataset of questions across a wide range of topics. The focus is on a time-sensitive dataset: questions are selected within a specified timespan, so that they are diverse, relevant, and time-dependent. This interdisciplinary project integrates natural language processing (NLP), machine learning (ML) techniques, and knowledge graphs.


Objectives

  • Conduct a thorough literature review of existing datasets to identify gaps and distinguish the proposed methodology from existing approaches.
  • Develop a system for efficiently retrieving and processing data from Wikipedia and Wikidata, with an emphasis on temporal relevance and content filtration.
  • Utilize and assess various NLP and ML models for question generation, including, but not limited to, LLMs and sequential models, to ensure the creation of questions that are both complex and diverse.
  • Classify the generated questions based on their complexity, such as single-hop versus multi-hop questions.
    • The expected complexity of the questions follows the Mintaka dataset [single hop, domain, type of queries].
  • Produce a dataset containing over 10,000 high-quality questions that can be answered using both Wikipedia and Wikidata.
  • Implement a comprehensive evaluation methodology to assess the quality of the generated dataset.
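
The complexity classification named in the objectives can be illustrated with a minimal sketch. It assumes that each generated question stores the knowledge-graph triples needed to reach its answer; the `GeneratedQuestion` class, the `reasoning_path` field, and the rule "one triple is single-hop, more than one is multi-hop" are assumptions made for this example, not a definition prescribed by the topic.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A triple is (subject, property, object). Plain labels are used here
# instead of Wikidata identifiers to keep the example readable.
Triple = Tuple[str, str, str]


@dataclass
class GeneratedQuestion:
    """Hypothetical container for one generated question."""
    text: str
    answer: str
    reasoning_path: List[Triple]  # triples needed to reach the answer


def classify_hops(question: GeneratedQuestion) -> str:
    """Label a question by the number of triples in its reasoning path."""
    hops = len(question.reasoning_path)
    if hops == 0:
        raise ValueError("question has no supporting triples")
    return "single-hop" if hops == 1 else "multi-hop"


single = GeneratedQuestion(
    text="What is the capital of Germany?",
    answer="Berlin",
    reasoning_path=[("Germany", "capital", "Berlin")],
)
multi = GeneratedQuestion(
    text="On which continent is the country whose capital is Berlin?",
    answer="Europe",
    reasoning_path=[
        ("Berlin", "capital of", "Germany"),
        ("Germany", "continent", "Europe"),
    ],
)

print(classify_hops(single), classify_hops(multi))
```

A finer-grained scheme (for example, separate labels for comparison or counting questions) would need additional fields; this sketch only covers the single-hop versus multi-hop distinction.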

Research Questions

  • How can we efficiently retrieve temporally relevant articles from Wikipedia and corresponding structured data from Wikidata?
  • In what ways can Wikipedia and Wikidata be integrated to generate questions of varying complexity effectively?
  • What methodologies can be adopted to ensure the generated questions meet the desired standards of relevance, accuracy, and complexity?
  • How should the system’s performance be evaluated to ensure the generated content’s quality?
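
As a starting point for the first two research questions, one possible approach is a SPARQL query against the Wikidata Query Service that restricts items to a timespan and links each item to its English Wikipedia article. The sketch below only builds the query string and does not send it. The function name `build_temporal_query` and the choice of the property `P585` (point in time) as the temporal criterion are assumptions for illustration; other date properties may fit the thesis better.

```python
from datetime import date

# Public endpoint of the Wikidata Query Service (not contacted in this sketch).
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_temporal_query(start: date, end: date, limit: int = 100) -> str:
    """Build a SPARQL query for items dated in [start, end) that also have
    an English Wikipedia article. Hypothetical helper, for illustration."""
    if start > end:
        raise ValueError("start must not be after end")
    start_ts = f"{start.isoformat()}T00:00:00Z"
    end_ts = f"{end.isoformat()}T00:00:00Z"
    return f"""
SELECT ?item ?itemLabel ?date ?article WHERE {{
  ?item wdt:P585 ?date .
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> .
  FILTER(?date >= "{start_ts}"^^xsd:dateTime &&
         ?date < "{end_ts}"^^xsd:dateTime)
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
""".strip()


query = build_temporal_query(date(2023, 1, 1), date(2024, 1, 1), limit=10)
print(query)
```

The resulting string can be sent to the endpoint with any HTTP client using the `query` parameter and a JSON result format; rate limits and a descriptive User-Agent header should be taken into account when doing so.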


Methodology

  • Data Retrieval: APIs are used to retrieve recent and relevant articles from Wikipedia and the corresponding data from Wikidata, with filtering mechanisms that ensure content relevance and timeliness.
  • Question Generation: The project will explore a range of NLP and ML methodologies, including utilizing LLMs, to generate questions. The process will account for the complexity of the questions and their alignment with educational standards.
  • Evaluation: A robust evaluation framework will be established, combining automated verification methods with human assessment to validate the questions' quality and relevance. This will include comparisons with existing AQG systems and datasets to benchmark the system's performance and contributions to the field.
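
The automated part of the evaluation step could begin with simple rule-based checks before any human assessment. The following sketch is one possible set of checks and not the evaluation methodology of the thesis: it drops questions that do not end with a question mark, pairs with an empty answer, questions that contain their own answer, and near-duplicates. The function names and the dictionary keys `question` and `answer` are assumptions for this example.

```python
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase the text and keep only word tokens, joined by single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))


def automated_checks(pairs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the question-answer pairs that pass all rule-based checks."""
    seen = set()
    kept = []
    for pair in pairs:
        question = pair["question"].strip()
        answer = pair["answer"].strip()
        key = normalize(question)
        if not question.endswith("?"):
            continue  # not phrased as a question
        if not answer:
            continue  # no answer to verify against
        if f" {normalize(answer)} " in f" {key} ":
            continue  # the answer appears verbatim in the question
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        kept.append(pair)
    return kept


candidates = [
    {"question": "What is the capital of Germany?", "answer": "Berlin"},
    {"question": "what is the capital of germany ?", "answer": "Berlin"},
    {"question": "Berlin is the capital of which country", "answer": "Germany"},
    {"question": "Which city named Berlin is the capital of Germany?", "answer": "Berlin"},
]

valid = automated_checks(candidates)
print(len(valid), "of", len(candidates), "candidates kept")
```

Checks like these only remove obviously flawed items. Whether an answer is factually correct, and whether a question is relevant, still has to be verified against Wikipedia and Wikidata and through human assessment.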

Expected Contributions

  • A novel AQG system that effectively utilizes Wikipedia and Wikidata for generating questions, contributing to the fields of educational technology and AI training.
  • A substantial dataset of over 10,000 questions that spans a variety of topics and complexities, ready for educational use and further research.
  • Insights into the applicability of various NLP and ML techniques for AQG, providing a foundation for future advancements in the field.
  • A comprehensive evaluation of the generated questions, offering a detailed analysis of their educational utility and paving the way for ongoing improvements in AQG systems.


Dr. Sahar Vahdati

TU Dresden

Nature-Inspired Machine Intelligence

Preetam Gattogi

TU Dresden

Nature-Inspired Machine Intelligence