GPTKB: Building Very Large Knowledge Bases from Language Models

Title: GPTKB

Project duration: 8/2024 – open-ended

Research Area: Understanding Language

Logo. GPTKB.

The GPTKB project investigates building a large, general-domain knowledge base (KB) generated entirely from a large language model (LLM). It illustrates the feasibility of creating large-scale knowledge bases from LLMs, addressing key challenges like entity recognition, entity and property canonicalization, and taxonomy construction.

Aims

The project aims to demonstrate that LLMs, specifically GPT-4o-mini, can serve as viable sources for large-scale KB construction, offering novel insights into LLM knowledge representation for NLP. Additionally, it explores new approaches for building general-domain KBs for the Semantic Web.

Problem

Traditional knowledge base construction (KBC) methods are often costly and resource-intensive, relying on complex, largely manual processes. The project seeks to address these limitations by leveraging LLMs for KB generation, significantly reducing the cost and time needed compared to prior approaches.

Practical Example

Using GPT-4o-mini, we have constructed a prototype KB that contains more than 105M triples for 2.9M entities. A web interface allows users to search for entities by name and view the associated triples. It also supports more advanced queries through SPARQL, and the entire KB can be downloaded as a TTL file (800 MB), making it accessible for a range of research and application purposes.

Screenshot. Web interface of GPTKB.
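As a minimal sketch of programmatic access, the following Python snippet issues a SPARQL query over HTTP. The endpoint URL is an assumption (see gptkb.org for the actual service details); the query itself uses only generic SPARQL and an illustrative entity filter.

```python
# Minimal sketch of a programmatic SPARQL lookup against GPTKB.
# The endpoint URL below is a hypothetical assumption, not the confirmed service path.
import requests

ENDPOINT = "https://gptkb.org/sparql"  # hypothetical endpoint URL

QUERY = """
SELECT ?s ?p ?o WHERE {
  ?s ?p ?o .
  FILTER(CONTAINS(STR(?s), "Dresden_Zoo"))   # illustrative entity filter
}
LIMIT 20
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=60,
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```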

Technology

At the core of the approach is a prompt that elicits triples for a given subject entity. This prompt is executed recursively, so as to crawl the KB out of the LLM, and in a highly parallelized manner, so that execution is feasible at million-entity scale. The prompt itself consists of two components: first, the LLM estimates the number of triples it knows for the subject; second, it produces that number of triples. The returned triples are then passed through several canonicalization steps, in particular concerning entities, relations, classes, and taxonomy construction.
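The following Python sketch illustrates the idea of the recursive crawl. The prompt wording, the `parse_triples` and `looks_like_entity` helpers, and the generic `llm(prompt) -> str` callable are illustrative assumptions, not the project's actual implementation; the real system additionally parallelizes the LLM calls and canonicalizes entities, relations, and classes afterwards.

```python
# Minimal sketch of the recursive, prompt-driven KB crawl (assumptions noted above).
from collections import deque

def parse_triples(subject, raw):
    # Expect one "predicate; object" pair per line (placeholder output format).
    triples = []
    for line in raw.splitlines():
        parts = [p.strip() for p in line.split(";")]
        if len(parts) == 2 and all(parts):
            triples.append((subject, parts[0], parts[1]))
    return triples

def looks_like_entity(value):
    # Crude placeholder heuristic: treat capitalized objects as named entities.
    return value[:1].isupper()

def crawl(seed, llm, max_entities=1_000_000):
    frontier = deque([seed])
    seen = {seed}
    triples = []
    while frontier and len(seen) <= max_entities:
        subject = frontier.popleft()
        # Step 1: the LLM estimates how many triples it knows for the subject.
        count = llm(f"How many facts do you know about {subject}? Answer with a single number.")
        # Step 2: it is asked to produce that many triples.
        raw = llm(f"List {count.strip()} facts about {subject}, one 'predicate; object' pair per line.")
        new_triples = parse_triples(subject, raw)
        triples.extend(new_triples)
        # Objects that look like named entities are queued for further crawling.
        for _, _, obj in new_triples:
            if looks_like_entity(obj) and obj not in seen:
                seen.add(obj)
                frontier.append(obj)
    return triples
```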

Outlook

GPTKB sets a landmark for NLP and Semantic Web research. It provides a constructive understanding of LLM-based knowledge structures and paves the way for more efficient, scalable KB construction techniques. Future directions may involve enhancing entity accuracy and taxonomy development in KBs created by LLMs.

Publications

  • Yujia Hu, Shrestha Ghosh, Tuan-Phong Nguyen, Simon Razniewski (2024). GPTKB: Building Very Large Knowledge Bases from Language Models. arXiv: http://arxiv.org/abs/2411.04920

Team

Lead

Team Members

  • Yujia Hu
  • Akash Kumar Gautam
  • Shrestha Ghosh (Tübingen University)
  • Tuan-Phong Nguyen (MPI for Informatics Saarbrücken)

Partners

FAQ

What is GPTKB, and what inspired the project?

GPTKB is a large-scale, general-domain knowledge base, built entirely from data generated by a large language model. Our team was inspired by the vast knowledge that LLMs possess, as observed by other researchers (e.g., Petroni et al., 2019). We wanted to see whether LLMs could be used as a cost-effective source for constructing a comprehensive knowledge base covering a vast range of general knowledge domains.

How does GPTKB differ from traditional knowledge bases?

Traditional knowledge bases, like Wikidata or DBpedia, are constructed through manual curation or extraction from structured sources, which is resource-intensive and at the same time limits their scope. GPTKB, in contrast, was generated purely from an LLM and at a fraction of the cost, roughly 100 times less than prior knowledge base construction projects. It contains over 105 million triples about 2.9 million entities, covering general knowledge as contained in the pre-training corpora of the LLM in a more automated and scalable way than traditional methods.

What are the benefits of this approach?

One major benefit is cost and time efficiency. LLMs are already trained on vast amounts of information, so generating a KB is more about guiding the LLM to extract knowledge systematically. Another benefit is that this approach offers new insights into what LLMs “know” or believe, a field that has recently gained popularity as LLMology. GPTKB gives us a window into how the LLM organizes and connects knowledge, which is valuable for NLP research. For the Semantic Web, it also presents a new path forward in tackling the challenge of building broad, general-domain KBs.

What are the limitations?

The quality of the resulting KB, in particular its accuracy, is far from that of existing projects, and from what downstream use cases typically require. In particular, the LLM generates many incorrect facts, and even more facts that cannot be verified online at all. For example, for Dresden Zoo, it claims that the zoo houses African Elephants (correct), but also Siberian Tigers (incorrect), and for Karl May, it claims an influence of Mark Twain, which we could neither verify nor refute through online research. We believe substantially more work is needed to filter and consolidate the resulting knowledge.

What can GPTKB be used for?

GPTKB has applications in NLP research, where it can provide insights into LLM knowledge organization and serve as a resource for studying LLM knowledge retrieval. In the Semantic Web community, it could support projects needing extensive general-domain data. It is also designed to be accessible: anyone can search for entities, browse information, and even run SPARQL queries to explore specific relationships.

What surprised you most?

We were surprised by how much knowledge (or how many beliefs) an LLM of fairly moderate size already possesses. GPT-4o-mini is estimated to have around 8B parameters, yet gave us 105M triples, which amounts to about 76 parameters per triple. We were also surprised by the topical distribution: we obtained many triples on patents and historical persons, but comparatively few on, for instance, recent scientists.
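A quick back-of-the-envelope check of that ratio, taking the 8B-parameter figure as an assumption:

```python
# Parameters-per-triple estimate; the 8B model size is an external estimate, not a confirmed figure.
parameters = 8e9
triples = 105e6
print(round(parameters / triples))  # ~76 parameters per triple
```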

How can I access GPTKB?

GPTKB is open to the public at gptkb.org. Users can search for specific entities, view relationships, or download the whole KB in TTL format. We have also made the paper and code available for those who want to dig deeper into the research and potentially contribute to future developments.
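For those who download the dump, a minimal sketch of inspecting it locally with rdflib follows; the file name "gptkb.ttl" is a hypothetical placeholder, and parsing the full ~800 MB Turtle file into memory this way requires patience and several GB of RAM.

```python
# Minimal sketch: loading the downloaded GPTKB Turtle dump with rdflib.
from rdflib import Graph

g = Graph()
g.parse("gptkb.ttl", format="turtle")  # hypothetical local file name

print(len(g), "triples loaded")
for s, p, o in list(g)[:10]:  # print a small sample of triples
    print(s, p, o)
```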

Funded by:
Bundesministerium für Bildung und Forschung (Federal Ministry of Education and Research).
Freistaat Sachsen (Free State of Saxony).