Data Sources & Models

As part of our commitment to open data, which goes beyond providing infrastructure for data and model sharing, we have also released several large-scale datasets, including:

  • the Archive Query Log (AQL), a comprehensive query log collected at the Internet Archive over the last 25 years; it includes 357 million queries, 306 million search result pages, and 2.6 billion search results across 550 search providers, making it the largest publicly released query log to date (https://www.tira.io/task/archive-query-log).
  • the Webis-STEREO-21 dataset, the largest collection of scientific text reuse in open-access publications; it contains more than 91 million cases of reused text passages found in 4.2 million unique open-access publications, covers a wide range of scientific disciplines and varieties of reuse, and provides comprehensive metadata to contextualize each case (https://webis.de/data/webis-stereo-21.html).

We have further open-sourced several codebases related to model training and information retrieval, including:

Team

  • Harrisen Scells
  • Christopher Schröder
  • Lukas Gienapp
  • Prof. Dr. Martin Potthast (ScaDS.AI)

Funded by:

  • the Federal Ministry of Education and Research (Bundesministerium für Bildung und Forschung)
  • the Free State of Saxony (Freistaat Sachsen)