Project lead

BeReMo

Title: BeReMo – Benchmark Biochemischer Reasoning Modelle (Benchmark of Biochemical Reasoning Models)

Project duration: 01.01.2026 – 31.12.2026

Research Area: Bioinformatics/Life Science

Large language models (LLMs) have recently transformed biological and biotechnological research. Scientists can use them in agentic frameworks to draft projects, review literature and data, execute state-of-the-art tools, and evaluate results. LLMs also help scientists combine knowledge from different domains and can accelerate scientific discovery. However, they need to be evaluated thoroughly, because the choice of a specific LLM or reasoning model can strongly influence an AI agent's performance, and a structured comparison of models in this setting is still missing.

In BeReMo, we aim to close this gap by building a benchmark for the structured comparison of different LLMs within a biotechnology AI agent developed by the start-up AI-Driven Therapeutics GmbH (AI-DT). ScaDS.AI Dresden/Leipzig contributes domain expertise in AI and data science and supports the methodological design of evaluation strategies. It also provides access to high-performance computing infrastructure for large-scale model benchmarking and analysis.

Aims

The objective of BeReMo is to comprehensively analyze the performance and feasibility of LLMs within an AI agent specialized in protein engineering. The specific strengths and weaknesses of representative LLMs and reasoning models will be assessed in terms of efficiency and accuracy. Ultimately, the benchmark data can be used to fine-tune models for context-dependent LLM usage, allowing for robust, transparent and trustworthy communication between AI agent and scientist.

Problems

AI agents are powerful tools in biotechnological research, giving scientists access to broad fields of expertise. However, there is currently no comprehensive benchmark that evaluates how different LLMs affect agent behavior in specific topics such as protein engineering. Because the underlying LLM can profoundly influence an agent's output and communication, this effect needs to be evaluated. To close this gap, the project aims to establish a benchmark that measures the performance of different LLMs in the CoScientist, a protein engineering AI agent developed by the start-up AI-DT.

How the CoScientist, the protein engineering AI agent developed by AI-DT, works.
Screenshot of a chat with the CoScientist.

Technology

The project combines LLMs and reasoning models with domain-specific datasets and evaluation pipelines. Benchmarking is based on application-driven test cases, performance metrics, and automated workflows. The system integrates agent-based architectures, parameter optimization, and data-driven evaluation methods, assessing aspects of protein design, like thermostability and affinity enhancement of protein binders.
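The evaluation approach described above (test cases, performance metrics, automated workflow) can be illustrated with a minimal sketch. It is not the BeReMo pipeline: the `TestCase` type, the `run_benchmark` function, the stand-in "models" and the toy prompts are all hypothetical and serve only to show how accuracy and efficiency could be recorded per model under identical test cases.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TestCase:
    prompt: str
    expected: str  # reference answer used for scoring (toy placeholder)


def run_benchmark(
    models: Dict[str, Callable[[str], str]],
    cases: List[TestCase],
) -> Dict[str, Dict[str, float]]:
    """Run every model on every test case; report accuracy and mean latency."""
    results = {}
    for name, model in models.items():
        n_correct = 0
        total_time = 0.0
        for case in cases:
            start = time.perf_counter()
            answer = model(case.prompt)
            total_time += time.perf_counter() - start
            # Exact-match scoring is an assumption; a real benchmark would
            # likely need task-specific metrics.
            if answer.strip().lower() == case.expected.strip().lower():
                n_correct += 1
        results[name] = {
            "accuracy": n_correct / len(cases),
            "mean_latency_s": total_time / len(cases),
        }
    return results


# Hypothetical stand-ins for real LLM calls: plain functions from prompt to answer.
cases = [TestCase("toy prompt 1", "yes"), TestCase("toy prompt 2", "no")]
reference = {"toy prompt 1": "yes", "toy prompt 2": "no"}
models = {
    "always_yes": lambda prompt: "yes",
    "lookup": lambda prompt: reference[prompt],
}

results = run_benchmark(models, cases)
for name, metrics in results.items():
    print(name, metrics)
```

Treating each model as a plain callable keeps the harness independent of any particular LLM provider, so the same test cases and metrics apply to every candidate.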

Outlook

Ultimately, this project aims to identify the best-performing LLMs and reasoning models for biotechnological applications, enabling the targeted selection of models for specific use cases. BeReMo will provide a foundation for standardized AI evaluation in biotechnology. Furthermore, it will develop a transparent platform for LLM evaluation in the field of protein engineering.

Team

Lead

Prof. Dr. Jens Meiler

Leipzig University

Institute for Drug Discovery

Team Members

Johanna Möller

Leipzig University

Partner

Funded by:
Federal Ministry of Education and Research (Bundesministerium für Bildung und Forschung).
Free State of Saxony (Freistaat Sachsen).