Title: BeReMo – Benchmark Biochemischer Reasoning Modelle
Project duration: 01.01.2026 – 31.12.2026
Research Area: Bioinformatics/Life Science
Large language models (LLMs) have recently transformed biological and biotechnological research. Within agentic frameworks, scientists can use them to draft projects, review literature and data, execute state-of-the-art tools, and evaluate results. LLMs also help combine knowledge across domains and can accelerate scientific discovery. However, they need to be thoroughly evaluated, because the choice of a specific LLM or reasoning model can strongly influence an AI-agent’s performance.
In BeReMo, we aim to close this gap by building a benchmark for the structured comparison of different LLMs within an AI-agent for biotechnology developed by the start-up AI-Driven Therapeutics GmbH (AI-DT). ScaDS.AI Dresden/Leipzig contributes domain expertise in AI and data science and supports the methodological design of evaluation strategies. It also provides access to high-performance computing infrastructure for large-scale model benchmarking and analysis.
The objective of BeReMo is to comprehensively analyze the performance and feasibility of LLMs within an AI-agent specialized in protein engineering. The specific strengths and weaknesses of representative LLMs and reasoning models will be assessed with respect to efficiency and accuracy. Ultimately, the benchmark data can be used to fine-tune models for context-dependent LLM usage, enabling robust, transparent and trustworthy communication between AI-agent and scientist.
AI-agents are powerful tools in biotechnological research, giving scientists access to broad fields of expertise. As of now, however, there is no comprehensive benchmark evaluating the impact of different LLMs on agent behavior across topics such as protein engineering. Because the choice of LLM can profoundly influence agent output and communication, this impact needs to be evaluated. To close this gap, the project aims to establish a benchmark of the performance of different LLMs in a protein engineering AI-agent developed by the start-up AI-DT (the CoScientist).


The project combines LLMs and reasoning models with domain-specific datasets and evaluation pipelines. Benchmarking is based on application-driven test cases, performance metrics, and automated workflows. The system integrates agent-based architectures, parameter optimization, and data-driven evaluation methods, assessing aspects of protein design such as thermostability and affinity enhancement of protein binders.
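As an illustration only, the idea of scoring several models on the same application-driven test cases could be sketched as follows. This is a minimal sketch under stated assumptions, not the BeReMo pipeline: the names TestCase, evaluate and rank_models, the exact-match metric, the toy prompts and the stub models are all hypothetical and stand in for real agent calls and real evaluation data.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Tuple


@dataclass
class TestCase:
    """One task with a known reference answer (hypothetical structure)."""
    prompt: str
    expected: str


def evaluate(model: Callable[[str], str], cases: List[TestCase]) -> float:
    """Return the fraction of test cases answered correctly (exact match)."""
    return mean(
        1.0 if model(case.prompt).strip().lower() == case.expected.lower() else 0.0
        for case in cases
    )


def rank_models(
    models: Dict[str, Callable[[str], str]], cases: List[TestCase]
) -> List[Tuple[str, float]]:
    """Score every model on the same cases and sort from best to worst."""
    scores = {name: evaluate(fn, cases) for name, fn in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy placeholder data, invented for illustration; not real benchmark content.
cases = [
    TestCase("Which mutation stabilizes toy protein X?", "A10V"),
    TestCase("Which mutation improves binding of toy binder Y?", "G5W"),
]

# Stub "models": plain functions standing in for LLM-backed agent calls.
stub_models = {
    "model_a": lambda prompt: "A10V",
    "model_b": lambda prompt: "A10V" if "protein X" in prompt else "G5W",
}

ranking = rank_models(stub_models, cases)
print(ranking)
```

Exact-match accuracy is used here only because it is the simplest metric to show; a real benchmark would likely combine several metrics, for example accuracy together with runtime or cost, which is in line with the efficiency and accuracy criteria named in the project objective.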
Ultimately, this project aims to identify the best-performing LLMs and reasoning models for biotechnological applications, enabling the targeted selection of models for specific use cases. BeReMo will provide a foundation for standardized AI evaluation in biotechnology. Furthermore, it will develop a transparent platform for LLM evaluation in the field of protein engineering.
Institute for Drug Discovery
