Resource and energy consumption of transformer-based omics language models

Type of thesis: Master's thesis / Location: Dresden / Status of thesis: Open

The use of transformer-based [1] language models in the life sciences and in medicine has grown rapidly over the last two years [2]. Because their vocabularies differ from those of natural-language text, large language models pre-trained on genome, exome, or proteome data, such as DNABERT [3], require further investigation with respect to their resource and energy consumption. The project comprises the following tasks:

  1. Investigate the GPU and energy usage of pre-trained omics transformer-based models and, in turn, better understand their similarities to and differences from text transformer models (a rough measurement sketch follows this list).
  2. Explore the differences between language models trained on different omics data types.
  3. Discuss possible improvements to their energy and GPU resource utilization.
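As a first orientation for task 1, the following minimal sketch samples GPU power draw and memory usage while running inference with a pre-trained omics language model. It assumes the torch, transformers, and pynvml packages are installed and a CUDA GPU is available; the checkpoint name is only a placeholder for a DNABERT-style model, and the actual models, sequence lengths, and measurement protocol would be defined during the thesis.

```python
import torch
import pynvml
from transformers import AutoModel, AutoTokenizer

# NVML gives access to per-GPU power and memory counters.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

# Placeholder checkpoint name (assumption); replace with the model under study.
model_name = "zhihan1996/DNABERT-2-117M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).cuda().eval()

# Dummy DNA input; real experiments would sweep sequence length and batch size.
sequence = "ACGT" * 128
inputs = tokenizer(sequence, return_tensors="pt").to("cuda")

readings = []
with torch.no_grad():
    for _ in range(100):
        _ = model(**inputs)
        power_mw = pynvml.nvmlDeviceGetPowerUsage(handle)      # milliwatts
        mem_used = pynvml.nvmlDeviceGetMemoryInfo(handle).used  # bytes
        readings.append((power_mw, mem_used))

avg_power_w = sum(p for p, _ in readings) / len(readings) / 1000.0
peak_mem_gib = max(m for _, m in readings) / 1024**3
print(f"avg power: {avg_power_w:.1f} W, peak GPU memory: {peak_mem_gib:.2f} GiB")

pynvml.nvmlShutdown()
```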

The supervisors are also open to applications for projects other than a Master's thesis.

[1] Vaswani et al., "Attention Is All You Need", Advances in Neural Information Processing Systems, 2017.

[2] Zhang et al., "Applications of transformer-based language models in bioinformatics: a survey", Bioinformatics Advances, 2023.

[3] Ji et al., "DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome", Bioinformatics, 2021.

Counterpart

Neringa Jurenaite, TU Dresden
Machine Learning, Data Analytics, Living Lab

Andrei Politov, TU Dresden
Machine Learning and Data Analytics, Deep Learning, Distributed Machine Learning, HPC Services, Living Lab
