
Resource and energy consumption of transformer-based omics language models

Status: open / Type of thesis: Master's thesis / Location: Dresden

The use of transformer-based [1] language models in the life sciences and medicine has grown exponentially over the last two years [2]. Because their vocabulary differs from the typical text alphabet, large language models pre-trained on genome, exome, or proteome data, such as DNABERT [3], require further investigation of their resource and energy consumption. During this project, the tasks will include:

  1. Investigate the GPU and energy usage of pre-trained omics transformer-based models in order to better understand their similarities to, and differences from, text transformer models (a measurement sketch follows this list).
  2. Explore the differences between language models trained on different omics data types.
  3. Discuss possible improvements to their energy and GPU resource utilization.
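As a starting point for task 1, GPU memory and power draw can be sampled while an omics model runs inference. The following is a minimal sketch, assuming an NVIDIA GPU with the torch, transformers, and nvidia-ml-py (pynvml) packages installed; the DNABERT checkpoint name, the toy DNA sequence, and the number of forward passes are illustrative assumptions, not part of the project description.

    # Minimal sketch: sample GPU memory and power during inference of a pre-trained
    # omics transformer. Assumptions: an NVIDIA GPU; torch, transformers, and
    # nvidia-ml-py (pynvml) installed; checkpoint name and DNA sequence are placeholders.
    import time
    import torch
    import pynvml
    from transformers import AutoTokenizer, AutoModel

    MODEL_NAME = "zhihan1996/DNA_bert_6"  # assumed DNABERT checkpoint; swap for the model under study

    pynvml.nvmlInit()
    gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME).to("cuda").eval()
    print("vocabulary size:", tokenizer.vocab_size)  # k-mer vocabulary vs. ~30k wordpieces for text BERT

    # DNABERT represents DNA as overlapping k-mers rather than words or subwords.
    sequence = "ATGCGTACGTTAGCATCGATCGTACGATCGGATTACA"
    kmers = " ".join(sequence[i:i + 6] for i in range(len(sequence) - 5))
    inputs = tokenizer(kmers, return_tensors="pt").to("cuda")

    torch.cuda.reset_peak_memory_stats()
    power_w = []
    start = time.time()
    with torch.no_grad():
        for _ in range(100):  # repeat to obtain a stable power reading
            model(**inputs)
            power_w.append(pynvml.nvmlDeviceGetPowerUsage(gpu) / 1000.0)  # mW -> W
    torch.cuda.synchronize()
    elapsed = time.time() - start

    peak_mem_gib = torch.cuda.max_memory_allocated() / 1024**3
    avg_power = sum(power_w) / len(power_w)
    energy_wh = avg_power * elapsed / 3600  # rough board-level energy estimate

    print(f"peak GPU memory: {peak_mem_gib:.2f} GiB")
    print(f"avg power: {avg_power:.1f} W, energy for 100 forward passes: {energy_wh:.4f} Wh")

    pynvml.nvmlShutdown()

A more systematic study could replace the in-loop polling with NVML sampling in a background thread or dedicated tools, and repeat the same procedure for a text transformer of comparable size (task 1) and for models trained on other omics data types (task 2).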

 

[1] Vaswani et al., "Attention Is All You Need", Advances in Neural Information Processing Systems, 2017.

[2] Zhang et al., "Applications of transformer-based language models in bioinformatics: a survey", Bioinformatics Advances, 2023.

[3] Ji et al., "DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome", Bioinformatics, 2021.

Funded by:
Funded by the German Federal Ministry of Education and Research (Bundesministerium für Bildung und Forschung).
Funded by the Free State of Saxony (Freistaat Sachsen).