The use of transformer-based [1] language models in the life sciences and medicine has grown exponentially in the last two years [2]. Because their vocabularies differ from the typical text alphabet, large language models pre-trained on genome, exome, or proteome data, such as DNABERT [3], warrant further investigation of their resource and energy consumption. During this project, the tasks will include:
- Investigate the GPU and energy usage of pre-trained omics transformer-based models and, in turn, better understand their similarities to and differences from text transformer models (a minimal measurement sketch follows this list).
- Explore the differences between language models trained on different omics data types.
- Discuss possible improvements to their energy and GPU resource utilization.
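As a minimal sketch of the kind of measurement the first task involves, the Python snippet below samples GPU power draw and memory via NVML while running repeated inference with a Hugging Face checkpoint. The checkpoint name is illustrative (a DNABERT-style model is assumed), and instantaneous power samples only approximate true energy use; dedicated tools such as CodeCarbon offer more complete accounting.

```python
import time

import pynvml
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint name; any Hugging Face model works the same way.
MODEL_NAME = "zhihan1996/DNA_bert_6"

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).cuda().eval()

# Overlapping 6-mers of a DNA sequence, the tokenization DNABERT-style
# vocabularies expect; text models would take ordinary sentences here.
sequence = "ATCGGC TCGGCA CGGCAT GGCATC GCATCG CATCGA"
inputs = tokenizer(sequence, return_tensors="pt").to("cuda")

power_samples = []
start = time.time()
with torch.no_grad():
    for _ in range(100):  # repeat inference to get a stable power reading
        model(**inputs)
        # nvmlDeviceGetPowerUsage returns milliwatts; convert to watts.
        power_samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000)
torch.cuda.synchronize()
elapsed = time.time() - start

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
avg_power = sum(power_samples) / len(power_samples)
print(f"GPU memory used: {mem.used / 2**20:.0f} MiB")
print(f"Average draw:    {avg_power:.1f} W over {elapsed:.1f} s")
print(f"Energy estimate: {avg_power * elapsed / 3600:.3f} Wh")

pynvml.nvmlShutdown()
```

Comparing such readings across checkpoints trained on different omics data types (the second task) then largely reduces to swapping the checkpoint name and the input tokenization.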
The supervisors are open to applications for projects other than a master's thesis.
[1] Vaswani et al., "Attention is all you need", Advances in Neural Information Processing Systems, 2017.
[2] Zhang et al., "Applications of transformer-based language models in bioinformatics: a survey", Bioinformatics Advances, 2023.
[3] Ji et al., "DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome", Bioinformatics, 2021.