Contact

Prof. Dr. Peter Stadler

Chair of Bioinformatics

Leipzig University

Peter.Stadler@bioinf.uni-leipzig.de

Machine Learning-based Gene Annotation in a Genome-wide Context

Title: Machine Learning-based Gene Annotation in a Genome-wide Context

Duration: 3 years

Research Area: Genomics, RNA biology, Machine learning, Bioinformatics

Most subdisciplines of biomedical research rely heavily on the annotation and comparison of genes and their functions. Accurately annotating a new genome is, however, not trivial: it usually relies on an approximation, either using phylogenetically close species as a reference or annotating imperfect transcription data. Because Machine Learning methods are a key technology for handling the huge amounts of available genomic data, we develop and implement the Svhip software framework for the training, parameter tuning and application of Machine Learning models for genome data classification. The core of the Svhip framework is based on three concepts:

Automation of all steps that do not have to be carried out manually, including the automated removal of spurious or low-information input data and the preprocessing to a unified format.
Interpretability and reproducibility of the output using standardized formats like .csv files and Clustal alignments, as well as automated reporting that can be easily included in other workflows.
Customizability of the learning process itself.

While we implemented a number of Machine Learning features for direct use, the experimenter can in principle define, include and share any number of additional features to improve detection or to cover alternative use cases. As feature engineering is probably the most important aspect of generating reliable models, we will also adapt two specialized features from a single-sequence to an alignment-wide usage context. These will focus in particular on the estimation of coding potential. As preliminary testing indicates that training set composition is the second most important factor contributing to prediction accuracy, further efforts will be made to optimize the automation of training set composition. The primary goal here is to balance a representative data set against the exclusion of redundant instances.
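The trade-off between a representative training set and the exclusion of redundant instances can be illustrated with a minimal sketch. It assumes pre-aligned sequences of equal length and a simple greedy identity cut-off. The function names, the threshold and the example sequences are hypothetical and are not taken from Svhip.

```python
from __future__ import annotations


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical columns between two aligned sequences.

    Columns in which both sequences carry a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def filter_redundant(seqs: dict[str, str], max_identity: float = 0.95) -> dict[str, str]:
    """Greedily keep a sequence only if it is not too similar to any kept one.

    The cut-off max_identity is an illustrative assumption, not a Svhip default.
    """
    kept: dict[str, str] = {}
    for name, seq in seqs.items():
        if all(pairwise_identity(seq, other) <= max_identity for other in kept.values()):
            kept[name] = seq
    return kept


# Hypothetical toy input: "b" duplicates "a" and is removed.
example = {
    "a": "ACGUACGU",
    "b": "ACGUACGU",
    "c": "ACGUACGA",
    "d": "UUUUCCCC",
}
selected = filter_redundant(example, max_identity=0.9)
print(sorted(selected))
```

A greedy filter of this kind depends on input order; clustering-based selection would be an alternative when order effects matter.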

Aims

With the project “Machine Learning-based Gene Annotation in a Genome-wide Context”, we want to improve the overall annotation state of functional RNAs by developing pipelines and toolkits for automated de novo discovery across a wide range of species and genomes.

Problem

Inferring biological function from a nucleotide sequence alone is a problem as yet unsolved in modern biology. One approach to mitigating this issue lies in analysing evolutionary patterns of conservation: if a given secondary structure element remains largely unchanged despite changes in the underlying sequence, i.e. potentially disruptive nucleotide changes are ‘corrected’ by compensating changes at the pairing position, this can be considered a strong indicator of functional importance. Measuring this effect and evaluating its statistical significance is a core goal of this project.
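As a rough illustration of the effect described above, the following sketch counts, for each base pair of a consensus structure, how many distinct canonical pair types occur in an alignment and whether any two of them differ at both positions (a compensatory change). It assumes an ungapped toy alignment and a dot-bracket consensus structure. All names and data are hypothetical, and the sketch only counts events; it does not assess their statistical significance.

```python
# Watson-Crick and G-U wobble pairs, written as (5' base, 3' base).
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def parse_pairs(structure):
    """Return sorted (i, j) index pairs from a dot-bracket string."""
    stack = []
    pairs = []
    for idx, ch in enumerate(structure):
        if ch == "(":
            stack.append(idx)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: unexpected ')'")
            pairs.append((stack.pop(), idx))
    if stack:
        raise ValueError("unbalanced structure: unclosed '('")
    return sorted(pairs)


def covariation_summary(alignment, structure):
    """For each paired column (i, j), report the number of distinct canonical
    pair types and whether a compensatory double change is observed."""
    summary = []
    for i, j in parse_pairs(structure):
        observed = {(seq[i].upper(), seq[j].upper()) for seq in alignment}
        canonical = observed & CANONICAL
        compensatory = any(
            p[0] != q[0] and p[1] != q[1] for p in canonical for q in canonical
        )
        summary.append((i, j, len(canonical), compensatory))
    return summary


# Hypothetical toy alignment of a small hairpin.
toy_alignment = ["GGCAAAGCC", "GACAAAGUC", "GGGAAACCC"]
toy_structure = "(((...)))"
for row in covariation_summary(toy_alignment, toy_structure):
    print(row)
```

In this toy input, the outermost pair is invariant, while the two inner pairs each show two canonical pair types that differ at both positions, which is the pattern interpreted as evidence for structural conservation.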

Technology

We use a combination of applied statistics, established bioinformatics tools and machine learning methods, the latter primarily for the classification of prospective genomic areas into coding, noncoding or silent regions. Recently, we also started to use genetic algorithms to evolve synthetic structured RNAs in silico, an approach that yields interesting results for the generation of artificial training and testing sets.

Outlook

We hope that the project “ML-based Gene Annotation in a Genome-wide Context” will contribute significantly to the identification and annotation of as yet unannotated genes and to the understanding of the interplay between evolutionary structure conservation and biological function.

Publications

Klapproth, C., Sen, R., Stadler, P. F., Findeiß, S., & Fallmann, J. (2021). Common features in lncRNA annotation and classification: a survey. Non-coding RNA, 7(4), 77.
Corona-Gomez, J. A., Coss-Navarrete, E. L., Garcia-Lopez, I. J., Klapproth, C., Pérez-Patiño, J. A., & Fernandez-Valverde, S. L. (2022). Transcriptome-guided annotation and functional classification of long non-coding RNAs in Arabidopsis thaliana. Scientific Reports, 12(1), 14063.
Klapproth, C., Zötzsche, S., Kühnl, F., Fallmann, J., Stadler, P. F., & Findeiß, S. (2023). Tailored machine learning models for functional RNA detection in genome-wide screens. NAR Genomics and Bioinformatics, 5(3), lqad072.

Team

Lead

Prof. Dr. Peter Stadler

Team Members

Christopher Klapproth
Sven Findeiß

Funded by the German Federal Ministry of Education and Research (Bundesministerium für Bildung und Forschung).

ScaDS.AI Dresden/Leipzig (Center for Scalable Data Analytics and Artificial Intelligence) is a center for Data Science, Artificial Intelligence and Big Data with locations in Dresden and Leipzig.