Title: Machine Learning-based Gene Annotation in a Genome-wide Context
Duration: 3 years
Research Area: Genomics, RNA biology, Machine learning, Bioinformatics
Most subdisciplines of biomedical research rely heavily on the annotation and comparison of genes and their functions. Accurately annotating new genomes is, however, not trivial: it usually relies on approximation, either by using phylogenetically close species as a reference or by annotating imperfect transcription data. Because Machine Learning methods have emerged as a key technology for handling the huge amounts of available genomic data, we are developing the Svhip software framework for the training, parameter tuning and application of Machine Learning models for genome data classification. The core of the Svhip framework is based on three concepts:
While we have implemented a number of Machine Learning features for direct use, the experimenter can in principle define, include and share any number of additional features to improve detection or to cover alternative use cases. As feature engineering is probably the most important aspect of generating reliable models, we will also devise two specialized features and adapt them from a single-sequence to an alignment-wide context. These will focus in particular on the estimation of coding potential. As preliminary testing indicates that training set composition is the second most important factor for prediction accuracy, further effort will go into automating the training set composition process. The primary goal here is to balance a representative data set against the exclusion of redundant instances.
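The trade-off between a representative training set and the exclusion of redundant instances can be illustrated with a minimal sketch. It assumes equal-length (for example, already aligned) sequences and a simple greedy filter on pairwise identity; the function names and the 0.8 threshold are hypothetical and are not taken from Svhip.

```python
def pairwise_identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def filter_redundant(seqs: list[str], max_identity: float = 0.8) -> list[str]:
    """Greedily keep a sequence only if it is not too similar to any kept one.

    Hypothetical helper: max_identity is an illustrative threshold,
    not a value used by the Svhip framework.
    """
    kept: list[str] = []
    for s in seqs:
        if all(pairwise_identity(s, k) <= max_identity for k in kept):
            kept.append(s)
    return kept


candidates = ["ACGUACGUAC", "ACGUACGUAC", "ACGUACGUAA", "UUUUUGGGGG"]
training_set = filter_redundant(candidates, max_identity=0.8)
print(training_set)
```

A lower threshold removes more near-duplicates but risks discarding informative variants, which is the balance described above.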
With the project “Machine Learning-based Gene Annotation in a Genome-wide Context”, we want to improve the overall annotation state of functional RNAs by developing pipelines and toolkits for their automated de novo discovery in a wide range of species and genomes.
Inferring biological function from a nucleotide sequence alone remains an unsolved problem in modern biology. One approach to mitigating this issue lies in analysing evolutionary trends of conservation: if a given secondary structure element remains largely unchanged despite simultaneous changes in the underlying sequence, i.e. potentially disruptive nucleotide changes are ‘corrected’, this can be considered a strong indicator of functional importance. Measuring this effect and evaluating its statistical significance is a core goal of this project.
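As a rough illustration of the conservation signal described here, the following sketch counts, for each base pair of a consensus structure, how many distinct canonical pair types occur across an alignment and how many sequences break the pair. It assumes a gap-free, uppercase RNA alignment and a dot-bracket consensus structure; the function names are hypothetical, and the raw counts are not a significance test.

```python
# Canonical Watson-Crick pairs plus the G-U wobble pair.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}


def parse_pairs(structure: str) -> list[tuple[int, int]]:
    """Return the base-paired positions encoded in a dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return sorted(pairs)


def pair_support(alignment: list[str], structure: str) -> dict:
    """Map each base pair to (distinct canonical pair types, broken pairs).

    Several distinct canonical types with few broken pairs suggest that
    sequence changes were compensated so that the structure is preserved.
    """
    support = {}
    for i, j in parse_pairs(structure):
        observed = [(seq[i], seq[j]) for seq in alignment]
        canonical_types = {p for p in observed if p in CANONICAL}
        broken = sum(1 for p in observed if p not in CANONICAL)
        support[(i, j)] = (len(canonical_types), broken)
    return support


structure = "((...))"
alignment = ["GCAAAGC", "GGAAACC", "AUAAAAU", "GCAAAAC"]
support = pair_support(alignment, structure)
print(support)
```

In this toy alignment the outer pair switches between G-C and A-U without ever breaking, which is the kind of pattern the paragraph above treats as evidence of functional importance.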
We use a combination of applied statistics, established bioinformatics tools and machine learning methods, the latter primarily for the classification of prospective genomic areas into coding, noncoding or silent regions. Recently, we also started to use genetic algorithms to evolve synthetic structured RNAs in silico, an approach that yields interesting results for the generation of artificial training and testing sets.
We hope that the project “ML-based Gene Annotation in a Genome-wide Context” will contribute significantly to the identification and annotation of as yet unannotated genes and to the understanding of the interplay between evolutionary structure conservation and biological function.