DNA damage and mutations are distributed non-randomly across genomes. To learn the specificity of mutagenesis, we have built machine learning cascade models that combine classification and regression tasks. Trained on >6,000 cancer genomes using the immediate sequence context of mutagenesis and metadata, the models predict the preferential location of mutations across the genomes’ functional elements. How well mutagenic specificity can be learned depends strongly on the tissue: comparing observed with predicted mutation rates in coding sequence, we achieve a correlation (Spearman’s R) of 0.88 in esophageal adenocarcinoma but only 0.41 in prostate cancer; over all tissues, the correlation is 0.89.
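The observed-versus-predicted comparison reported above is a rank correlation. The following is a minimal sketch of how such a Spearman correlation can be computed with the Python standard library only; the function names and the per-region rates are hypothetical illustrations, not the models or data described in this abstract.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(observed, predicted):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(observed), ranks(predicted))


# Hypothetical mutation rates per genomic region (illustrative values only).
observed = [0.8, 1.9, 3.1, 4.2, 5.0]
predicted = [1.0, 2.5, 2.0, 4.0, 6.0]
rho = spearman(observed, predicted)
```

Because only ranks enter the calculation, the measure rewards a model for ordering regions correctly by mutation rate, even if the predicted absolute rates are off.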
We use these models to interrogate mutagenic behaviors through hypothesis-driven simulations. For example, we found that in esophageal adenocarcinoma, mutagenesis derived from oxidative DNA damage is reduced in coding sequence; yet as the overall contribution of this mutagenic mechanism increases, ostensibly unrelated mutation types increasingly accumulate in coding sequence.
Through complementary modelling of DNA sequence-dependent mutagenesis at single-nucleotide resolution using bidirectional long short-term memory (LSTM) recurrent neural networks, we can pinpoint where such mutations are most likely to occur. The performance of these models currently lies >30-fold above random assignment, even for individual patient samples.
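The abstract does not state how performance relative to random assignment is measured. One common reading, assumed here purely for illustration, is the precision among the top-scoring positions divided by the overall fraction of mutated positions. The function name, scores, and labels below are hypothetical.

```python
def fold_over_random(scores, is_mutated, top_k):
    """Precision among the top_k highest-scoring positions, divided by the
    base rate of mutated positions (the expected precision of a random pick).

    This definition is an assumption; the metric used for the models in the
    abstract is not specified.
    """
    if len(scores) != len(is_mutated):
        raise ValueError("scores and is_mutated must have the same length")
    if not 0 < top_k <= len(scores):
        raise ValueError("top_k must be between 1 and the number of positions")
    base_rate = sum(is_mutated) / len(is_mutated)
    if base_rate == 0:
        raise ValueError("no mutated positions; fold change is undefined")
    # Indices of the top_k positions ranked by predicted score.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    precision = sum(is_mutated[i] for i in top) / top_k
    return precision / base_rate


# Ten hypothetical positions, two of which carry a mutation.
scores = [0.9, 0.1, 0.8, 0.2, 0.05, 0.3, 0.15, 0.02, 0.4, 0.01]
is_mutated = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
fold = fold_over_random(scores, is_mutated, top_k=2)
```

In this toy case both top-ranked positions are mutated while only 2 of 10 positions are mutated overall, so the score is five times what random assignment would be expected to achieve.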
We aim to boost performance further by using as a foundation a pretrained large language model, built with the genome itself as the language. This approach poses some interesting challenges distinct from “normal” natural language processing (NLP): for example, how should “words” be defined within the 3 billion As, Cs, Gs, and Ts? How can we extract the language rules we are learning?
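One simple answer to the question of genomic “words”, shown here only as an illustrative sketch and not as the tokenization used in this work, is to cut the sequence into k-mers, either overlapping or non-overlapping. The function names, the choice of k, and the toy sequence are assumptions.

```python
def kmer_tokens(sequence, k=6, stride=1):
    """Split a DNA sequence into k-mers.

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping ones.
    Trailing bases that do not fill a complete k-mer are dropped.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


# Toy sequence (illustrative only).
seq = "ACGTACGTAC"
overlapping = kmer_tokens(seq, k=3, stride=1)
non_overlapping = kmer_tokens(seq, k=3, stride=3)
vocab = build_vocab(overlapping)
token_ids = [vocab[t] for t in overlapping]
```

The trade-off is visible even at this scale: overlapping k-mers keep single-nucleotide resolution but produce highly redundant neighbouring tokens, whereas non-overlapping k-mers shorten the sequence at the cost of depending on where the reading frame starts.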
We add a new dimension to cancer genomics by building in silico experimental frameworks and asking basic research questions of large amounts of genomic data. We hope to develop this approach further to extract the as yet undiscovered information that genomes carry and to interrogate molecular mechanisms of mutagenesis. We also hope that our models will show the value of developing big-data-driven computational experimental frameworks that can stand in their own right as a complement to wet-lab experimental models.
Anna Poetsch moved to Dresden in July 2020. She spent her postdoctoral time at the Francis Crick Institute, with a placement at the Okinawa Institute of Science and Technology (OIST).
She did her PhD at the German Cancer Research Center (DKFZ) and her undergraduate training at the University of Konstanz, the Japanese National Cancer Center Research Institute, and ALTANA Pharma AG.
Anna’s background is in classical biochemistry/molecular biology, the DNA damage response, and mutations in cancer. Her interest in the associated processes has not changed, but her methodology has become increasingly computational, moving deeper and deeper into deep learning.