Title: Machine Learning for the Recognition of Molecules, Metabolites, and Chemical Reactions
Duration: 3 years
Research Area: Mathematical Foundations: Discrete Mathematics and Cheminformatics
Machine Learning methods that generate molecular candidates in areas such as Drug Discovery routinely suggest unrealistic molecules. We relate this to a fundamental theoretical problem: how to determine whether an arbitrary candidate molecular graph can in fact model a real molecule, or whether this is mathematically impossible. Given the complexity of the currently known chemical universe, this is not a simple statistical issue, and it requires Machine Learning techniques. We propose the Double Pushout approach to Graph Transformation as the basis for such a classification system, and we study its application as a novel approach to the in silico construction and exploration of chemical spaces.
We design our Machine Learning algorithms within the paradigm of Explainable Artificial Intelligence, so that they can report what they learn from data in an explicit, human-understandable way. Our goal is an integrated study of the capabilities of the Double Pushout approach combined with Explainable Machine Learning methods.
Retrosynthetic Analysis and Drug Discovery are two fields with important biochemical applications. Both have recently benefited from incorporating Machine Learning algorithms into their tools for in silico analysis. Most of these Machine Learning improvements use Graph Theory to build computational representations of the candidate molecules they produce, and these candidates need not, in principle, depict real molecules.
To obtain these candidates, such methods train on sets of real molecules retrieved from databases, expecting this alone to ensure the viability of their suggestions. In doing so, they ignore one of the fundamental problems that we want to address: how to determine whether an arbitrary chemical graph can actually be synthesized, or whether it is only an infeasible abstract construct. We consider that at the core of this issue lie the combinatorial constraints that Chemistry naturally imposes on the candidate graphs.
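As one illustration of such a combinatorial constraint, the following minimal Python sketch flags atoms in a candidate graph whose total bond order exceeds a tabulated valence. The valence table, the graph encoding, and the function name `valence_violations` are assumptions made for this example and are not taken from the project's code. A screen of this kind can only flag suspicious candidates; it does not settle whether a graph is chemically realizable.

```python
from collections import defaultdict

# Typical valences of a few neutral elements (illustrative table, an assumption).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}


def valence_violations(atoms, bonds):
    """Return the ids of atoms whose total bond order exceeds the table.

    atoms: dict mapping atom id -> element symbol
    bonds: iterable of (atom id, atom id, bond order)
    """
    used = defaultdict(int)
    for u, v, order in bonds:
        used[u] += order
        used[v] += order
    return sorted(a for a, element in atoms.items()
                  if used[a] > MAX_VALENCE[element])


# A methane-like graph: one carbon bonded to four hydrogens.
methane_atoms = {0: "C", 1: "H", 2: "H", 3: "H", 4: "H"}
methane_bonds = [(0, h, 1) for h in (1, 2, 3, 4)]

# The same graph with a fifth hydrogen attached to the carbon.
bad_atoms = dict(methane_atoms)
bad_atoms[5] = "H"
bad_bonds = methane_bonds + [(0, 5, 1)]

print(valence_violations(methane_atoms, methane_bonds))
print(valence_violations(bad_atoms, bad_bonds))
```

The first call reports no violations, while the second flags the carbon atom, since five single bonds exceed the tabulated valence of four.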
We have reached two main milestones so far. The first is the study and characterization of the equivalence between atom-to-atom maps [1, 2], which allows us to determine sets of mathematically “well-behaved” reactions and then to produce the “alignment of graphs” associated with such reactions. The second is a proof-of-concept implementation of this graph alignment in Python, described in a manuscript recently submitted for review [3, 4]. We plan to extend this implementation to C++ to study cases of greater complexity.
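To give a feeling for what comparing atom-to-atom maps involves, the sketch below computes the bond changes that a map induces and compares two maps by those changes. This is a simplified, label-based comparison chosen for illustration; it is not claimed to be the characterization given in [1, 2]. The graph encoding, the toy reaction, and the function name `bond_changes` are all assumptions.

```python
def bond_changes(reactant_bonds, product_bonds, atom_map):
    """Bonds whose order differs across the reaction, on reactant atom ids.

    reactant_bonds, product_bonds: dict mapping frozenset({u, v}) -> bond order
    atom_map: dict mapping each reactant atom id to a product atom id (bijection)
    """
    inverse = {p: r for r, p in atom_map.items()}
    # Re-express product bonds on reactant atom ids so both sides are comparable.
    pulled_back = {frozenset(inverse[a] for a in bond): order
                   for bond, order in product_bonds.items()}
    changes = {}
    for bond in set(reactant_bonds) | set(pulled_back):
        before = reactant_bonds.get(bond, 0)
        after = pulled_back.get(bond, 0)
        if before != after:
            changes[bond] = (before, after)
    return changes


# Toy reaction (hypothetical): atom 0 loses its bond to atom 3 and bonds atom 4.
reactant = {frozenset({0, 1}): 1, frozenset({0, 2}): 1, frozenset({0, 3}): 1}
product = {frozenset({10, 11}): 1, frozenset({10, 12}): 1, frozenset({10, 14}): 1}

map_a = {0: 10, 1: 11, 2: 12, 3: 13, 4: 14}
map_b = {0: 10, 1: 12, 2: 11, 3: 13, 4: 14}  # swaps two spectator atoms
map_c = {0: 10, 1: 13, 2: 12, 3: 11, 4: 14}  # a different atom leaves

changes_a = bond_changes(reactant, product, map_a)
changes_b = bond_changes(reactant, product, map_b)
changes_c = bond_changes(reactant, product, map_c)

print(changes_a == changes_b)
print(changes_a == changes_c)
```

Here `map_a` and `map_b` induce the same bond changes even though they differ as bijections, because they only permute atoms that keep their bonds. `map_c` breaks a different bond of atom 0, so this label-based comparison distinguishes it; a comparison up to graph isomorphism would be more permissive.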
All computational implementations of our methods are written in C++ and Python and rely on the open-source libraries available for these languages. All programs produced for this project are made public in GitHub repositories, which contain examples and instructions on how to run them.
The applications of our project are the automatic inference of Graph Transformation Rules and the exploration of Chemical Spaces with the inferred rules; see [5] for details.
Prof. Dr. Peter Stadler