Our group focuses on addressing the methodological challenges that have so far relegated autonomous discovery in the chemical sciences to proof-of-concept studies. We do this in order to:

  • (a) accelerate the discovery of novel functional molecules, materials, and processes;
  • (b) enhance understanding of the underlying chemical and physical processes that govern their behavior; and
  • (c) reallocate human creativity to improving the process of conducting research and leverage AI-driven platforms to execute on that process.

Current Directions

Scientific discovery is a problem of inference from incomplete and imperfect information, for which techniques in artificial intelligence are well suited. However, our current approach to molecular discovery has a number of bottlenecks, and overcoming them will require a number of methodological advances; a few of both are summarized below.

Develop domain-tailored neural models

The use of machine learning models for chemical discovery problems has overwhelmingly been limited to off-the-shelf models that do not readily transfer to this domain. Problems in chemistry and biology rarely have large quantities of clean, annotated data, and model inputs (e.g., molecules, proteins) are not inherently numeric or character-based as in more familiar image- or text-processing tasks. While statistical inference techniques have been successfully applied to self-contained chemical prediction tasks (e.g., developing structure-property relationship models, predicting links in knowledge graphs of biological activity), significant obstacles stand in the way of applying them more broadly, including molecular representation, model generalizability, and data availability. Our work in this area integrates domain expertise and prior knowledge into neural models to improve accuracy, uncertainty estimation, and generalization, especially in low-data environments. Molecular representations that go beyond 2D (">2D") and capture both stereochemistry and conformational flexibility are of particular interest. We are also developing new models for de novo molecular generation that incorporate considerations of synthetic feasibility and sample efficiency.
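To illustrate why a purely 2D description can fall short, the toy sketch below compares two mirror-image arrangements of a single stereocenter. It is not the group's method: the molecule (CHFClBr), the function names `connectivity` and `chirality_parity`, and the convention that neighbor order encodes spatial arrangement are all assumptions made for illustration. The sketch assumes the four neighbors carry distinct labels, and it does not address conformational flexibility.

```python
def connectivity(center, neighbors):
    """2D view: an unordered set of bonds, so neighbor order is discarded."""
    return frozenset(frozenset((center, n)) for n in neighbors)


def chirality_parity(neighbors):
    """Parity (0 = even, 1 = odd) of the neighbor ordering relative to
    alphabetical order. Assumes all neighbor labels are distinct."""
    reference = sorted(neighbors)
    ranks = [reference.index(n) for n in neighbors]
    inversions = sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )
    return inversions % 2


# Hypothetical encoding of a CHFClBr stereocenter: the tuple order stands in
# for the spatial arrangement of the four substituents around carbon.
isomer_a = ("Br", "Cl", "F", "H")
isomer_b = ("Br", "Cl", "H", "F")  # two neighbors swapped: the mirror image

# The bond sets are identical, so a connectivity-only model cannot tell
# the two arrangements apart.
same_graph = connectivity("C", isomer_a) == connectivity("C", isomer_b)

# The ordering parity differs, so a stereo-aware feature can.
parity_a = chirality_parity(isomer_a)
parity_b = chirality_parity(isomer_b)

print(same_graph, parity_a, parity_b)
```

Swapping any two neighbors flips the parity, while cyclically rotating three of them leaves it unchanged, which mirrors how tetrahedral chirality behaves. A practical representation would also need to handle equivalent neighbors, multiple stereocenters, and ensembles of conformers.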

Improve interpretability in statistical modelling

Using models in the decision-making process, with or without human supervision, requires trust. Building trust requires a combination of demonstrated value (accuracy) and insight into model reasoning (interpretability). For models to be accepted into discovery pipelines, we must be able to extract justifications for their predictions that are structured in a manner consistent with those of domain experts. We aim to bring such interpretability to our modeling efforts and move beyond black-box predictors, for example by grounding predictions of chemical reactivity in physical organic chemistry principles.

Understand and expand our synthetic chemistry toolbox

Computational assistance is only useful insofar as it can produce actionable suggestions for experimental validation. The synthesis and testing of new molecules is a common requirement for validation of functional physical matter. Despite significant progress in methodology development, few reactions have achieved widespread use as part of the synthetic chemistry toolbox. Newly discovered reactions may be too narrowly applicable or require excessive optimization to be practical. Overcoming both of these barriers to adoption requires an expansion of substrate scope and straightforward guidelines for how the reaction should be carried out to maximize yield, selectivity, and similar outcomes. We approach this problem from a statistical modelling perspective, leveraging the collective chemical knowledge reflected in the patent and journal literature to identify robust synthetic and biological transformations, including enzyme catalysis. This builds on our ongoing efforts in predictive chemistry, which encompasses computer-aided synthesis planning and related tasks.

Rapidly explore complex design spaces by aligning computational and experimental platforms

As computational design strategies are increasingly able to make use of available information to propose new hypotheses, we are increasingly limited by our ability to validate them. Current platform technologies for synthesis and screening are often misaligned with rational design approaches; the next generation of laboratory hardware should be specifically tailored to supply the data required by the current and future state-of-the-art machine learning and computational discovery algorithms. We customize our own automated laboratory platforms for the rapid generation of high-fidelity experimental data suitable for training and validation.