Our group focuses on addressing the methodological challenges that have so far relegated autonomous discovery in the chemical sciences to proof-of-concept studies. We work on both computational and experimental techniques (with a strong emphasis on the former) to enable autonomous molecular discovery at the intersection of chemistry, data science, machine learning, and laboratory automation with an emphasis on small molecule drug discovery. Our long-term objectives are to:
- (a) accelerate the discovery of novel functional molecules, materials, and synthetic processes;
- (b) enhance understanding of the underlying chemical and physical processes that govern their behavior; and
- (c) reallocate human creativity to improving the process of conducting research and leverage AI-driven platforms to execute on that process.
The identification and synthesis of molecules that exhibit a desired function is an essential part of addressing contemporary problems in science and technology. Small molecules are the predominant modality for medicines, chemical probes, organocatalysts, and specialty monomers, among others. They are typically discovered through an iterative process of designing candidate compounds, synthesizing them, and testing their performance, where each repeat of the cycle requires weeks or months. The rate at which this process yields successful compounds can be limited by bottlenecks and mispredictions at all three stages. It is also plagued by inefficiencies: underutilization of available data, which leads to inadequate predictions of compound performance; compound selection based on intuition or synthetic ease rather than information content; and frequent manual intervention subject to human bias. For example, the hit-to-lead and lead optimization stages of small molecule drug discovery, while only part of the overall pipeline, require several years and millions of dollars for each clinical candidate. Even longer timescales (typically >10 years) are required to bring a new material to market.
Scientific discovery is a problem of inference from incomplete and imperfect information, for which techniques in artificial intelligence are well-suited. However, there are a number of bottlenecks in our current approach to molecular discovery (a few are summarized below in red); overcoming them will require a number of methodological advances (a few are summarized below in blue).
For a comprehensive overview of autonomous discovery in the chemical sciences, including a discussion of key challenges, please read our 2019 review article and perspective in Angewandte Chemie or on arXiv: part one, part two.
Domain-tailored neural models
The use of machine learning models for chemical discovery problems has overwhelmingly been limited to off-the-shelf models that do not readily transfer to this domain. Problems in chemistry and biology rarely have large quantities of clean, annotated data, and the structure of model inputs (e.g., molecules, proteins) is not inherently numeric or character-based as in more familiar image- or text-processing tasks. While statistical inference techniques have been successfully applied to self-contained chemical prediction tasks (e.g., developing structure-property relationship models, predicting links in knowledge graphs of biological activity), there are significant obstacles to applying them more broadly, related to molecular representation, model generalizability, and data availability, among others. Our work in this area involves the integration of domain expertise and prior knowledge into neural models to improve accuracy, uncertainty estimation, and generalization, especially in low-data environments. Molecular representations beyond 2D that capture both stereochemistry and conformational flexibility are of particular interest. Beyond small molecules, we also work on representation learning for molecular ensembles (e.g., polymers) and porous materials (e.g., zeolites) in collaborative work.
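As a small illustration of what uncertainty estimation in a low-data setting can look like, the sketch below uses a bootstrap ensemble of simple linear fits and treats the spread of the ensemble's predictions as an uncertainty estimate. This is a generic textbook technique on made-up one-dimensional data, not a description of the group's models; the function names, the synthetic data, and the ensemble size are all assumptions chosen for illustration.

```python
import random
import statistics


def fit_line(points):
    """Ordinary least-squares fit of y = a + b * x to a list of (x, y) pairs."""
    xs, ys = zip(*points)
    x_mean, y_mean = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def bootstrap_ensemble(points, n_models=50, seed=0):
    """Fit one line per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    return [fit_line(rng.choices(points, k=len(points))) for _ in range(n_models)]


def predict_with_uncertainty(ensemble, x):
    """Return the ensemble mean and the standard deviation across members."""
    preds = [a + b * x for a, b in ensemble]
    return statistics.fmean(preds), statistics.stdev(preds)


# Hypothetical small training set: 20 noisy points with x in [0, 1].
data_rng = random.Random(1)
train = [(i / 19, 2.0 * (i / 19) + data_rng.gauss(0.0, 0.1)) for i in range(20)]

ensemble = bootstrap_ensemble(train)
mean_in, std_in = predict_with_uncertainty(ensemble, 0.5)    # inside the data range
mean_out, std_out = predict_with_uncertainty(ensemble, 5.0)  # far outside it
print(f"x=0.5: {mean_in:.2f} +/- {std_in:.3f}")
print(f"x=5.0: {mean_out:.2f} +/- {std_out:.3f}")
```

The ensemble spread is larger for the query far from the training data, which is the qualitative behavior one wants from an uncertainty estimate when a model is asked to extrapolate.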
- Adams, K., Pattanaik, L., Coley, C. W. Learning 3D representations of molecular chirality with invariance to bond rotations. ICLR (2022).
- Pattanaik, L., Ganea, O. E., Coley, I., Jensen, K. F., Green, W. H., Coley, C. W. Message passing networks for molecules with tetrahedral chirality. NeurIPS ML4Molecules, arxiv:2012.00094. (2020).
- Zhang, X., Wang, L., Helwig, J., Luo, Y., Fu, C., Xie, Y., …Adams, K., …, Coley, C.W., Qian, X., Qian, X., Smidt, T., Ji, S. “Artificial intelligence for science: Quantum, atomistic, and continuum systems” arxiv: 2307.08423 (2023).
- Soleimany, A. P.*, Amini, A.*, Goldman, S.*, Rus, D., Bhatia, S., Coley, C. W. Evidential deep learning for guided molecular property prediction and discovery NeurIPS ML4Molecules (2020) & ACS Cent. Sci. 7(8) 1356-1367 (2021).
- Hirschfeld, L., Swanson, K., Yang, K., Barzilay, R., Coley, C. W. Uncertainty quantification using neural networks for molecular property prediction. J. Chem. Inf. Model. 60(8), 3770-3780 (2020).
Computer-aided molecular design
One of the most important questions one must answer when searching for a molecule that achieves a certain property profile is ‘what should we make next?’. Molecules are generally proposed through one of two complementary approaches: virtual screening, where one has a fixed list of candidates, and generative modeling, where one uses algorithms to propose novel molecular structures. Exhaustive virtual screening with a computational oracle (e.g., docking for structure-based drug design) is a viable approach to hit finding, but is not straightforward to apply to the tens of billions of molecules in modern virtual libraries. We work on various methods for selecting and proposing molecular structures during iterative rounds of optimization, both from discrete libraries and using generative models. The former uses techniques in model-guided optimization (Bayesian optimization) to navigate large discrete design spaces of candidate molecular structures. For the latter, a thread of particular interest is the development of synthesizability-constrained generative models that propose molecular structures that are synthetically accessible. Sample efficiency is also a primary concern for these models, as in silico benchmarks may involve the ‘testing’ of hundreds of thousands of molecules, which is severely misaligned with what is practical experimentally.
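The idea of model-guided selection from a discrete library can be sketched as a pool-based loop: score a small random batch with the expensive oracle, fit a cheap surrogate, and use the surrogate to choose the next batch. The sketch below is a deliberately simplified stand-in, assuming a one-dimensional "descriptor" per candidate, a toy oracle in place of docking, a nearest-neighbor surrogate, and a greedy acquisition rule rather than a full Bayesian optimization acquisition function. All names and parameters are hypothetical.

```python
import random


def expensive_oracle(x):
    """Hypothetical stand-in for a costly scoring function such as docking.

    Higher is better; the best possible score (0.0) is at x = 0.7.
    """
    return -((x - 0.7) ** 2)


def knn_predict(x, labeled, k=3):
    """Cheap surrogate: mean score of the k nearest already-scored candidates."""
    nearest = sorted(labeled, key=lambda item: abs(item[0] - x))[:k]
    return sum(score for _, score in nearest) / len(nearest)


def active_screen(pool, oracle, n_init=10, batch_size=10, n_rounds=5, seed=0):
    """Score only a small, model-selected fraction of the pool."""
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)

    # Round 0: a random batch to give the surrogate something to learn from.
    labeled = [(x, oracle(x)) for x in unlabeled[:n_init]]
    unlabeled = unlabeled[n_init:]

    for _ in range(n_rounds):
        # Greedy acquisition: pick the candidates the surrogate ranks highest.
        unlabeled.sort(key=lambda x: knn_predict(x, labeled), reverse=True)
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        labeled.extend((x, oracle(x)) for x in batch)

    return labeled


# A toy "library" of 1000 candidates, each described by one number in [0, 1].
pool = [i / 999 for i in range(1000)]
labeled = active_screen(pool, expensive_oracle)
best_x, best_score = max(labeled, key=lambda item: item[1])
print(f"scored {len(labeled)} of {len(pool)} candidates; best x = {best_x:.3f}")
```

Only 60 of the 1000 candidates are ever passed to the oracle here. A greedy rule exploits the surrogate without accounting for its uncertainty; acquisition functions that also reward uncertain candidates are the usual refinement.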
- Graff, D. E., Shakhnovich, E. I., Coley, C. W. Accelerating high-throughput virtual screening through molecular pool-based active learning. Chem. Sci. 12, 7866-7881 (2021).
- Graff, D. E., Aldeghi, M., Marrone, J. A., Jordan, K. E., Pyzer-Knapp, E. O., Coley, C. W. “Self-focusing virtual screening with active design space pruning” arxiv: 01753 & J. Chem. Inf. Model. (2022).
- Gao, W., Coley, C. W. The synthesizability of molecules proposed by generative models. J. Chem. Inf. Model. 60(12) 5714–5723 (2020).
- Gao, W., Mercado, R., Coley, C. W. Amortized tree generation for bottom-up synthesis planning and synthesizable molecular design. ICLR [Spotlight] (2022).
- Gao, W., Fu, T., Sun, J., Coley, C. W. “Sample efficiency matters: A benchmark for practical molecular optimization” arxiv: 2206.12411 & NeurIPS (2022).
- Coley, C. W. Defining and exploring chemical spaces. Trends in Chemistry DOI: 10.1016/j.trechm.2020.11.004 (2020).
Data-driven predictive chemistry
Computational assistance is only useful insofar as it can produce actionable suggestions for experimental validation. The synthesis and testing of new molecules is a common requirement for validation of functional physical matter. Despite significant progress in methodology development, few reactions have achieved widespread use as part of the synthetic chemistry toolbox. Newly discovered reactions may be too narrowly applicable or require excessive optimization to be practical. Overcoming both of these barriers to adoption requires an expansion of substrate scope and straightforward guidelines for how the reaction should be carried out to achieve maximal yield, selectivity, etc. We approach this problem from a statistical modelling perspective, leveraging the collective chemical knowledge reflected in the patent and journal literature to identify robust synthetic and biological transformations, including enzyme catalysis. This builds on our ongoing efforts in ‘predictive chemistry’, which encompasses computer-aided synthesis planning, reaction outcome prediction, and related tasks. Our work in this area also explores the intersection of structure-based representation learning and descriptor-based approaches that have become prevalent in physical organic chemistry. A particular challenge is transforming these methods from being qualitative to quantitative, and retrospective to prospective.
- Tu, Z., Stuyver, T., Coley, C. W. “Predictive chemistry: Machine learning for reaction deployment, reaction development, and reaction discovery” Chem. Sci. 14, 226-244 (2023).
- Levin, I., Fortunato, M.E., Tan, K.L., Coley, C.W. “Computer-aided evaluation and exploration of chemical spaces constrained by reaction pathways” AIChE J. DOI: 10.1002/aic.18234 (2023).
- Stuyver, T., Coley, C. W. Quantum chemistry-augmented neural networks for reactivity prediction: Performance, generalizability and interpretability. J. Chem. Phys. 156, 084104 (2022).
- Tu, Z., Coley, C. W. “Permutation invariant graph-to-sequence model for template-free retrosynthesis and reaction prediction” arxiv: 09681 (2021) & J. Chem. Inf. Model. 62(15) 3503-3513 (2022).
- Goldman, S., Das, R., Yang, K. K., Coley, C. W. Machine learning modeling of family wide enzyme-substrate specificity screens. PLOS Comp. Bio. (2022).
- Lin, M.-H., Tu, Z., Coley, C. W. Improving the performance of models for one-step retrosynthesis through re-ranking. J. Cheminform. 14(15) (2022).
Synthetic chemistry automation
As computational design strategies are increasingly able to make use of available information to propose new hypotheses, we are increasingly limited by our ability to validate them. More bluntly, synthesis is the bottleneck for many discovery workflows. Current platform technologies for synthesis and screening are often misaligned with rational design approaches; the next generation of laboratory hardware should be specifically tailored to supply the data required by the current and future state-of-the-art machine learning and computational discovery algorithms. We customize our own automated laboratory platforms for the rapid generation of high-fidelity experimental data suitable for training and validation. Our intent is to integrate automated laboratory hardware with our computational models to autonomously conduct reaction and molecule screening campaigns, inching closer to a true ‘closed-loop’ vision beyond the proofs of concept the field has seen to date. Aspects of this work include techniques for structural elucidation, with application to both reaction screening and metabolomics workflows, and an investigation of how synthetic cost considerations should be factored into batched molecular design to maximize information gain relative to experimental effort.
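One very simple way to picture how synthetic cost might enter batch selection is a greedy rule that ranks candidates by expected value per unit cost and fills a fixed budget. The sketch below is a toy heuristic under assumed inputs, not the group's method: the `value` and `cost` fields, the candidate names, and the budget are hypothetical, and real synthetic cost is not additive when candidates share intermediates or reaction steps.

```python
def select_batch(candidates, budget):
    """Greedily pick candidates by expected value per unit synthetic cost.

    Each candidate is a dict with hypothetical keys:
      "name"  - identifier
      "value" - expected information gain or utility (arbitrary units)
      "cost"  - estimated synthetic effort (arbitrary units, must be > 0)
    """
    ranked = sorted(candidates, key=lambda c: c["value"] / c["cost"], reverse=True)
    batch, spent = [], 0.0
    for cand in ranked:
        # Skip anything that would exceed the remaining budget.
        if spent + cand["cost"] <= budget:
            batch.append(cand["name"])
            spent += cand["cost"]
    return batch, spent


candidates = [
    {"name": "A", "value": 0.9, "cost": 5.0},  # most valuable, hardest to make
    {"name": "B", "value": 0.6, "cost": 1.0},
    {"name": "C", "value": 0.5, "cost": 1.0},
    {"name": "D", "value": 0.4, "cost": 2.0},
]

batch, spent = select_batch(candidates, budget=5.0)
print(batch, spent)  # → ['B', 'C', 'D'] 4.0
```

With a budget of 5.0, spending everything on the single most valuable compound A yields a total value of 0.9, whereas the cost-aware batch B, C, D yields 1.5 for a cost of 4.0. That trade-off between information and experimental effort is the point of the illustration.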
- Gao, W., Raghavan, P., Coley, C. W. Autonomous platforms for data-driven organic synthesis. Nat. Commun. 13, 1075 (2022).
- Wierenga, R.P., Golas, S., Ho, W., Coley, C.W., Esvelt, K.M. “PyLabRobot: An open-source hardware agnostic interface for liquid-handling robots and accessories” bioRxiv: 2023.07.10.547733 (2023).
- More soon!
Data sharing and standardization
The development of data-driven methods relies on data. At present, most procedural details about chemical reactions are reported in unstructured supporting information documents as Word files exported to PDFs. Current databasing efforts do not fully capture this information, including even basic details such as quantitative amounts. The state of data sharing for other molecular property prediction tasks is better, but efforts to standardize data splits, evaluations, and pre-processing pipelines are essential. In addition to our commitment to developing open source software for the research threads listed above, we help lead the Open Reaction Database and contribute to the Therapeutics Data Commons.
- Mercado, R., Kearnes, S.M., Coley, C.W. “Data sharing in chemistry: Lessons learned and a case for mandating structured reaction data” J. Chem. Inf. Model. DOI: 10.1021/acs.jcim.3c00607 (2023).
- Kearnes, S. M., Maser, M. R., Wleklinski, M., Kast, A., Doyle, A. G., Dreher, S. D., Hawkins, J. M., Jensen, K. F., Coley, C. W. The Open Reaction Database. J. Am. Chem. Soc. DOI: 10.1021/jacs.1c09820 (2021).
- Maloney, M.P., Coley, C.W., Genheden, S., Carson, N., Helquist, P., Norrby, P.-O., Wiest, O. “Negative data in data sets for machine learning training” Org. Lett. & J. Org. Chem. (2023).
- Huang, K., Fu, T., Gao, W., Zhao, Y., Roohani, Y., Leskovec, J., Coley, C. W., Xiao, C., Sun, J., Zitnik, M. Therapeutics Data Commons: Machine learning datasets and tasks for therapeutics. arXiv:2102.09548 & NeurIPS (2021).