Software

We are committed to providing open-source software and tools for the scientific community. Our group's GitHub organization can be found here, and we highlight a number of software tools below.


Research Themes: molecular representation, design and optimization, predictive chemistry, metabolomics, automation, data

ASKCOS

Web App: https://askcos.mit.edu
Documentation: https://askcos-docs.mit.edu/
GitLab: https://gitlab.com/mlpds_mit/askcosv2


ASKCOS (Automated System for Knowledge-based Continuous Organic Synthesis) is an open-source software suite hosting several data-driven chemistry tools for synthesis planning. It has been used by 35,000+ chemists and is deployed at 15+ pharmaceutical and chemical companies. At the above link, we host a free-to-use web app deployment of ASKCOS.

predictive chemistry

FlowER

GitHub: https://github.com/FongMunHong/FlowER


FlowER uses generative flow matching to model chemical reactions as a process of electron redistribution, conceptually aligning with arrow-pushing formalisms. It aims to capture the probabilistic nature of reactions while conserving mass, with multiple outcomes reached through branching mechanistic networks that evolve in time.

molecular representation, predictive chemistry

ICEBERG

GitHub: https://github.com/coleygroup/ms-pred


ICEBERG is a geometric deep learning model that simulates collision-induced dissociation in mass spectrometry to generate chemically plausible fragments and their relative intensities with awareness of collision energies and polarities. We utilize ICEBERG predictions to facilitate structure elucidation by ranking a set of candidate structures based on the similarity between their predicted in silico MS/MS spectra and an experimental MS/MS spectrum of interest.
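The candidate-ranking step described above can be illustrated with a minimal sketch. It assumes spectra are given as lists of (m/z, intensity) pairs and uses binned cosine similarity as the comparison; the function names, bin width, and toy peak lists are hypothetical and are not taken from the ms-pred repository, which may use a different similarity measure.

```python
from math import sqrt


def bin_spectrum(peaks, bin_width=0.01):
    """Sum the intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(spec_a, spec_b, bin_width=0.01):
    """Cosine similarity between two binned spectra (0.0 if either is empty)."""
    a = bin_spectrum(spec_a, bin_width)
    b = bin_spectrum(spec_b, bin_width)
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(predicted, experimental):
    """Sort candidate names by similarity of their predicted spectrum, best first."""
    scores = {
        name: cosine_similarity(spectrum, experimental)
        for name, spectrum in predicted.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (made up): one experimental spectrum and two predicted spectra.
experimental = [(91.05, 100.0), (119.05, 40.0), (147.04, 10.0)]
predicted = {
    "candidate_A": [(91.05, 90.0), (119.05, 50.0)],
    "candidate_B": [(77.04, 100.0), (105.03, 30.0)],
}
ranking = rank_candidates(predicted, experimental)
print(ranking)
```

In this toy case candidate_A shares two peaks with the experimental spectrum and ranks first, while candidate_B shares none and scores zero.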

metabolomics

DiffMS

GitHub: https://github.com/coleygroup/diffms


DiffMS is a generative discrete graph diffusion model for structure elucidation from mass spectra. It is a formula-restricted encoder-decoder generative network that achieves state-of-the-art performance on this task. The encoder utilizes a transformer architecture and models mass spectra domain knowledge such as peak formulae and neutral losses, and the decoder is a discrete graph diffusion model restricted by the heavy-atom composition of a known chemical formula.

metabolomics, molecular representation

MARASON

GitHub: https://github.com/coleygroup/ms-pred


MARASON is a retrieval-augmented mass spectrum simulator with neural graph matching. It retrieves reference structures and spectra from a reference library based on their similarity to the target structure, then uses neural graph matching to enhance a fragmentation-based neural network, improving retrieval-augmented generation.

metabolomics

MIST-CF

GitHub: https://github.com/samgoldman97/mist-cf


MIST-CF (Metabolite Inference with Spectrum Transformers - Chemical Formula) is an extension of the MIST framework that infers precursor chemical formulae directly from tandem mass spectra in a de novo setting. Using a transformer-based energy model, it ranks candidate formulae without requiring reference spectral databases.

metabolomics

Higher-Level Retrosynthesis

GitHub: https://github.com/jihye-roh/higherlev_retro


Higher-Level Retrosynthesis introduces a novel framework for computer-aided retrosynthesis that focuses on the broader retrosynthetic strategy. We abstract the detailed substructures in the intermediates of a synthetic pathway that do not appear in the target product, allowing our algorithm to emphasize the general, higher-level strategies instead of considering the specifics of chemically-equivalent functional groups.

predictive chemistry

SynFormer

GitHub: https://github.com/coleygroup/synformer


SynFormer is a generative artificial intelligence model for navigating synthesizable chemical space. It is designed to generate molecules along with their synthetic routes in a scalable manner. It enables (1) local chemical space exploration, where the model generates synthesizable analogs of a reference molecule, and (2) global chemical space exploration, where the model aims to identify optimal molecules according to a black-box property prediction oracle.

design and optimization, molecular representation

ShEPhERD

GitHub: https://github.com/coleygroup/shepherd


ShEPhERD is a diffusion model that generates molecules conditioned on explicit representations of shape, electrostatics, and pharmacophores. It is applicable to bioisosteric drug design tasks such as natural product ligand hopping, bioactive hit diversification, and bioisosteric fragment merging.

molecular representation, design and optimization

RDChiral

GitHub: https://github.com/connorcoley/rdchiral


RDChiral is a wrapper around RDKit’s RunReactants that improves stereochemistry handling.

molecular representation, predictive chemistry

RDCanon

GitHub: https://github.com/coleygroup/rdcanon


RDCanon is a package for canonicalizing SMARTS and Reaction SMARTS templates. It reorders SMARTS to optimize query speed, for example in database searches. The optimization is invariant to atom mapping.

data, predictive chemistry

Virtual Flask

GitHub: https://github.com/coleygroup/virtual_flask


The Virtual Flask is a cheminformatics application that generates reaction networks from a set of initial molecules and a corpus of templates, enabling reaction prediction. A suite of supporting tools facilitates the analysis and visualization of the generated chemical reaction data. The repository's guide provides detailed instructions for setting up and running the Virtual Flask locally for development purposes. Both the Corpus and Analysis dashboards can be hosted locally and run simultaneously.

automation, predictive chemistry

SPARROW

GitHub: https://github.com/coleygroup/sparrow


SPARROW is a workflow that simultaneously selects molecules and their synthetic routes for lead optimization and design-make-test cycles. This optimization approach aims to minimize synthetic cost while selecting the molecules that are most likely to fulfill design constraints. More details about SPARROW can be found in our preprint.

design and optimization, predictive chemistry

qPO

GitHub: https://github.com/jenna-fromer/qPO


qPO (multipoint Probability of Optimality) is an acquisition strategy for discrete optimization that is motivated by pure exploitation. It maximizes the probability that the batch includes the true optimum. This probability is expressed as a sum over individual acquisition scores, which circumvents the combinatorial challenge of optimizing a batch acquisition function.
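The sum-over-scores idea can be sketched as follows. Because "candidate i is the optimum" is a mutually exclusive event across candidates, the probability that a batch contains the optimum is the sum of the per-candidate probabilities, so the best batch is simply the top-scoring candidates. This is a minimal Monte Carlo illustration, not the qPO repository's implementation: the function names are hypothetical, and the independent Gaussian posterior is a simplifying assumption made only to produce sample draws.

```python
import random


def probability_of_optimality(posterior_samples):
    """Estimate P(candidate i is the optimum) from joint posterior draws.

    posterior_samples is a list of draws; each draw holds one sampled
    objective value per candidate (higher is better).
    """
    n_candidates = len(posterior_samples[0])
    counts = [0] * n_candidates
    for draw in posterior_samples:
        best = max(range(n_candidates), key=draw.__getitem__)
        counts[best] += 1
    return [count / len(posterior_samples) for count in counts]


def select_batch(scores, batch_size):
    """Pick the highest-scoring candidates; their scores sum to the batch score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    batch = order[:batch_size]
    return batch, sum(scores[i] for i in batch)


# Toy posterior (assumption): independent Gaussians over five candidates.
rng = random.Random(0)
means = [0.0, 0.5, 1.0, 1.2, 0.2]
stds = [1.0, 0.1, 0.3, 0.8, 0.1]
samples = [
    [rng.gauss(mean, std) for mean, std in zip(means, stds)]
    for _ in range(2000)
]
scores = probability_of_optimality(samples)
batch, batch_probability = select_batch(scores, batch_size=2)
print(batch, round(batch_probability, 3))
```

No search over subsets is needed: sorting the individual scores once yields the batch that maximizes the estimated probability of containing the optimum.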

predictive chemistry, design and optimization, automation

EASIE

GitHub: https://github.com/itai-levin/easie


EASIE (Exploration of chemical Analog Space, Implicitly and Explicitly) is a computational workflow to (1) score and select retrosynthetic pathways on the basis of perceived route diversifiability, (2) estimate property distributions of the resulting enumerated space to inform or constrain reactant selection, and (3) enumerate synthetic pathway-constrained analogs via selection of suitable alternative reactants.

predictive chemistry

PMO

GitHub: https://github.com/wenhao-gao/mol_opt


PMO (Practical Molecular Optimization) is an open-source benchmark to facilitate the transparent and reproducible evaluation of algorithmic advances in molecular optimization. This repository supports 25 molecular design algorithms on 23 tasks with a particular focus on sample efficiency (the number of oracle calls).
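To make the notion of sample efficiency concrete, the sketch below wraps a scoring function so that unique oracle calls are counted against a fixed budget and repeated queries are served from a cache. The class name, the budget, and the toy scoring function are all hypothetical and are not the mol_opt API; they only illustrate the bookkeeping involved in comparing optimizers under an oracle-call limit.

```python
class BudgetedOracle:
    """Wrap a scoring function, cache results, and count unique oracle calls."""

    def __init__(self, score_fn, max_calls):
        self.score_fn = score_fn
        self.max_calls = max_calls
        self.cache = {}

    @property
    def calls(self):
        """Number of unique molecules scored so far."""
        return len(self.cache)

    def __call__(self, molecule):
        if molecule in self.cache:
            return self.cache[molecule]  # repeated queries do not use budget
        if self.calls >= self.max_calls:
            raise RuntimeError("oracle budget exhausted")
        self.cache[molecule] = self.score_fn(molecule)
        return self.cache[molecule]

    def top_k_average(self, k):
        """Mean of the k best scores seen so far (requires at least one call)."""
        best = sorted(self.cache.values(), reverse=True)[:k]
        return sum(best) / len(best)


def toy_score(smiles):
    """Made-up objective: number of 'C' characters in the SMILES string."""
    return smiles.count("C")


oracle = BudgetedOracle(toy_score, max_calls=3)
for smiles in ["CCO", "CCC", "CCO", "C"]:
    oracle(smiles)
print(oracle.calls, oracle.top_k_average(2))  # prints: 3 2.5
```

An optimizer evaluated this way is stopped once the budget is spent, so algorithms are compared by what they find within the same number of oracle calls.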

design and optimization

PyScreener

GitHub: https://github.com/coleygroup/pyscreener


PyScreener is a Pythonic interface to high-throughput virtual screening software. It was designed to have a minimal interface, on the principle that a high-throughput virtual screen is intended to be a broad-strokes technique to gauge ligand favorability. It supports Vina, Quick-Vina, Smina, Psovina, and DOCK6.

automation, molecular representation

MolPAL

GitHub: https://github.com/coleygroup/molpal


MolPAL is software for the accelerated discovery of compounds in high-throughput virtual screening environments. It applies Bayesian optimization to docking-based virtual screening, decreasing the computational cost of using docking to identify the majority of top-scoring compounds in virtual libraries by over an order of magnitude. Surrogate machine learning models prioritize the screening of molecules that are predicted to have better docking scores.
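The iterative loop behind this kind of surrogate-guided screen can be sketched with a toy example. Everything here is an assumption for illustration and not MolPAL's code: each "molecule" is a single number, docking_score is a hypothetical stand-in for an expensive docking run, the surrogate is a simple k-nearest-neighbor average rather than a trained machine learning model, and the acquisition is purely greedy.

```python
import random


def docking_score(x):
    """Hypothetical stand-in for an expensive docking run (lower is better)."""
    return (x - 0.7) ** 2


def knn_predict(x, observed, k=3):
    """Toy surrogate: mean score of the k nearest already-scored molecules."""
    nearest = sorted(observed, key=lambda seen: abs(seen - x))[:k]
    return sum(observed[seen] for seen in nearest) / len(nearest)


def screen(library, init_size, batch_size, iterations, seed=0):
    """Score a random initial set, then repeatedly score the predicted-best batch."""
    rng = random.Random(seed)
    observed = {x: docking_score(x) for x in rng.sample(library, init_size)}
    for _ in range(iterations):
        pool = [x for x in library if x not in observed]
        # Greedy acquisition: rank unscored molecules by surrogate prediction.
        pool.sort(key=lambda x: knn_predict(x, observed))
        for x in pool[:batch_size]:
            observed[x] = docking_score(x)
    return observed


# Toy library of 1000 "molecules"; only 100 of them are ever docked.
library = [i / 999 for i in range(1000)]
observed = screen(library, init_size=20, batch_size=20, iterations=4)
best_x = min(observed, key=observed.get)
print(len(observed), best_x, observed[best_x])
```

The point of the design is that the expensive function is called on a small fraction of the library, with each round's results refining which molecules are scored next.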

design and optimization

Graph2SMILES

GitHub: https://github.com/coleygroup/Graph2SMILES


Graph2SMILES is a template-free graph-to-sequence model that encodes an input molecular graph and decodes the output as a SMILES string. It can be applied to both reaction outcome prediction and one-step retrosynthesis.

predictive chemistry

ROGI (-XD)

GitHub (ROGI): https://github.com/coleygroup/rogi
GitHub (ROGI-XD): https://github.com/coleygroup/rogi-xd


ROGI is a quantitative metric for describing the roughness of molecular property landscapes, inspired by fractal dimension, which helps characterize the presence of activity cliffs and correlates with the out-of-sample error of machine learning models in molecular property prediction. ROGI-XD extends this framework to evaluate the roughness of learned representations from chemical foundation models, enabling assessment of how well these models capture the geometric complexity of molecular property landscapes.

molecular representation

Polymer Chemprop

GitHub: https://github.com/coleygroup/polymer-chemprop


Polymer Chemprop adapts the Chemprop message-passing neural network architecture for predicting polymer properties. It supports training and evaluation workflows tailored to polymer datasets, enabling accurate property prediction and materials design.

molecular representation, predictive chemistry

Library Synthesis Tools

GitHub: https://github.com/qai222/libsyn_tools


libsyn_tools is a set of Python utilities for designing and analyzing library synthesis campaigns, including enumeration, plate mapping, and data processing workflows.

automation, design and optimization

Pipette Scheduling

GitHub: https://github.com/wuRoy/CVRP_pipette_scheduling


CVRP Pipette Scheduling formulates high-throughput liquid-handling operations as a capacitated vehicle routing problem to minimize execution time on multi-channel pipetting robots.

automation

QM-augmented GNN

GitHub: https://github.com/coleygroup/QM-augmented_GNN


QM-augmented GNN integrates quantum-mechanical descriptors with graph neural networks to improve molecular property prediction and reaction-barrier estimation. The repository provides code to featurize molecules with QM calculations and train augmented models.

molecular representation, predictive chemistry

SCARF

GitHub: https://github.com/samgoldman97/ms-pred


SCARF (Prefix-Tree Decoding for Predicting Mass Spectra from Molecules) uses an intermediate strategy for predicting mass spectra from molecules by treating mass spectra as sets of molecular formulae. After first encoding an input molecular graph, we decode a set of molecular subformulae, each of which specifies a predicted peak in the mass spectrum; the peak intensities are predicted by a second model. The model has been deprecated in favor of a new model, “ICEBERG”.

metabolomics

SynNet

GitHub: https://github.com/wenhao-gao/SynNet


SynNet is an amortized approach to generating synthetic pathways as a Markov decision process conditioned on a target molecular embedding. This approach allows us to conduct synthesis planning in a bottom-up manner and design synthesizable molecules by decoding from optimized conditional codes. The approach leverages neural networks to probabilistically model the synthetic trees, one reaction step at a time, according to reactivity rules encoded in a discrete action space of reaction templates. The model has been deprecated in favor of a new model, “SynFormer”.

design and optimization, molecular representation

SQUID

GitHub: https://github.com/keiradams/SQUID


SQUID is a multimodal 3D generative model that enables shape-conditioned 3D molecular design by equivariantly encoding molecular shape and variationally encoding chemical identity. We ensure local geometric and chemical validity of generated molecules by using autoregressive fragment-based generation with heuristic bonding geometries, allowing the model to prioritize the scoring of rotatable bonds to best align the growing conformational structure to the target shape. The model has been deprecated in favor of a new model, “ShEPhERD”.

molecular representation, design and optimization


Open software that we contribute to

Open Reaction Database

Website: https://open-reaction-database.org
GitHub: https://github.com/Open-Reaction-Database


The Open Reaction Database (ORD) is a community-driven repository and schema for storing and sharing machine-readable organic reaction data to accelerate synthesis planning, reaction prediction, and experiment design.

data, predictive chemistry

ChemProp2

GitHub: https://github.com/chemprop/chemprop


ChemProp v2 is a message-passing neural network framework for molecular property prediction that supports modern architectures, uncertainty estimation, and interpretability for applications in drug discovery and materials science.

molecular representation, predictive chemistry

Therapeutics Data Commons (TDC)

Website: https://tdcommons.ai
GitHub: https://github.com/mims-harvard/TDC


TDC is an open-science platform providing machine-learning-ready datasets, benchmarks, and leaderboards covering diverse therapeutic modalities and stages of drug discovery, facilitating reproducible and scalable AI research.

data, predictive chemistry, design and optimization

OpenChemIE

GitHub: https://github.com/CrystalEye42/OpenChemIE


OpenChemIE is an open-source toolkit for chemistry information extraction, providing models and pipelines to automatically identify and convert chemical structures, reactions, and entities from text and figures in scientific literature.

data, molecular representation