Related papers: Holistic chemical evaluation reveals pitfalls in reaction prediction models

Holistic chemical evaluation reveals pitfalls in reaction prediction models

URL: http://arxiv.org/abs/2312.09004v1
Date: Thu, 14 Dec 2023 14:54:28 GMT
Title: Holistic chemical evaluation reveals pitfalls in reaction prediction models
Authors: Victor Sabanza Gil, Andres M. Bran, Malte Franke, Remi Schlama, Jeremy S. Luterbacher, Philippe Schwaller
Abstract summary: We propose a new assessment scheme that builds on current approaches, steering towards a more holistic evaluation. ChoRISO is a curated dataset along with multiple tailored splits to recreate chemically relevant scenarios. Our work paves the way towards robust prediction models that can ultimately accelerate chemical discovery.
Score: 0.3065062372337749
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The prediction of chemical reactions has gained significant interest within the machine learning community in recent years, owing to its complexity and crucial applications in chemistry. However, model evaluation for this task has been mostly limited to simple metrics like top-k accuracy, which obfuscates fine details of a model's limitations. Inspired by progress in other fields, we propose a new assessment scheme that builds on top of current approaches, steering towards a more holistic evaluation. We introduce the following key components for this goal: CHORISO, a curated dataset along with multiple tailored splits to recreate chemically relevant scenarios, and a collection of metrics that provide a holistic view of a model's advantages and limitations. Application of this method to state-of-the-art models reveals important differences on sensitive fronts, especially stereoselectivity and chemical out-of-distribution generalization. Our work paves the way towards robust prediction models that can ultimately accelerate chemical discovery.

Related papers

Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models [68.57424628540907]
Large language models (LLMs) often develop learned mechanisms specialized to specific datasets.<n>We introduce a fine-tuning approach designed to enhance generalization by identifying and pruning neurons associated with dataset-specific mechanisms.<n>Our method employs Integrated Gradients to quantify each neuron's influence on high-confidence predictions, pinpointing those that disproportionately contribute to dataset-specific performance.
arXiv Detail & Related papers (2025-07-12T08:10:10Z)
Interpretable Deep Learning for Polar Mechanistic Reaction Prediction [43.95903801494905]
We introduce PMechRP (Polar Mechanistic Reaction Predictor), a system that trains machine learning models on the PMechDB dataset. We train compare a range of machine learning models, including transformer-based, graph-based and two-step siamese architectures. Our best-performing approach was a hybrid model, which combines a 5-ensemble of Chemformer models with a two-step Siamese framework.
arXiv Detail & Related papers (2025-04-22T02:31:23Z)
Chemical knowledge-informed framework for privacy-aware retrosynthesis learning [60.93245342663455]
Current machine learning-based retrosynthesis gathers reaction data from multiple sources into one single edge to train prediction models. This paradigm poses considerable privacy risks as it necessitates broad data availability across organizational boundaries. In the present study, we introduce the chemical knowledge-informed framework (CKIF), a privacy-preserving approach for learning retrosynthesis models.
arXiv Detail & Related papers (2025-02-26T13:13:24Z)
Challenging reaction prediction models to generalize to novel chemistry [12.33727805025678]
We report a series of evaluations of a prototypical SMILES-based deep learning model. First, we illustrate how performance on randomly sampled datasets is overly optimistic compared to performance when generalizing to new patents or new authors. Second, we conduct time splits that evaluate how models perform when tested on reactions published in years after those in their training set, mimicking real-world deployment.
arXiv Detail & Related papers (2025-01-11T23:49:14Z)
Chimera: Accurate retrosynthesis prediction by ensembling models with diverse inductive biases [3.885174353072695]
Planning and conducting chemical syntheses remains a major bottleneck in the discovery of functional small molecules. Inspired by how chemists use different strategies to ideate reactions, we propose Chimera: a framework for building highly accurate reaction models.
arXiv Detail & Related papers (2024-12-06T18:55:19Z)
Learning Chemical Reaction Representation with Reactant-Product Alignment [50.28123475356234]
This paper introduces modelname, a novel chemical reaction representation learning model tailored for a variety of organic-reaction-related tasks. By integrating atomic correspondence between reactants and products, our model discerns the molecular transformations that occur during the reaction, thereby enhancing the comprehension of the reaction mechanism. We have designed an adapter structure to incorporate reaction conditions into the chemical reaction representation, allowing the model to handle diverse reaction conditions and adapt to various datasets and downstream tasks, e.g., reaction performance prediction.
arXiv Detail & Related papers (2024-11-26T17:41:44Z)
Beyond Major Product Prediction: Reproducing Reaction Mechanisms with Machine Learning Models Trained on a Large-Scale Mechanistic Dataset [10.968137261042715]
Mechanistic understanding of organic reactions can facilitate reaction development, impurity prediction, and in principle, reaction discovery. While several machine learning models have sought to address the task of predicting reaction products, their extension to predicting reaction mechanisms has been impeded by the lack of a corresponding mechanistic dataset. We construct such a dataset by imputing intermediates between experimentally reported reactants and products using expert reaction templates and train several machine learning models on the resulting dataset of 5,184,184 elementary steps.
arXiv Detail & Related papers (2024-03-07T15:26:23Z)
Predictive Churn with the Set of Good Models [64.05949860750235]
We study the effect of conflicting predictions over the set of near-optimal machine learning models. We present theoretical results on the expected churn between models within the Rashomon set. We show how our approach can be used to better anticipate, reduce, and avoid churn in consumer-facing applications.
arXiv Detail & Related papers (2024-02-12T16:15:25Z)
Machine Learning Small Molecule Properties in Drug Discovery [44.62264781248437]
We review a wide range of properties, including binding affinities, solubility, and ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) We discuss existing popular descriptors and embeddings, such as chemical fingerprints and graph-based neural networks. Finally, techniques to provide an understanding of model predictions, especially for critical decision-making in drug discovery are assessed.
arXiv Detail & Related papers (2023-08-02T22:18:41Z)
Bridging the Gap between Chemical Reaction Pretraining and Conditional Molecule Generation with a Unified Model [3.3031562864527664]
We propose a unified framework that addresses both the reaction representation learning and molecule generation tasks. Inspired by the organic chemistry mechanism, we develop a novel pretraining framework that enables us to incorporate inductive biases into the model. Our framework achieves state-of-the-art results on challenging downstream tasks.
arXiv Detail & Related papers (2023-03-13T10:06:41Z)
Retrieval-based Controllable Molecule Generation [63.44583084888342]
We propose a new retrieval-based framework for controllable molecule generation. We use a small set of molecules to steer the pre-trained generative model towards synthesizing molecules that satisfy the given design criteria. Our approach is agnostic to the choice of generative models and requires no task-specific fine-tuning.
arXiv Detail & Related papers (2022-08-23T17:01:16Z)
MetaRF: Differentiable Random Forest for Reaction Yield Prediction with a Few Trails [58.47364143304643]
In this paper, we focus on the reaction yield prediction problem. We first put forth MetaRF, an attention-based differentiable random forest model specially designed for the few-shot yield prediction. To improve the few-shot learning performance, we further introduce a dimension-reduction based sampling method.
arXiv Detail & Related papers (2022-08-22T06:40:13Z)
Toward Development of Machine Learned Techniques for Production of Compact Kinetic Models [0.0]
Chemical kinetic models are an essential component in the development and optimisation of combustion devices. We present a novel automated compute intensification methodology to produce overly-reduced and optimised chemical kinetic models.
arXiv Detail & Related papers (2022-02-16T12:31:24Z)
Improving Molecular Representation Learning with Metric Learning-enhanced Optimal Transport [49.237577649802034]
We develop a novel optimal transport-based algorithm termed MROT to enhance their generalization capability for molecular regression problems. MROT significantly outperforms state-of-the-art models, showing promising potential in accelerating the discovery of new substances.
arXiv Detail & Related papers (2022-02-13T04:56:18Z)
Integrating Expert ODEs into Neural ODEs: Pharmacology and Disease Progression [71.7560927415706]
latent hybridisation model (LHM) integrates a system of expert-designed ODEs with machine-learned Neural ODEs to fully describe the dynamics of the system. We evaluate LHM on synthetic data as well as real-world intensive care data of COVID-19 patients.
arXiv Detail & Related papers (2021-06-05T11:42:45Z)
Unassisted Noise Reduction of Chemical Reaction Data Sets [59.127921057012564]
We propose a machine learning-based, unassisted approach to remove chemically wrong entries from data sets. Our results show an improved prediction quality for models trained on the cleaned and balanced data sets.
arXiv Detail & Related papers (2021-02-02T09:34:34Z)
Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis [1.6449390849183363]
Retrosynthesis is a problem to infer reactant compounds to synthesize a given product compound through chemical reactions. Recent studies on retrosynthesis focus on proposing more sophisticated prediction models. The dataset to feed the models also plays an essential role in achieving the best generalizing models.
arXiv Detail & Related papers (2020-10-02T05:27:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.