Retrosynthesis prediction enhanced by in-silico reaction data augmentation
- URL: http://arxiv.org/abs/2402.00086v1
- Date: Wed, 31 Jan 2024 07:40:37 GMT
- Title: Retrosynthesis prediction enhanced by in-silico reaction data augmentation
- Authors: Xu Zhang and Yiming Mo and Wenguan Wang and Yi Yang
- Abstract summary: We present RetroWISE, a framework that employs a base model inferred from real paired data to perform in-silico reaction generation and augmentation.
On three benchmark datasets, RetroWISE achieves the best overall performance against state-of-the-art models.
- Score: 66.5643280109899
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in machine learning (ML) have expedited retrosynthesis
research by helping chemists design experiments more efficiently. However,
all ML-based methods consume substantial amounts of paired training data (i.e.,
chemical reaction: product-reactant(s) pair), which is costly to obtain.
Moreover, companies view reaction data as a valuable asset and restrict
researchers' access to it. These issues prevent the creation of more
powerful retrosynthesis models due to their data-driven nature. As a response,
we exploit easy-to-access unpaired data (i.e., one component of
product-reactant(s) pair) for generating in-silico paired data to facilitate
model training. Specifically, we present RetroWISE, a self-boosting framework
that employs a base model inferred from real paired data to perform in-silico
reaction generation and augmentation using unpaired data, ultimately leading to
a superior model. On three benchmark datasets, RetroWISE achieves the best
overall performance against state-of-the-art models (e.g., +8.6% top-1 accuracy
on the USPTO-50K test dataset). Moreover, it consistently improves the
prediction accuracy of rare transformations. These results show that RetroWISE
overcomes the training bottleneck with in-silico reactions, thereby paving
the way toward more effective ML-based retrosynthesis models.
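The self-boosting idea described in the abstract can be sketched as a simple loop: train a base model on real paired data, use it to generate pseudo-pairs from unpaired products, keep only confident generations, and retrain on the augmented set. The snippet below is a toy illustration under stated assumptions: the lookup-table `train`, the prefix-overlap confidence in `predict_reactants`, and the 0.5 threshold are hypothetical stand-ins, not RetroWISE's actual transformer model or scoring.

```python
from os.path import commonprefix

def train(paired_data):
    # Toy "model": a lookup table from product to reactants
    # (a stand-in for fitting a real sequence-to-sequence model).
    return dict(paired_data)

def predict_reactants(model, product):
    # Toy inference: reuse the reactants of the known product with the
    # largest prefix overlap, using that overlap as a confidence score.
    best, best_conf = None, 0.0
    for known, reactants in model.items():
        conf = len(commonprefix([known, product])) / max(len(known), len(product))
        if conf > best_conf:
            best, best_conf = reactants, conf
    return best, best_conf

def self_boost(real_pairs, unpaired_products, conf_threshold=0.5):
    base_model = train(real_pairs)         # 1) base model from real paired data
    in_silico = []
    for product in unpaired_products:      # 2) in-silico reaction generation
        reactants, conf = predict_reactants(base_model, product)
        if conf >= conf_threshold:         # 3) keep only confident pseudo-pairs
            in_silico.append((product, reactants))
    return train(real_pairs + in_silico)   # 4) retrain on the augmented set
```

For example, with one real pair ("CCO", "CC.O") and unpaired products ["CCON", "XYZ"], only "CCON" clears the confidence threshold and enters the augmented training set.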
Related papers
- Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences [20.629333587044012]
We study the impact of data curation on iterated retraining of generative models.
We prove that, if the data is curated according to a reward model, the expected reward of the iterative retraining procedure is maximized.
arXiv Detail & Related papers (2024-06-12T21:28:28Z) - ReacLLaMA: Merging chemical and textual information in chemical reactivity AI models [0.0]
Chemical reactivity models are developed to predict chemical reaction outcomes in the form of classification (success/failure) or regression (product yield) tasks.
The vast majority of the reported models are trained solely on chemical information such as reactants, products, reagents, and solvents.
Here, the incorporation of procedural text to augment the Graphormer reactivity model and improve its accuracy is presented.
arXiv Detail & Related papers (2024-01-30T18:57:08Z) - Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation [96.92250565207017]
We study the data efficiency and selection for the dataset distillation task.
By re-formulating the dynamics of distillation, we provide insight into the inherent redundancy in the real dataset.
We identify the most influential samples based on their causal effects on the distillation.
arXiv Detail & Related papers (2023-05-28T06:53:41Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
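The ensemble idea above can be sketched with a toy example: train several generative models on bootstrap resamples of the data and treat them as approximate posterior samples over the generative process. In the sketch below, a one-dimensional Gaussian fit stands in for a deep generative model; the function names and the bootstrap-as-posterior shortcut are illustrative assumptions, not the paper's exact DGE construction.

```python
import random
import statistics

def fit_gaussian(sample):
    # Stand-in "generative model": a Gaussian fitted to the sample.
    return statistics.fmean(sample), statistics.pstdev(sample)

def deep_generative_ensemble(data, k=5, n_synth=100, seed=0):
    # Train k generators on bootstrap resamples (approximating posterior
    # samples over model parameters) and draw synthetic data from each.
    rng = random.Random(seed)
    synthetic_sets = []
    for _ in range(k):
        boot = [rng.choice(data) for _ in data]
        mu, sigma = fit_gaussian(boot)
        synthetic_sets.append([rng.gauss(mu, sigma) for _ in range(n_synth)])
    return synthetic_sets
```

A downstream task trained on each synthetic set (or their union) then reflects uncertainty over the generative process rather than the idiosyncrasies of a single generator.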
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - PartMix: Regularization Strategy to Learn Part Discovery for Visible-Infrared Person Re-identification [76.40417061480564]
We present a novel data augmentation technique, dubbed PartMix, for part-based Visible-Infrared person Re-IDentification (VI-ReID) models.
We synthesize the augmented samples by mixing the part descriptors across the modalities to improve the performance of part-based VI-ReID models.
arXiv Detail & Related papers (2023-04-04T05:21:23Z) - Backdoor Attacks Against Dataset Distillation [24.39067295054253]
This study performs the first backdoor attack against models trained on data distilled by dataset distillation methods in the image domain.
We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING.
Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases.
arXiv Detail & Related papers (2023-01-03T16:58:34Z) - Multimodal Transformer-based Model for Buchwald-Hartwig and Suzuki-Miyaura Reaction Yield Prediction [0.0]
The model consists of a pre-trained bidirectional transformer-based encoder (BERT) and a multi-layer perceptron (MLP) with a regression head to predict the yield.
We tested the model's performance on out-of-sample dataset splits of Buchwald-Hartwig and achieved results comparable to the state of the art.
arXiv Detail & Related papers (2022-04-27T07:28:27Z) - Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z) - Unassisted Noise Reduction of Chemical Reaction Data Sets [59.127921057012564]
We propose a machine learning-based, unassisted approach to remove chemically incorrect entries from data sets.
Our results show an improved prediction quality for models trained on the cleaned and balanced data sets.
arXiv Detail & Related papers (2021-02-02T09:34:34Z) - Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis [1.6449390849183363]
Retrosynthesis is the problem of inferring the reactant compounds that synthesize a given product compound through chemical reactions.
Recent studies on retrosynthesis focus on proposing more sophisticated prediction models.
The dataset used to train the models also plays an essential role in obtaining the best-generalizing models.
arXiv Detail & Related papers (2020-10-02T05:27:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.