Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis
- URL: http://arxiv.org/abs/2010.00792v1
- Date: Fri, 2 Oct 2020 05:27:51 GMT
- Title: Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis
- Authors: Katsuhiko Ishiguro, Kazuya Ujihara, Ryohto Sawada, Hirotaka Akita,
Masaaki Kotera
- Abstract summary: Retrosynthesis is the problem of inferring reactant compounds that can synthesize a given product compound through chemical reactions.
Recent studies on retrosynthesis focus on proposing more sophisticated prediction models.
The dataset fed to the models also plays an essential role in obtaining models that generalize well.
- Score: 1.6449390849183363
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrosynthesis is the problem of inferring reactant compounds that can
synthesize a given product compound through chemical reactions. Recent studies on
retrosynthesis focus on proposing more sophisticated prediction models, but the
dataset fed to the models also plays an essential role in obtaining models that
generalize well. Generally, a dataset that is best suited for a specific task
tends to be small. In such a case, the standard solution is to transfer knowledge
from a large or clean dataset in the same domain. In this paper, we conduct a
systematic and intensive examination of data transfer approaches on end-to-end
generative models, applied to retrosynthesis. Experimental results show that
typical data transfer methods can improve the test prediction scores of an
off-the-shelf Transformer baseline model. In particular, the pre-training plus
fine-tuning approach boosts the accuracy scores of the baseline, achieving a new
state of the art. In addition, a manual inspection of the erroneous prediction
results shows that the pre-training plus fine-tuning models can generate
chemically appropriate or sensible proposals in almost all cases.
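The winning recipe, pre-training a sequence-to-sequence Transformer on a large reaction corpus and then fine-tuning it on the small target dataset, can be sketched compactly. The character vocabulary, toy SMILES pairs, and hyperparameters below are illustrative assumptions, not the authors' configuration:

```python
# Minimal sketch of pre-train -> fine-tune for a seq-to-seq retrosynthesis
# Transformer (product SMILES in, reactant SMILES out). Toy data throughout.
import torch
import torch.nn as nn

PAD, BOS, EOS = 0, 1, 2
VOCAB = {c: i + 3 for i, c in enumerate("()=#+-.0123456789BCFHINOPSclnors")}

def encode(smiles: str, max_len: int = 64) -> torch.Tensor:
    # Character-level tokenization; unknown characters fall back to PAD.
    ids = [BOS] + [VOCAB.get(c, PAD) for c in smiles][: max_len - 2] + [EOS]
    return torch.tensor(ids + [PAD] * (max_len - len(ids)))

class Seq2SeqRetro(nn.Module):
    def __init__(self, vocab_size: int = len(VOCAB) + 3, d_model: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model, padding_idx=PAD)
        self.transformer = nn.Transformer(
            d_model, nhead=4, num_encoder_layers=2, num_decoder_layers=2,
            dim_feedforward=256, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt):
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.transformer(self.emb(src), self.emb(tgt), tgt_mask=mask)
        return self.out(h)

def run_epochs(model, pairs, lr, epochs):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)
    for _ in range(epochs):
        for product, reactants in pairs:
            src, tgt = encode(product)[None], encode(reactants)[None]
            logits = model(src, tgt[:, :-1])          # teacher forcing
            loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                           tgt[:, 1:].reshape(-1))
            opt.zero_grad(); loss.backward(); opt.step()

model = Seq2SeqRetro()
# 1) Pre-train on a large, possibly noisy reaction corpus (placeholder pairs).
run_epochs(model, [("CCO", "CC=O"), ("CCN", "CC=O.N")], lr=1e-3, epochs=2)
# 2) Fine-tune on the small, clean target dataset with a lower learning rate.
run_epochs(model, [("c1ccccc1O", "c1ccccc1")], lr=1e-4, epochs=2)
```

Fine-tuning reuses the pre-trained weights and typically a smaller learning rate, which is the only difference between the two training calls above.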
Related papers
- Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences [20.629333587044012]
We study the impact of data curation on iterated retraining of generative models.
We prove that, if the data is curated according to a reward model, the expected reward of the iterative retraining procedure is maximized.
arXiv Detail & Related papers (2024-06-12T21:28:28Z)
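A toy version of the curated retraining loop this entry studies may make the claim concrete: samples from the current generator are scored by a reward model, only the top-scoring ones are kept, and the generator is refit on the curated set. The 1-D Gaussian "generator" and the reward function are stand-ins chosen only to keep the loop runnable:

```python
# Reward-curated iterative retraining, caricatured with a Gaussian generator.
import numpy as np

rng = np.random.default_rng(0)
reward = lambda x: -np.abs(x - 3.0)      # assumed reward model: prefer x near 3

data = rng.normal(0.0, 1.0, size=1000)   # initial real data
for round_ in range(5):
    mu, sigma = data.mean(), data.std() + 1e-6      # "retrain" the generator
    samples = rng.normal(mu, sigma, size=1000)      # generate synthetic data
    keep = samples[np.argsort(reward(samples))[-250:]]  # curate top 25% by reward
    data = keep                           # next round trains on curated samples
    print(f"round {round_}: mean reward {reward(samples).mean():.3f}")
```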
- Transfer Learning for Molecular Property Predictions from Small Data Sets [0.0]
We benchmark common machine learning models for the prediction of molecular properties on two small data sets.
We present a transfer learning strategy that uses large data sets to pre-train the respective models and yields more accurate models after fine-tuning on the original data sets.
arXiv Detail & Related papers (2024-04-20T14:25:34Z)
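The strategy this entry describes is the generic pre-train/fine-tune pattern for property prediction; a minimal sketch with placeholder data (random tensors standing in for real molecular descriptors and labels) might look like this:

```python
# Pre-train a property regressor on a large related set, then fine-tune on the
# small target set. Features and targets here are random placeholders.
import torch
import torch.nn as nn

def fit(model, X, y, lr, epochs=200):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(model(X).squeeze(-1), y)
        opt.zero_grad(); loss.backward(); opt.step()

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
# 1) Pre-train on a large, related data set.
X_big, y_big = torch.randn(5000, 16), torch.randn(5000)
fit(model, X_big, y_big, lr=1e-3)
# 2) Fine-tune on the small target set: freeze the feature layer and use a
#    smaller learning rate so the few labels mainly adapt the head.
for p in model[0].parameters():
    p.requires_grad = False
X_small, y_small = torch.randn(40, 16), torch.randn(40)
fit(model, X_small, y_small, lr=1e-4)
```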
- Retrosynthesis prediction enhanced by in-silico reaction data augmentation [66.5643280109899]
We present RetroWISE, a framework that employs a base model inferred from real paired data to perform in-silico reaction generation and augmentation.
On three benchmark datasets, RetroWISE achieves the best overall performance against state-of-the-art models.
arXiv Detail & Related papers (2024-01-31T07:40:37Z)
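Based only on the summary above, the augmentation loop can be sketched as: a base retrosynthesis model proposes reactants for unpaired products, and a round-trip check keeps confident pairs. All helpers below are hypothetical stand-ins, not RetroWISE's actual components:

```python
# In-silico reaction augmentation with a round-trip consistency filter.
from typing import Callable

def augment(products: list[str],
            propose_reactants: Callable[[str], str],   # base retro model (assumed)
            forward_predict: Callable[[str], str],     # forward model (assumed)
            paired_data: list[tuple[str, str]]) -> list[tuple[str, str]]:
    augmented = list(paired_data)
    for product in products:
        reactants = propose_reactants(product)
        # Keep the pair only if the forward model maps the proposed
        # reactants back to the original product.
        if forward_predict(reactants) == product:
            augmented.append((product, reactants))
    return augmented

# Toy stand-ins so the sketch runs end to end.
retro = lambda p: p + ".O"            # pretend every product came from "p + water"
forward = lambda r: r.split(".")[0]
print(augment(["CCO"], retro, forward, paired_data=[("CC=O", "CCO")]))
```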
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
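A rough rendering of the DGE idea under simplifying assumptions (Gaussian "generators" and logistic regression as the downstream task): train K generative models on resampled data, fit a downstream model per synthetic dataset, and average predictions across members:

```python
# Deep Generative Ensemble, caricatured: averaging downstream predictions over
# K independently trained generators approximates integrating over the
# posterior of the generative process parameters.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_real = rng.normal(0, 1, size=(200, 2))
y_real = (X_real.sum(axis=1) > 0).astype(int)

members = []
for seed in range(5):                                 # K ensemble members
    r = np.random.default_rng(seed)
    boot = r.choice(len(X_real), size=len(X_real))    # per-member resample
    mu, sd = X_real[boot].mean(axis=0), X_real[boot].std(axis=0)
    X_syn = r.normal(mu, sd, size=(200, 2))           # "generative model" sample
    y_syn = (X_syn.sum(axis=1) > 0).astype(int)       # assumed labeling function
    members.append(LogisticRegression().fit(X_syn, y_syn))

X_test = rng.normal(0, 1, size=(5, 2))
p = np.mean([m.predict_proba(X_test)[:, 1] for m in members], axis=0)
print(p.round(3))
```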
- MRCLens: an MRC Dataset Bias Detection Toolkit [82.44296974850639]
We introduce MRCLens, a toolkit that detects whether biases exist before users train the full model.
To make the toolkit easier to introduce, we also provide a categorization of common biases in MRC.
arXiv Detail & Related papers (2022-07-18T21:05:39Z)
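MRCLens's real API is not reproduced here, but the kind of bias check such a toolkit performs can be illustrated with a partial-input probe: if a model that sees only the question (never the passage) beats chance, the dataset leaks answer signal. A toy version:

```python
# Question-only baseline as a dataset bias probe (illustrative, not MRCLens's API).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

questions = ["when was X born", "where is Y", "when did Z die", "where was W built"]
answer_type = ["date", "place", "date", "place"]   # toy labels

X = CountVectorizer().fit_transform(questions)     # question text only, no passage
score = cross_val_score(LogisticRegression(), X, answer_type, cv=2).mean()
print(f"question-only accuracy: {score:.2f} (well above chance => possible bias)")
```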
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
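A condensed sketch of the control flow the summary describes, with per-column automatic model selection via cross-validation; the HyperImpute library's actual interfaces differ:

```python
# Iterative column-wise imputation: sweep columns, pick the best candidate
# model for each by CV, and refill missing cells from its predictions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)); X[:, 2] = X[:, 0] * 2 + rng.normal(0, .1, 200)
mask = rng.random(X.shape) < 0.1                    # 10% missing at random
X_miss = np.where(mask, np.nan, X)

X_imp = np.where(mask, np.nanmean(X_miss, axis=0), X_miss)   # mean init
for _ in range(3):                                  # sweep until roughly converged
    for j in range(X.shape[1]):
        obs = ~mask[:, j]
        if obs.all():
            continue
        other = np.delete(X_imp, j, axis=1)
        # Automatic model selection: keep the candidate with the best CV score.
        candidates = [LinearRegression(), RandomForestRegressor(n_estimators=20)]
        best = max(candidates, key=lambda m: cross_val_score(
            m, other[obs], X_imp[obs, j], cv=3).mean())
        best.fit(other[obs], X_imp[obs, j])
        X_imp[mask[:, j], j] = best.predict(other[mask[:, j]])
```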
- Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion (CMI), where data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z)
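The core CMI idea, synthesizing inputs that the teacher classifies confidently while a contrastive term explicitly rewards batch diversity, can be caricatured in a few lines. The untrained linear "teacher" below is a stand-in for the trained CIFAR-scale networks used in the paper:

```python
# Model inversion with an added contrastive diversity objective (caricature).
import torch
import torch.nn.functional as F

teacher = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 10)).eval()
for p in teacher.parameters():
    p.requires_grad_(False)                     # only the inputs are optimized

x = torch.randn(16, 32, requires_grad=True)    # the "inverted" batch
opt = torch.optim.Adam([x], lr=0.05)

for step in range(100):
    logits = teacher(x)
    conf = F.cross_entropy(logits, logits.argmax(dim=1))   # teacher confidence
    z = F.normalize(x, dim=1)
    sim = z @ z.t() / 0.5                       # cosine similarities, temperature
    # Contrastive diversity: each sample should match only itself, which
    # pushes the synthesized inputs apart (stronger instance discrimination).
    contrast = F.cross_entropy(sim, torch.arange(len(x)))
    loss = conf + contrast
    opt.zero_grad(); loss.backward(); opt.step()
```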
- Unassisted Noise Reduction of Chemical Reaction Data Sets [59.127921057012564]
We propose a machine learning-based, unassisted approach to remove chemically wrong entries from data sets.
Our results show improved prediction quality for models trained on the cleaned and balanced data sets.
arXiv Detail & Related papers (2021-02-02T09:34:34Z)
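In the spirit of the summary (the paper's exact procedure is not shown here), one unassisted filter lets cross-validated model confidence veto suspicious entries: train on the raw set, then drop entries whose recorded label the model strongly disbelieves. A text-classification stand-in replaces real reaction data:

```python
# Confidence-based removal of likely mislabeled entries via cross-validation.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

reactions = ["CCO>>CC=O", "CC=O>>CCO", "CCN>>CC#N", "CC#N>>CCN"] * 25
labels = np.array(["oxidation", "reduction", "dehydration", "hydration"] * 25)
labels[0] = "reduction"                       # inject one chemically wrong label

X = CountVectorizer(analyzer="char", ngram_range=(2, 3)).fit_transform(reactions)
proba = cross_val_predict(LogisticRegression(max_iter=1000), X, labels,
                          cv=5, method="predict_proba")
classes = sorted(set(labels))                 # matches the estimator's class order
conf_in_label = proba[np.arange(len(labels)),
                      [classes.index(l) for l in labels]]
keep = conf_in_label > 0.2                    # drop entries the model disbelieves
print(f"removed {np.count_nonzero(~keep)} suspicious entries")
```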
- Improving Molecular Design by Stochastic Iterative Target Augmentation [38.44457632751997]
Generative models in molecular design tend to be richly parameterized, data-hungry neural models.
We propose a surprisingly effective self-training approach for iteratively creating additional molecular targets.
Our approach outperforms the previous state of the art in conditional molecular design by over 10% in absolute gain.
arXiv Detail & Related papers (2020-02-11T22:40:04Z)
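The self-training loop the summary describes can be schematized as follows; `model.sample` and `passes_filter` are hypothetical placeholders for the paper's generator and property predictor:

```python
# Stochastic iterative target augmentation: between training rounds, sample
# candidate targets and keep only those the external filter accepts.
import random

def augment_targets(inputs, model, passes_filter, k=4):
    extra = []
    for x in inputs:
        for _ in range(k):                    # stochastic candidate sampling
            y = model.sample(x)
            if passes_filter(x, y):
                extra.append((x, y))          # accepted as an extra target
    return extra

# Toy stand-ins so the loop runs: the "model" perturbs a number, the filter
# keeps even outputs.
class ToyModel:
    def sample(self, x):
        return x + random.randint(-2, 2)

extra = augment_targets([10, 21], ToyModel(), lambda x, y: y % 2 == 0)
print(extra)   # fold these pairs back into the training set and repeat
```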
- Forecasting Industrial Aging Processes with Machine Learning Methods [0.0]
We evaluate a wider range of data-driven models, comparing some traditional stateless models to more complex recurrent neural networks.
Our results show that recurrent models produce near-perfect predictions when trained on larger datasets.
arXiv Detail & Related papers (2020-02-05T13:06:44Z)
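A minimal recurrent forecaster of the kind benchmarked in this entry (an LSTM mapping a window of past readings to the next value), with a synthetic degradation curve standing in for real plant data:

```python
# LSTM one-step-ahead forecaster on a toy exponential aging signal.
import torch
import torch.nn as nn

t = torch.linspace(0, 20, 500)
signal = torch.exp(-0.1 * t) + 0.02 * torch.randn(500)   # toy aging curve
window = 30
X = torch.stack([signal[i:i + window] for i in range(len(signal) - window)])
y = signal[window:]

class LSTMForecaster(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=32, batch_first=True)
        self.head = nn.Linear(32, 1)

    def forward(self, x):                    # x: (batch, window)
        out, _ = self.lstm(x.unsqueeze(-1))  # treat each reading as a 1-D step
        return self.head(out[:, -1]).squeeze(-1)

model = LSTMForecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    loss = nn.functional.mse_loss(model(X), y)
    opt.zero_grad(); loss.backward(); opt.step()
```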