Unassisted Noise Reduction of Chemical Reaction Data Sets
- URL: http://arxiv.org/abs/2102.01399v1
- Date: Tue, 2 Feb 2021 09:34:34 GMT
- Title: Unassisted Noise Reduction of Chemical Reaction Data Sets
- Authors: Alessandra Toniato, Philippe Schwaller, Antonio Cardinale, Joppe Geluykens and Teodoro Laino
- Abstract summary: We propose a machine learning-based, unassisted approach to remove chemically wrong entries from data sets.
Our results show an improved prediction quality for models trained on the cleaned and balanced data sets.
- Score: 59.127921057012564
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Existing deep learning models applied to reaction prediction in organic
chemistry can reach high levels of accuracy (> 90% for Natural Language
Processing-based ones). With no chemical knowledge embedded other than the
information learnt from reaction data, the quality of the data sets plays a
crucial role in the performance of the prediction models. Because human curation
is prohibitively expensive, unaided approaches to remove
chemically incorrect entries from existing data sets are essential to improve
artificial intelligence models' performance in synthetic chemistry tasks. Here
we propose a machine learning-based, unassisted approach to remove chemically
wrong entries from chemical reaction collections. We applied this method to the
collection of chemical reactions Pistachio and to an open data set, both
extracted from USPTO (United States Patent and Trademark Office) patents. Our results show an
improved prediction quality for models trained on the cleaned and balanced data
sets. For the retrosynthetic models, the round-trip accuracy metric grows by 13
percentage points and the value of the cumulative Jensen-Shannon divergence
decreases by 30% compared to its original value. The coverage remains high
at 97%, and the value of the class diversity is not affected by the cleaning.
The proposed strategy is the first unassisted rule-free technique to address
automatic noise reduction in chemical data sets.
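
As a rough illustration of the divergence mentioned in the abstract, the sketch below computes the plain Jensen-Shannon divergence between two discrete distributions, for example the share of predictions falling into each reaction class. It is not the paper's cumulative variant, whose exact definition is not given here. The function names and the example counts are hypothetical and chosen only for illustration.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions, in bits.

    Inputs may be raw counts; they are normalised to sum to 1 first.
    The result lies between 0 (identical) and 1 (disjoint support).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("distributions must have positive total mass")
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    # The mixture m is positive wherever p or q is positive,
    # so the KL terms below never divide by zero.
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Hypothetical per-class prediction counts, for illustration only.
    balanced = [25, 25, 25, 25]
    skewed = [70, 10, 10, 10]
    print(jensen_shannon_divergence(balanced, skewed))
```

A value near 0 means the two distributions are nearly identical, so a lower divergence between per-class distributions indicates a more uniform treatment of the classes.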
Related papers
- ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering [54.80411755871931]
Question Answering (QA) effectively evaluates language models' reasoning and knowledge depth.
Chemical QA plays a crucial role in both education and research by effectively translating complex chemical information into a readily understandable format.
This dataset reflects typical real-world challenges, including an imbalanced data distribution and a substantial amount of unlabeled data that can be potentially useful.
We introduce a QAMatch model, specifically designed to effectively answer chemical questions by fully leveraging our collected data.
arXiv Detail & Related papers (2024-07-24T01:46:55Z)
- An Autonomous Large Language Model Agent for Chemical Literature Data Mining [60.85177362167166]
We introduce an end-to-end AI agent framework capable of high-fidelity extraction from extensive chemical literature.
Our framework's efficacy is evaluated using accuracy, recall, and F1 score of reaction condition data.
arXiv Detail & Related papers (2024-02-20T13:21:46Z)
- Retrosynthesis prediction enhanced by in-silico reaction data augmentation [66.5643280109899]
We present RetroWISE, a framework that employs a base model inferred from real paired data to perform in-silico reaction generation and augmentation.
On three benchmark datasets, RetroWISE achieves the best overall performance against state-of-the-art models.
arXiv Detail & Related papers (2024-01-31T07:40:37Z)
- ReacLLaMA: Merging chemical and textual information in chemical reactivity AI models [0.0]
Chemical reactivity models are developed to predict chemical reaction outcomes in the form of classification (success/failure) or regression (product yield) tasks.
The vast majority of the reported models are trained solely on chemical information such as reactants, products, reagents, and solvents.
Herein, the incorporation of procedural text to augment the Graphormer reactivity model and improve its accuracy is presented.
arXiv Detail & Related papers (2024-01-30T18:57:08Z)
- MetaRF: Differentiable Random Forest for Reaction Yield Prediction with a Few Trails [58.47364143304643]
In this paper, we focus on the reaction yield prediction problem.
We first put forth MetaRF, an attention-based differentiable random forest model specially designed for the few-shot yield prediction.
To improve the few-shot learning performance, we further introduce a dimension-reduction based sampling method.
arXiv Detail & Related papers (2022-08-22T06:40:13Z)
- Dataset Bias in the Natural Sciences: A Case Study in Chemical Reaction Prediction and Synthesis Design [0.8594140167290099]
We identify three trends within the fields of chemical reaction prediction and synthesis design that require a change in direction.
First, the manner in which reaction datasets are split into reactants and reagents encourages testing models in an unrealistically generous manner.
Second, we highlight the prevalence of mislabelled data, and suggest that the focus should be on outlier removal rather than data fitting only.
arXiv Detail & Related papers (2021-05-06T13:11:56Z)
- Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis [1.6449390849183363]
Retrosynthesis is the problem of inferring the reactant compounds needed to synthesize a given product compound through chemical reactions.
Recent studies on retrosynthesis focus on proposing more sophisticated prediction models.
The dataset to feed the models also plays an essential role in achieving the best generalizing models.
arXiv Detail & Related papers (2020-10-02T05:27:51Z)
- Deep Learning for Virtual Screening: Five Reasons to Use ROC Cost Functions [80.12620331438052]
Deep learning has become an important tool for rapid screening of billions of molecules in silico for potential hits containing desired chemical features.
Despite its importance, substantial challenges persist in training these models, such as severe class imbalance, high decision thresholds, and lack of ground truth labels in some datasets.
We argue in favor of directly optimizing the receiver operating characteristic (ROC) in such cases, due to its robustness to class imbalance.
arXiv Detail & Related papers (2020-06-25T08:46:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.