Unassisted Noise Reduction of Chemical Reaction Data Sets
- URL: http://arxiv.org/abs/2102.01399v1
- Date: Tue, 2 Feb 2021 09:34:34 GMT
- Title: Unassisted Noise Reduction of Chemical Reaction Data Sets
- Authors: Alessandra Toniato, Philippe Schwaller, Antonio Cardinale, Joppe Geluykens and Teodoro Laino
- Abstract summary: We propose a machine learning-based, unassisted approach to remove chemically wrong entries from data sets.
Our results show an improved prediction quality for models trained on the cleaned and balanced data sets.
- Score: 59.127921057012564
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Existing deep learning models applied to reaction prediction in organic
chemistry can reach high levels of accuracy (> 90% for Natural Language
Processing-based ones). With no chemical knowledge embedded other than the
information learnt from reaction data, the quality of the data sets plays a
crucial role in the performance of the prediction models. Because human curation
is prohibitively expensive, unaided approaches to remove chemically incorrect
entries from existing data sets are essential to improve
artificial intelligence models' performance in synthetic chemistry tasks. Here
we propose a machine learning-based, unassisted approach to remove chemically
wrong entries from chemical reaction collections. We applied this method to the
collection of chemical reactions Pistachio and to an open data set, both
extracted from USPTO (United States Patent and Trademark Office) patents. Our results show an
improved prediction quality for models trained on the cleaned and balanced data
sets. For the retrosynthetic models, the round-trip accuracy metric grows by 13
percentage points and the value of the cumulative Jensen-Shannon divergence
decreases by 30% compared to its original value. The coverage remains high
at 97%, and the class diversity is not affected by the cleaning.
The proposed strategy is the first unassisted rule-free technique to address
automatic noise reduction in chemical data sets.
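
The abstract reports a cumulative Jensen-Shannon divergence without defining it. As background only, the following minimal sketch computes the standard (non-cumulative) Jensen-Shannon divergence between two discrete probability distributions. It is not the authors' implementation, the cumulative variant used in the paper may be defined differently, and the function names `kl_divergence` and `jensen_shannon_divergence` are chosen here purely for illustration.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Standard Jensen-Shannon divergence between two discrete distributions.

    Assumes p and q are sequences of non-negative numbers that each sum to 1
    and are defined over the same ordered set of outcomes.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Mixture distribution m = (p + q) / 2; m_i > 0 wherever p_i or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (distributions with no overlapping support), so a lower value means the two distributions are more alike.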