Related papers: Dataset Bias in the Natural Sciences: A Case Study in Chemical Reaction Prediction and Synthesis Design

URL: http://arxiv.org/abs/2105.02637v1
Date: Thu, 6 May 2021 13:11:56 GMT
Title: Dataset Bias in the Natural Sciences: A Case Study in Chemical Reaction Prediction and Synthesis Design
Authors: Ryan-Rhys Griffiths, Philippe Schwaller, Alpha A. Lee
Abstract summary: We identify three trends within the fields of chemical reaction prediction and synthesis design that require a change in direction. First, the manner in which reaction datasets are split into reactants and reagents encourages testing models in an unrealistically generous manner. Second, we highlight the prevalence of mislabelled data, and suggest that the focus should be on outlier removal rather than data fitting only.
Score: 0.8594140167290099
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Datasets in the Natural Sciences are often curated with the goal of aiding scientific understanding and hence may not always be in a form that facilitates the application of machine learning. In this paper, we identify three trends within the fields of chemical reaction prediction and synthesis design that require a change in direction. First, the manner in which reaction datasets are split into reactants and reagents encourages testing models in an unrealistically generous manner. Second, we highlight the prevalence of mislabelled data, and suggest that the focus should be on outlier removal rather than data fitting only. Lastly, we discuss the problem of reagent prediction, in addition to reactant prediction, in order to solve the full synthesis design problem, highlighting the mismatch between what machine learning solves and what a lab chemist would need. Our critiques are also relevant to the burgeoning field of using machine learning to accelerate progress in experimental Natural Sciences, where datasets are often split in a biased way, are highly noisy, and contextual variables that are not evident from the data strongly influence the outcome of experiments.

Related papers

The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning [4.864188241160383]
We introduce a novel dataset for yield prediction, providing the first-ever transient flow dataset for machine learning benchmarking. While previous datasets focus on discrete parameters, our experimental set-up allows us to sample a large number of continuous process conditions. We focus on solvent selection, a task that is particularly difficult to model theoretically and therefore ripe for machine learning applications.
arXiv Detail & Related papers (2025-06-09T10:34:14Z)
Chemical knowledge-informed framework for privacy-aware retrosynthesis learning [60.93245342663455]
Current machine learning-based retrosynthesis gathers reaction data from multiple sources into one single edge to train prediction models. This paradigm poses considerable privacy risks as it necessitates broad data availability across organizational boundaries. In the present study, we introduce the chemical knowledge-informed framework (CKIF), a privacy-preserving approach for learning retrosynthesis models.
arXiv Detail & Related papers (2025-02-26T13:13:24Z)
Learning Chemical Reaction Representation with Reactant-Product Alignment [50.28123475356234]
This paper introduces a novel chemical reaction representation learning model tailored for a variety of organic-reaction-related tasks. By integrating atomic correspondence between reactants and products, our model discerns the molecular transformations that occur during the reaction, thereby enhancing the comprehension of the reaction mechanism. We have designed an adapter structure to incorporate reaction conditions into the chemical reaction representation, allowing the model to handle diverse reaction conditions and adapt to various datasets and downstream tasks, e.g., reaction performance prediction.
arXiv Detail & Related papers (2024-11-26T17:41:44Z)
SynCoTrain: A Dual Classifier PU-learning Framework for Synthesizability Prediction [0.0]
We present SynCoTrain, a semi-supervised machine learning model designed to predict the synthesizability of materials. Our approach uses Positive and Unlabeled (PU) Learning to address the absence of explicit negative data. The model demonstrates robust performance, achieving high recall on internal and leave-out test sets.
arXiv Detail & Related papers (2024-11-18T19:53:19Z)
log-RRIM: Yield Prediction via Local-to-global Reaction Representation Learning and Interaction Modeling [6.310759215182946]
log-RRIM is an innovative graph transformer-based framework designed for predicting chemical reaction yields. Our approach implements a unique local-to-global reaction representation learning strategy. Its advanced modeling of reactant-reagent interactions and sensitivity to small molecular fragments make it a valuable tool for reaction planning and optimization in chemical synthesis.
arXiv Detail & Related papers (2024-10-20T18:35:56Z)
Smoke and Mirrors in Causal Downstream Tasks [59.90654397037007]
This paper looks at the causal inference task of treatment effect estimation, where the outcome of interest is recorded in high-dimensional observations. We compare 6,480 models fine-tuned from state-of-the-art visual backbones, and find that the sampling and modeling choices significantly affect the accuracy of the causal estimate. Our results suggest that future benchmarks should carefully consider real downstream scientific questions, especially causal ones.
arXiv Detail & Related papers (2024-05-27T13:26:34Z)
Retrosynthesis prediction enhanced by in-silico reaction data augmentation [66.5643280109899]
We present RetroWISE, a framework that employs a base model inferred from real paired data to perform in-silico reaction generation and augmentation. On three benchmark datasets, RetroWISE achieves the best overall performance against state-of-the-art models.
arXiv Detail & Related papers (2024-01-31T07:40:37Z)
ReactIE: Enhancing Chemical Reaction Extraction with Weak Supervision [27.850325653751078]
Structured chemical reaction information plays a vital role for chemists engaged in laboratory work and advanced endeavors such as computer-aided drug design. Despite the importance of extracting structured reactions from scientific literature, data annotation for this purpose is cost-prohibitive due to the significant labor required from domain experts. We propose ReactIE, which combines two weakly supervised approaches for pre-training. Our method utilizes frequent patterns within the text as linguistic cues to identify specific characteristics of chemical reactions.
arXiv Detail & Related papers (2023-07-04T02:52:30Z)
ChemVise: Maximizing Out-of-Distribution Chemical Detection with the Novel Application of Zero-Shot Learning [60.02503434201552]
This research proposes learning approximations of complex exposures from training sets of simple ones. We demonstrate that applying this approach to synthetic sensor responses surprisingly improves the detection of out-of-distribution obscured chemical analytes.
arXiv Detail & Related papers (2023-02-09T20:19:57Z)
Rxn Hypergraph: a Hypergraph Attention Model for Chemical Reaction Representation [70.97737157902947]
There is currently no universal and widely adopted method for robustly representing chemical reactions. Here we exploit graph-based representations of molecular structures to develop and test a hypergraph attention neural network approach. We evaluate this hypergraph representation in three experiments using three independent data sets of chemical reactions.
arXiv Detail & Related papers (2022-01-02T12:33:10Z)
Unassisted Noise Reduction of Chemical Reaction Data Sets [59.127921057012564]
We propose a machine learning-based, unassisted approach to remove chemically wrong entries from data sets. Our results show an improved prediction quality for models trained on the cleaned and balanced data sets.
arXiv Detail & Related papers (2021-02-02T09:34:34Z)
Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis [1.6449390849183363]
Retrosynthesis is the problem of inferring the reactant compounds needed to synthesize a given product compound through chemical reactions. Recent studies on retrosynthesis focus on proposing more sophisticated prediction models. The dataset fed to the models also plays an essential role in achieving the best-generalizing models.
arXiv Detail & Related papers (2020-10-02T05:27:51Z)
Chemical Property Prediction Under Experimental Biases [26.407895054724452]
This study focuses on mitigating bias in the experimental datasets. We adopted two techniques from causal inference combined with graph neural networks that can represent molecular structures. The experimental results in four possible bias scenarios indicated that the inverse propensity scoring-based method and the counter-factual regression-based method made solid improvements.
arXiv Detail & Related papers (2020-09-18T08:40:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.