Chemical Property Prediction Under Experimental Biases
- URL: http://arxiv.org/abs/2009.08687v3
- Date: Thu, 9 Dec 2021 16:12:39 GMT
- Title: Chemical Property Prediction Under Experimental Biases
- Authors: Yang Liu and Hisashi Kashima
- Abstract summary: This study focuses on mitigating bias in the experimental datasets.
We adopted two techniques from causal inference combined with graph neural networks that can represent molecular structures.
The experimental results in four possible bias scenarios indicated that the inverse propensity scoring-based method and the counter-factual regression-based method made solid improvements.
- Score: 26.407895054724452
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Predicting the chemical properties of compounds is crucial in discovering
novel materials and drugs with specific desired characteristics. Recent
significant advances in machine learning technologies have enabled automatic
predictive modeling from past experimental data reported in the literature.
However, these datasets are often biased because of various reasons, such as
experimental plans and publication decisions, and the prediction models trained
using such biased datasets often suffer from over-fitting to the biased
distributions and perform poorly on subsequent uses. Hence, this study focused
on mitigating bias in the experimental datasets. We adopted two techniques from
causal inference combined with graph neural networks that can represent
molecular structures. The experimental results in four possible bias scenarios
indicated that the inverse propensity scoring-based method and the
counter-factual regression-based method made solid improvements.
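As a rough illustration of the first technique named in the abstract, the sketch below computes an inverse-propensity-weighted squared error. It is not the authors' implementation: the function names `ips_weights` and `ips_weighted_mse` are hypothetical, the propensities (the probability that each compound was measured and reported) are assumed to be given, and the clipping and self-normalization are common stabilizing choices that the abstract does not mention. The graph neural network that would produce the predictions is omitted.

```python
import numpy as np


def ips_weights(propensity, clip=0.05):
    """Return inverse propensity weights 1/p, with p clipped from below.

    `propensity` is assumed to hold, for each observed compound, the
    probability that it entered the dataset. Clipping (an assumption of
    this sketch) keeps very small propensities from dominating the loss.
    """
    p = np.clip(np.asarray(propensity, dtype=float), clip, 1.0)
    return 1.0 / p


def ips_weighted_mse(y_true, y_pred, propensity, clip=0.05):
    """Self-normalized, inverse-propensity-weighted mean squared error."""
    w = ips_weights(propensity, clip)
    sq_err = (np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2
    # Samples that were unlikely to be observed receive larger weights,
    # so the loss approximates the error on the unbiased distribution.
    return float(np.sum(w * sq_err) / np.sum(w))


if __name__ == "__main__":
    # Toy values chosen for illustration only.
    print(ips_weighted_mse([0.0, 0.0], [1.0, 2.0], [0.5, 0.25]))
```

In this toy call the second sample has half the propensity of the first, so its error counts twice as much as it would in an unweighted mean.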
Related papers
- Balancing Molecular Information and Empirical Data in the Prediction of Physico-Chemical Properties [8.649679686652648]
We propose a general method for combining molecular descriptors with representation learning.
The proposed hybrid model exploits chemical structure information using graph neural networks.
It automatically detects cases where structure-based predictions are unreliable, in which case it corrects them by representation-learning based predictions.
arXiv Detail & Related papers (2024-06-12T10:51:00Z)
- Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z)
- Data Attribution for Diffusion Models: Timestep-induced Bias in Influence Estimation [53.27596811146316]
Diffusion models operate over a sequence of timesteps instead of instantaneous input-output relationships in previous contexts.
We present Diffusion-TracIn, which incorporates these temporal dynamics, and observe that samples' loss gradient norms are highly dependent on timestep.
We introduce Diffusion-ReTrac as a re-normalized adaptation that enables the retrieval of training samples more targeted to the test sample of interest.
arXiv Detail & Related papers (2024-01-17T07:58:18Z)
- Feature-Level Debiased Natural Language Understanding [86.8751772146264]
Existing natural language understanding (NLU) models often rely on dataset biases to achieve high performance on specific datasets.
We propose debiasing contrastive learning (DCT) to mitigate biased latent features and to account for the dynamic nature of bias, which existing methods neglect.
DCT outperforms state-of-the-art baselines on out-of-distribution datasets while maintaining in-distribution performance.
arXiv Detail & Related papers (2022-12-11T06:16:14Z)
- Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
- MetaRF: Differentiable Random Forest for Reaction Yield Prediction with a Few Trails [58.47364143304643]
In this paper, we focus on the reaction yield prediction problem.
We first put forth MetaRF, an attention-based differentiable random forest model designed specifically for few-shot yield prediction.
To improve the few-shot learning performance, we further introduce a dimension-reduction based sampling method.
arXiv Detail & Related papers (2022-08-22T06:40:13Z)
- Statistical quantification of confounding bias in predictive modelling [0.0]
I propose the partial and full confounder tests, which probe the null hypotheses of unconfounded and fully confounded models.
The tests provide a strict control for Type I errors and high statistical power, even for non-normally and non-linearly dependent predictions.
arXiv Detail & Related papers (2021-11-01T10:35:24Z)
- Dataset Bias in the Natural Sciences: A Case Study in Chemical Reaction Prediction and Synthesis Design [0.8594140167290099]
We identify three trends within the fields of chemical reaction prediction and synthesis design that require a change in direction.
First, the manner in which reaction datasets are split into reactants and reagents encourages testing models in an unrealistically generous manner.
Second, we highlight the prevalence of mislabelled data, and suggest that the focus should be on outlier removal rather than data fitting only.
arXiv Detail & Related papers (2021-05-06T13:11:56Z)
- Stochastic Threshold Model Trees: A Tree-Based Ensemble Method for Dealing with Extrapolation [0.0]
In the development of new materials, it is desirable to search for compounds with unprecedented physical properties.
We propose Stochastic Threshold Model Trees (STMT), which reflect the trend of the data while maintaining the accuracy of conventional methods.
In the case of the real data, although there is no significant overall improvement in accuracy, there is one compound for which the prediction accuracy is notably improved.
arXiv Detail & Related papers (2020-09-19T05:48:01Z)
- Balance-Subsampled Stable Prediction [55.13512328954456]
We propose a novel balance-subsampled stable prediction (BSSP) algorithm based on the theory of fractional factorial design.
A design-theoretic analysis shows that the proposed method can reduce the confounding effects among predictors induced by the distribution shift.
Numerical experiments on both synthetic and real-world data sets demonstrate that our BSSP algorithm significantly outperforms the baseline methods for stable prediction across unknown test data.
arXiv Detail & Related papers (2020-06-08T07:01:38Z)
- Overly Optimistic Prediction Results on Imbalanced Data: a Case Study of Flaws and Benefits when Applying Over-sampling [13.463035357173045]
We focus on one specific type of methodological flaw: applying over-sampling before partitioning the data into mutually exclusive training and testing sets.
We show how this causes the results to be biased using two artificial datasets and reproduce results of studies in which this flaw was identified.
arXiv Detail & Related papers (2020-01-15T12:53:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.