Data-Error Scaling in Machine Learning on Natural Discrete Combinatorial Mutation-prone Sets: Case Studies on Peptides and Small Molecules
- URL: http://arxiv.org/abs/2405.05167v1
- Date: Wed, 8 May 2024 16:04:50 GMT
- Title: Data-Error Scaling in Machine Learning on Natural Discrete Combinatorial Mutation-prone Sets: Case Studies on Peptides and Small Molecules
- Authors: Vanni Doffini, O. Anatole von Lilienfeld, Michael A. Nash
- Abstract summary: We investigate trends in the data-error scaling behavior of machine learning (ML) models trained on discrete spaces that are prone to mutation.
In contrast to typical data-error scaling, our results showed discontinuous monotonic phase transitions during learning.
We present an alternative strategy to normalize learning curves and the concept of mutant-based shuffling.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate trends in the data-error scaling behavior of machine learning (ML) models trained on discrete combinatorial spaces that are prone to mutation, such as proteins or organic small molecules. We trained and evaluated kernel ridge regression machines using variable amounts of computationally generated training data. Our synthetic datasets comprise i) two naïve functions based on many-body theory; ii) binding energy estimates between a protein and a mutagenised peptide; and iii) solvation energies of two 6-heavy-atom structural graphs. In contrast to typical data-error scaling, our results showed discontinuous monotonic phase transitions during learning, observed as rapid drops in the test error at particular thresholds of training data. We observed two learning regimes, which we call saturated and asymptotic decay, and found that they are conditioned by the level of complexity (i.e. the number of mutations) enclosed in the training set. We show that during training on this class of problems, the predictions of the ML models were clustered, as observed in calibration plots. Furthermore, we present an alternative strategy to normalize learning curves (LCs) and the concept of mutant-based shuffling. This work has implications for machine learning on mutagenisable discrete spaces, such as chemical property or protein phenotype prediction, and improves basic understanding of concepts in statistical learning theory.
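The core measurement behind such learning curves can be sketched in a few lines: fit kernel ridge regression (KRR) on progressively larger training subsets and record the test error at each size. The toy target below, an additive term plus a pairwise many-body-style term over binary mutation vectors, and all hyperparameters are illustrative assumptions, not the paper's datasets or settings.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-in for a mutation-prone combinatorial space: binary "mutation"
# vectors with an additive plus pairwise (many-body-style) target function.
X = rng.integers(0, 2, size=(2000, 20)).astype(float)
w1 = rng.normal(size=20)
W2 = rng.normal(size=(20, 20)) / 20.0
y = X @ w1 + np.einsum("ni,ij,nj->n", X, W2, X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=500, random_state=0)

# Learning curve: test mean absolute error as a function of training-set size N.
for n in [25, 50, 100, 200, 400, 800, 1500]:
    model = KernelRidge(kernel="rbf", gamma=0.05, alpha=1e-6)
    model.fit(X_train[:n], y_train[:n])
    mae = np.mean(np.abs(model.predict(X_test) - y_test))
    print(f"N={n:5d}  test MAE={mae:.4f}")
```

Plotting N against the test MAE on log-log axes would expose the scaling regime: a power-law decay appears as a straight line, while the phase transitions reported above would show up as abrupt drops at particular training-set sizes.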
Related papers
- Stacked ensemble-based mutagenicity prediction model using multiple modalities with graph attention network [0.9736758288065405]
Mutagenicity is a concern due to its association with genetic mutations, which can result in a variety of negative consequences.
In this work, we introduce a novel stacked-ensemble-based mutagenicity prediction model.
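As a rough illustration of the stacking idea (base learners whose out-of-fold predictions feed a meta-learner), here is a minimal scikit-learn sketch; the random descriptors, labels, and choice of estimators are assumptions, and the paper's multi-modal inputs and graph attention network are not reproduced.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))                  # stand-in molecular descriptors
y = (X[:, 0] + X[:, 1] ** 2 > 1.0).astype(int)  # stand-in mutagenicity labels

# Base learners' out-of-fold predictions become features for the meta-learner.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
print("CV accuracy:", cross_val_score(stack, X, y, cv=5).mean().round(3))
```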
arXiv Detail & Related papers (2024-09-03T09:14:21Z) - Learning to Predict Mutation Effects of Protein-Protein Interactions by Microenvironment-aware Hierarchical Prompt Learning [78.38442423223832]
We develop a novel codebook pre-training task, namely masked microenvironment modeling.
We demonstrate superior performance and training efficiency over state-of-the-art pre-training-based methods in mutation effect prediction.
arXiv Detail & Related papers (2024-05-16T03:53:21Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics follow a stage-wise convergence process.
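A minimal sketch of the studied setting, assuming a toy in-context linear-regression task: a single softmax-attention layer trained by plain gradient descent. All dimensions, data, and the learning rate are invented for illustration.

```python
import torch

torch.manual_seed(0)
d, n_ctx, n_batch = 8, 16, 256

class OneLayerAttention(torch.nn.Module):
    """A single softmax-attention layer: one query token attends over context tokens."""
    def __init__(self, dim):
        super().__init__()
        self.Wq = torch.nn.Linear(dim, dim, bias=False)
        self.Wk = torch.nn.Linear(dim, dim, bias=False)
        self.Wv = torch.nn.Linear(dim, 1, bias=False)

    def forward(self, ctx, query):
        # ctx: (batch, n_ctx, dim); query: (batch, dim)
        scores = torch.einsum("bd,bnd->bn", self.Wq(query), self.Wk(ctx))
        attn = torch.softmax(scores / ctx.shape[-1] ** 0.5, dim=-1)  # softmax attention
        return torch.einsum("bn,bno->bo", attn, self.Wv(ctx)).squeeze(-1)

model = OneLayerAttention(d + 1)
opt = torch.optim.SGD(model.parameters(), lr=0.5)      # plain gradient descent

for step in range(1001):
    w = torch.randn(n_batch, 1, d)                     # fresh linear task per row
    x = torch.randn(n_batch, n_ctx, d)
    y = (x * w).sum(-1, keepdim=True)                  # labels of context examples
    ctx = torch.cat([x, y], dim=-1)                    # context tokens are (x_i, y_i)
    xq = torch.randn(n_batch, d)
    query = torch.cat([xq, torch.zeros(n_batch, 1)], dim=-1)
    target = (xq * w.squeeze(1)).sum(-1)
    loss = ((model(ctx, query) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 200 == 0:
        print(f"step {step:4d}  loss {loss.item():.3f}")
```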
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - Uncertainty quantification for predictions of atomistic neural networks [0.0]
This paper explores the value of uncertainty quantification on predictions for trained neural networks (NNs) on quantum chemical reference data.
The architecture of the PhysNet NN was suitably modified and the resulting model was evaluated with different metrics to quantify calibration, quality of predictions, and whether prediction error and the predicted uncertainty can be correlated.
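One such check, whether predicted uncertainty tracks realized error, can be sketched with a simple bootstrap ensemble standing in for the modified PhysNet model; the data, learner, and ensemble construction below are assumptions.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=1000)
X_tr, y_tr, X_te, y_te = X[:700], y[:700], X[700:], y[700:]

# Bootstrap ensemble: the spread of member predictions is the uncertainty proxy.
preds = []
for seed in range(10):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    member = GradientBoostingRegressor(random_state=seed).fit(X_tr[idx], y_tr[idx])
    preds.append(member.predict(X_te))
preds = np.asarray(preds)
uncertainty = preds.std(axis=0)               # predicted uncertainty
error = np.abs(preds.mean(axis=0) - y_te)     # realized absolute error

# Rank correlation: a useful uncertainty estimate gives rho noticeably above zero.
rho, _ = spearmanr(uncertainty, error)
print("Spearman rho(uncertainty, error):", round(rho, 3))
```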
arXiv Detail & Related papers (2022-07-14T13:39:43Z) - Automated analysis of continuum fields from atomistic simulations using statistical machine learning [0.0]
We develop a methodology using statistical data mining and machine learning algorithms to automate the analysis of continuum field variables in atomistic simulations.
We focus on three important field variables: total strain, elastic strain and microrotation.
The peaks in the distribution of total strain are identified with a Gaussian mixture model and methods to circumvent overfitting problems are presented.
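A minimal sketch of this step, assuming a bimodal toy strain distribution: fit Gaussian mixtures of increasing size and select the component count by BIC, a standard guard against overfitting.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Bimodal toy "total strain" samples: a broad background plus a strained peak.
strain = np.concatenate([
    rng.normal(0.00, 0.010, 5000),
    rng.normal(0.05, 0.005, 1500),
]).reshape(-1, 1)

# Fit mixtures with 1..6 components; keep the lowest-BIC model to avoid
# fitting spurious extra peaks.
fits = [GaussianMixture(n_components=k, random_state=0).fit(strain)
        for k in range(1, 7)]
best = min(fits, key=lambda g: g.bic(strain))
print("selected components:", best.n_components)
print("peak locations:", np.sort(best.means_.ravel()))
```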
arXiv Detail & Related papers (2022-06-16T10:05:43Z) - Amortized Inference for Causal Structure Learning [72.84105256353801]
Learning causal structure poses a search problem that typically involves evaluating structures using a score or independence test.
We train a variational inference model to predict the causal structure from observational/interventional data.
Our models exhibit robust generalization capabilities under substantial distribution shift.
arXiv Detail & Related papers (2022-05-25T17:37:08Z) - Biased Hypothesis Formation From Projection Pursuit [0.0]
The effect of bias on hypothesis formation is characterized for an automated data-driven projection pursuit neural network.
This intelligent exploratory process partitions a complete state space into disjoint subspaces to create working hypotheses.
arXiv Detail & Related papers (2022-01-03T22:02:26Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
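The phenomenon itself is easy to reproduce in a toy random-features model, where the test error of the minimum-norm least-squares fit peaks near the interpolation threshold (number of features roughly equal to number of training samples) and then falls again; this generic demonstration is not the paper's multi-scale setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 20
w_true = rng.normal(size=d)

X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
y_tr = X_tr @ w_true + 0.5 * rng.normal(size=n_train)
y_te = X_te @ w_true

# Sweep the number of random ReLU features p across the interpolation
# threshold p ~ n_train; the test error typically peaks there and drops again.
for p in [20, 50, 90, 100, 110, 200, 500, 2000]:
    W = rng.normal(size=(d, p)) / np.sqrt(d)
    F_tr = np.maximum(X_tr @ W, 0.0)
    F_te = np.maximum(X_te @ W, 0.0)
    beta = np.linalg.pinv(F_tr) @ y_tr        # minimum-norm least squares
    mse = np.mean((F_te @ beta - y_te) ** 2)
    print(f"p={p:5d}  test MSE={mse:9.3f}")
```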
arXiv Detail & Related papers (2021-12-06T18:17:08Z) - SPLDExtraTrees: Robust machine learning approach for predicting kinase
inhibitor resistance [1.0674604700001966]
We propose a robust machine learning method, SPLDExtraTrees, which can accurately predict ligand binding affinity changes upon protein mutation.
The proposed method ranks training data following a specific scheme that starts with easy-to-learn samples.
Experiments substantiate the capability of the proposed method for predicting kinase inhibitor resistance under three scenarios.
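The self-paced core of that scheme can be sketched as follows: refit extremely randomized trees while gradually admitting harder samples, ranked by their current loss. The data and pace schedule are assumptions, and any diversity term implied by the SPLD acronym is omitted.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 16))                            # stand-in mutation features
y = X[:, 0] ** 2 + X[:, 1] + 0.2 * rng.normal(size=800)  # stand-in affinity-change labels

# Start from a model fit on everything, then repeatedly refit on the
# currently easiest fraction of samples, letting that fraction grow.
model = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, y)
for frac in [0.3, 0.5, 0.7, 1.0]:
    loss = (model.predict(X) - y) ** 2                    # current per-sample loss
    keep = np.argsort(loss)[: int(frac * len(X))]         # easiest samples first
    model = ExtraTreesRegressor(n_estimators=100, random_state=0)
    model.fit(X[keep], y[keep])
print("final training MSE:", np.mean((model.predict(X) - y) ** 2))
```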
arXiv Detail & Related papers (2021-11-15T09:07:45Z) - Equivariant vector field network for many-body system modeling [65.22203086172019]
Equivariant Vector Field Network (EVFN) is built on a novel equivariant basis and the associated scalarization and vectorization layers.
We evaluate our method on predicting trajectories of simulated Newton mechanics systems with both full and partially observed data.
arXiv Detail & Related papers (2021-10-26T14:26:25Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)