SpanSeq: Similarity-based sequence data splitting method for improved
development and assessment of deep learning projects
- URL: http://arxiv.org/abs/2402.14482v2
- Date: Tue, 5 Mar 2024 12:02:46 GMT
- Title: SpanSeq: Similarity-based sequence data splitting method for improved
development and assessment of deep learning projects
- Authors: Alfred Ferrer Florensa, Jose Juan Almagro Armenteros, Henrik Nielsen,
Frank Møller Aarestrup, Philip Thomas Lanken Conradsen Clausen
- Abstract summary: We present SpanSeq, a database partition method for machine learning that can scale to most biological sequences.
We also explore the effect of not restraining similarity between sets by reproducing the development of the state-of-the-art model DeepLoc.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The use of deep learning models in computational biology has increased
massively in recent years, and is expected to grow further with the current
advances in fields like Natural Language Processing. These models, although
able to draw complex relations between input and target, are also largely
inclined to learn noisy deviations from the pool of data used during their
development. In order to assess their performance on unseen data (their
capacity to generalize), it is common to randomly split the available data into
development (train/validation) and test sets. This procedure, although
standard, has lately been shown to produce dubious assessments of
generalization due to the existing similarity between samples in the databases
used. In this work, we present SpanSeq, a database partition method for machine
learning that can scale to most biological sequences (genes, proteins and
genomes) in order to avoid data leakage between sets. We also explore the
effect of not restraining similarity between sets by reproducing the
development of the state-of-the-art model DeepLoc, not only confirming the
consequences of randomly splitting databases on the model assessment, but
expanding those repercussions to the model development. SpanSeq is available
for download and installation at
https://github.com/genomicepidemiology/SpanSeq.
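The core idea of such a partition method, keeping similar sequences on the same side of the split so they cannot leak between sets, can be sketched as a toy example. This is not SpanSeq itself (which scales to genes, proteins, and genomes); the naive positional identity metric, the greedy single-linkage clustering, and all function names below are illustrative assumptions only.

```python
# Toy sketch of similarity-aware data splitting: cluster sequences by a
# simple identity metric, then assign whole clusters to the development or
# test set, so near-duplicates never end up on both sides of the split.
from itertools import combinations

def identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence (toy metric)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def cluster_by_similarity(seqs, threshold=0.8):
    """Greedy single-linkage clustering via union-find: any pair above the
    identity threshold is forced into the same cluster."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if identity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def cluster_split(seqs, test_fraction=0.3, threshold=0.8):
    """Assign whole clusters (smallest first) to the test set until it
    reaches the target size; the remainder becomes the development set."""
    dev, test = [], []
    target = test_fraction * len(seqs)
    for cluster in sorted(cluster_by_similarity(seqs, threshold), key=len):
        (test if len(test) < target else dev).extend(cluster)
    return dev, test

seqs = ["ACGTACGT", "ACGTACGA", "TTTTCCCC", "GGGGAAAA", "GGGGAAAT"]
dev, test = cluster_split(seqs, test_fraction=0.4)
# The near-identical pairs (indices 0/1 and 3/4) land on the same side,
# whereas a random split could place one copy in each set.
```

A random split of the same five sequences would, with high probability, separate one of the near-identical pairs across the dev/test boundary, which is exactly the leakage the abstract warns inflates generalization estimates.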
Related papers
- Mambular: A Sequential Model for Tabular Deep Learning [0.7184556517162347]
This paper investigates the use of autoregressive state-space models for tabular data.
We compare their performance against established benchmark models.
Our findings indicate that interpreting features as a sequence and processing them can lead to significant performance improvement.
arXiv Detail & Related papers (2024-08-12T16:57:57Z)
- An improved tabular data generator with VAE-GMM integration [9.4491536689161]
We propose a novel Variational Autoencoder (VAE)-based model that addresses limitations of current approaches.
Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture.
We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones.
arXiv Detail & Related papers (2024-04-12T12:31:06Z)
- Unlearning Spurious Correlations in Chest X-ray Classification [4.039245878626345]
We train a deep learning model using a Covid-19 chest X-ray dataset.
We show how this dataset can lead to spurious correlations due to unintended confounding regions.
XBL is a deep learning approach that goes beyond interpretability by utilizing model explanations to interactively unlearn spurious correlations.
arXiv Detail & Related papers (2023-08-02T12:59:10Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- VertiBayes: Learning Bayesian network parameters from vertically partitioned data with missing values [2.9707233220536313]
Federated learning makes it possible to train a machine learning model on decentralized data.
We propose a novel method called VertiBayes to train Bayesian networks on vertically partitioned data.
We experimentally show our approach produces models comparable to those learnt using traditional algorithms.
arXiv Detail & Related papers (2022-10-31T11:13:35Z)
- Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data [81.43750358586072]
We propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes.
We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets.
arXiv Detail & Related papers (2022-10-24T08:57:55Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Time-Varying Propensity Score to Bridge the Gap between the Past and Present [104.46387765330142]
We introduce a time-varying propensity score that can detect gradual shifts in the distribution of data.
We demonstrate different ways of implementing it and evaluate it on a variety of problems.
arXiv Detail & Related papers (2022-10-04T07:21:49Z)
- Amortized Inference for Causal Structure Learning [72.84105256353801]
Learning causal structure poses a search problem that typically involves evaluating structures using a score or independence test.
We train a variational inference model to predict the causal structure from observational/interventional data.
Our models exhibit robust generalization capabilities under substantial distribution shift.
arXiv Detail & Related papers (2022-05-25T17:37:08Z)
- Supervised Learning and Model Analysis with Compositional Data [4.082799056366927]
KernelBiome is a kernel-based non-parametric regression and classification framework for compositional data.
We demonstrate on-par or improved performance compared with state-of-the-art machine learning methods.
arXiv Detail & Related papers (2022-05-15T12:33:43Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Harmonization with Flow-based Causal Inference [12.739380441313022]
This paper presents a normalizing-flow-based method to perform counterfactual inference upon a structural causal model (SCM) to harmonize medical data.
We evaluate on multiple, large, real-world medical datasets to observe that this method leads to better cross-domain generalization compared to state-of-the-art algorithms.
arXiv Detail & Related papers (2021-06-12T19:57:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.