Collinear datasets augmentation using Procrustes validation sets
- URL: http://arxiv.org/abs/2312.04911v1
- Date: Fri, 8 Dec 2023 09:07:11 GMT
- Title: Collinear datasets augmentation using Procrustes validation sets
- Authors: Sergey Kucheryavskiy and Sergei Zhilin
- Abstract summary: We propose a new method for augmentation of numeric and mixed datasets.
The method generates additional data points by utilizing cross-validation resampling and latent variable modeling.
It is particularly efficient for datasets with moderate to high degrees of collinearity.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a new method for the augmentation of numeric and
mixed datasets. The method generates additional data points by utilizing
cross-validation resampling and latent variable modeling. It is particularly
efficient for datasets with moderate to high degrees of collinearity, as it
directly utilizes this property for generation. The method is simple, fast, and
has very few parameters, which, as shown in the paper, do not require specific
tuning. It has been tested on several real datasets; here, we report detailed
results for two cases: prediction of protein in minced meat based on
near-infrared spectra (fully numeric data with a high degree of collinearity) and
discrimination of patients referred for coronary angiography (mixed data, with
both numeric and categorical variables, and moderate collinearity). In both
cases, artificial neural networks were employed for developing the regression
and the discrimination models. The results show a clear improvement in the
performance of the models. For the prediction of meat protein, for example,
fitting the model to the augmented data reduced the root mean squared error
computed for the independent test set by a factor of 1.5 to 3.
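The abstract names three ingredients: cross-validation resampling, latent variable modeling, and (from the title) Procrustes alignment. The sketch below shows one way these pieces can fit together, using PCA as the latent variable model. It is a simplified illustration and not the authors' algorithm: the function names `pca_loadings` and `augment_collinear`, the parameters `n_comp` and `n_segments`, and the choice to carry the local residuals over unchanged are all assumptions made for this example.

```python
import numpy as np


def pca_loadings(X_centered, n_comp):
    """Return the first n_comp PCA loadings (variables x components)."""
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return vt[:n_comp].T


def augment_collinear(X, n_comp=2, n_segments=4, seed=0):
    """Hypothetical sketch: make one new row per original row.

    Each held-out segment is projected onto a PCA model fitted to the
    remaining rows, and the local scores are rotated into the global
    PCA space with an orthogonal Procrustes rotation.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    global_mean = X.mean(axis=0)
    P = pca_loadings(X - global_mean, n_comp)  # global loadings

    X_new = np.empty_like(X)
    shuffled = rng.permutation(len(X))
    for seg in np.array_split(shuffled, n_segments):
        keep = np.ones(len(X), dtype=bool)
        keep[seg] = False

        # Local model fitted without the held-out segment.
        local_mean = X[keep].mean(axis=0)
        P_k = pca_loadings(X[keep] - local_mean, n_comp)

        # Orthogonal Procrustes: R minimizes ||P_k R - P||_F.
        u, _, vt = np.linalg.svd(P_k.T @ P)
        R = u @ vt

        X_k = X[seg] - local_mean
        T_k = X_k @ P_k            # local scores
        E_k = X_k - T_k @ P_k.T    # local residuals (kept as-is here)
        X_new[seg] = T_k @ R @ P.T + E_k + global_mean
    return X_new


# Demo on synthetic collinear data: 10 variables driven by 2 latent factors.
demo_rng = np.random.default_rng(1)
latent = demo_rng.normal(size=(40, 2))
mixing = demo_rng.normal(size=(2, 10))
X_demo = latent @ mixing + 0.01 * demo_rng.normal(size=(40, 10))
X_aug = augment_collinear(X_demo, n_comp=2, n_segments=4, seed=0)
```

Repeating the call with different `seed` values gives different splits and therefore different generated rows, which could then be stacked with the original data before fitting a model. How the published method treats residuals, response values, and categorical variables is not described in the abstract and is not covered by this sketch.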
Related papers
- Diffusion posterior sampling for simulation-based inference in tall data settings [53.17563688225137]
Simulation-based inference (SBI) is capable of approximating the posterior distribution that relates input parameters to a given observation.
In this work, we consider a tall data extension in which multiple observations are available to better infer the parameters of the model.
We compare our method to recently proposed competing approaches on various numerical experiments and demonstrate its superiority in terms of numerical stability and computational cost.
arXiv Detail & Related papers (2024-04-11T09:23:36Z) - Data Augmentation Scheme for Raman Spectra with Highly Correlated
Annotations [0.23090185577016453]
We exploit the additive nature of spectra in order to generate additional data points from a given dataset that have statistically independent labels.
We show that training a CNN on these generated data points improves the performance on datasets where the annotations do not bear the same correlation as the dataset that was used for model training.
arXiv Detail & Related papers (2024-02-01T18:46:28Z) - The effect of data augmentation and 3D-CNN depth on Alzheimer's Disease
detection [51.697248252191265]
This work summarizes and strictly observes best practices regarding data handling, experimental design, and model evaluation.
We focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of a challenging problem in healthcare.
Within this framework, we train 15 predictive models, considering three different data augmentation strategies and five distinct 3D CNN architectures.
arXiv Detail & Related papers (2023-09-13T10:40:41Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods while using far fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - Data Augmentation for Seizure Prediction with Generative Diffusion Model [26.967247641926814]
Seizure prediction is of great importance to improve the life of patients.
The severe imbalance problem between preictal and interictal data still poses a great challenge.
Data augmentation is an intuitive way to solve this problem.
We propose a novel data augmentation method with diffusion model called DiffEEG.
arXiv Detail & Related papers (2023-06-14T05:44:53Z) - Scalable Regularised Joint Mixture Models [2.0686407686198263]
In many applications, data can be heterogeneous in the sense of spanning latent groups with different underlying distributions.
We propose an approach for heterogeneous data that allows joint learning of (i) explicit multivariate feature distributions, (ii) high-dimensional regression models and (iii) latent group labels.
The approach is demonstrably effective in high dimensions, combining data reduction for computational efficiency with a re-weighting scheme that retains key signals even when the number of features is large.
arXiv Detail & Related papers (2022-05-03T13:38:58Z) - A Variational Autoencoder for Heterogeneous Temporal and Longitudinal
Data [0.3749861135832073]
Recently proposed extensions to VAEs that can handle temporal and longitudinal data have applications in healthcare, behavioural modelling, and predictive maintenance.
We propose the heterogeneous longitudinal VAE (HL-VAE) that extends the existing temporal and longitudinal VAEs to heterogeneous data.
HL-VAE provides efficient inference for high-dimensional datasets and includes likelihood models for continuous, count, categorical, and ordinal data.
arXiv Detail & Related papers (2022-04-20T10:18:39Z) - Invariance Learning in Deep Neural Networks with Differentiable Laplace
Approximations [76.82124752950148]
We develop a convenient gradient-based method for selecting the data augmentation.
We use a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective.
arXiv Detail & Related papers (2022-02-22T02:51:11Z) - X-model: Improving Data Efficiency in Deep Learning with A Minimax Model [78.55482897452417]
We aim at improving data efficiency for both classification and regression setups in deep learning.
To take the power of both worlds, we propose a novel X-model.
X-model plays a minimax game between the feature extractor and task-specific heads.
arXiv Detail & Related papers (2021-10-09T13:56:48Z) - Increased peak detection accuracy in over-dispersed ChIP-seq data with
supervised segmentation models [2.2559617939136505]
We show that the unconstrained multiple changepoint detection model, with alternative noise assumptions and a suitable setup, reduces the over-dispersion exhibited by count data.
arXiv Detail & Related papers (2020-12-12T16:03:27Z) - An Investigation of Why Overparameterization Exacerbates Spurious
Correlations [98.3066727301239]
We identify two key properties of the training data that drive this behavior.
We show how the inductive bias of models towards "memorizing" fewer examples can cause overparameterization to hurt.
arXiv Detail & Related papers (2020-05-09T01:59:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.