Disentangled Deep Smoothed Bootstrap for Fair Imbalanced Regression
- URL: http://arxiv.org/abs/2508.13829v1
- Date: Tue, 19 Aug 2025 13:40:04 GMT
- Title: Disentangled Deep Smoothed Bootstrap for Fair Imbalanced Regression
- Authors: Samuel Stocksieker, Denys pommeret, Arthur Charpentier,
- Abstract summary: Imbalanced distribution learning is a common and significant challenge in predictive modeling, often reducing the performance of standard algorithms.<n>We propose using Variational Autoencoders (VAEs) to model and define a latent representation of data distributions.<n>To address this, we develop an innovative data generation method that combines a disentangled VAE with a Smoothed Bootstrap applied in the latent space.
- Score: 1.2289361708127877
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Imbalanced distribution learning is a common and significant challenge in predictive modeling, often reducing the performance of standard algorithms. Although various approaches address this issue, most are tailored to classification problems, with a limited focus on regression. This paper introduces a novel method to improve learning on tabular data within the Imbalanced Regression (IR) framework, which is a critical problem. We propose using Variational Autoencoders (VAEs) to model and define a latent representation of data distributions. However, VAEs can be inefficient with imbalanced data like other standard approaches. To address this, we develop an innovative data generation method that combines a disentangled VAE with a Smoothed Bootstrap applied in the latent space. We evaluate the efficiency of this method through numerical comparisons with competitors on benchmark datasets for IR.
Related papers
- Generalization is not a universal guarantee: Estimating similarity to training data with an ensemble out-of-distribution metric [0.09363323206192666]
Failure of machine learning models to generalize to new data is a core problem limiting the reliability of AI systems.<n>We propose a standardized approach for assessing data similarity by constructing a supervised autoencoder for generalizability estimation (SAGE)<n>We show that out-of-the-box model performance increases after SAGE score filtering, even when applied to data from the model's own training and test datasets.
arXiv Detail & Related papers (2025-02-22T19:21:50Z) - DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets.<n>Our framework, textttDUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining.<n>Specifically, given the evaluated data utilities of some data subsets, textttDUPRE fits a emphGaussian process (GP) regression model to predict the utility of every other data subset.
arXiv Detail & Related papers (2025-02-22T08:53:39Z) - Error Distribution Smoothing:Advancing Low-Dimensional Imbalanced Regression [2.435853975142516]
In real-world regression tasks, datasets frequently exhibit imbalanced distributions, characterized by a scarcity of data in high-complexity regions and an abundance in low-complexity areas.<n>We introduce a novel concept of Imbalanced Regression, which takes into account both the complexity of the problem and the density of data points, extending beyond traditional definitions that focus only on data density.<n>We propose Error Distribution Smoothing (EDS) as a solution to tackle imbalanced regression, effectively selecting a representative subset from the dataset to reduce redundancy while maintaining balance and representativeness.
arXiv Detail & Related papers (2025-02-04T12:40:07Z) - Data Augmentation with Variational Autoencoder for Imbalanced Dataset [1.2289361708127877]
Learning from an imbalanced distribution presents a major challenge in predictive modeling.<n>We develop a novel approach for generating data, combining VAE with a smoothed bootstrap, specifically designed to address the challenges of IR.
arXiv Detail & Related papers (2024-12-09T22:59:03Z) - Adaptive debiased SGD in high-dimensional GLMs with streaming data [4.704144189806667]
This paper introduces a novel approach to online inference in high-dimensional generalized linear models.<n>Our method operates in a single-pass mode, making it different from existing methods that require full dataset access or large-dimensional summary statistics storage.<n>The core of our methodological innovation lies in an adaptive descent algorithm tailored for dynamic objective functions, coupled with a novel online debiasing procedure.
arXiv Detail & Related papers (2024-05-28T15:36:48Z) - Vanishing Variance Problem in Fully Decentralized Neural-Network Systems [0.8212195887472242]
Federated learning and gossip learning are emerging methodologies designed to mitigate data privacy concerns.
Our research introduces a variance-corrected model averaging algorithm.
Our simulation results demonstrate that our approach enables gossip learning to achieve convergence efficiency comparable to that of federated learning.
arXiv Detail & Related papers (2024-04-06T12:49:20Z) - Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - Generalized Oversampling for Learning from Imbalanced datasets and
Associated Theory [0.0]
In supervised learning, it is quite frequent to be confronted with real imbalanced datasets.
We propose a data augmentation procedure, the GOLIATH algorithm, based on kernel density estimates.
We evaluate the performance of the GOLIATH algorithm in imbalanced regression situations.
arXiv Detail & Related papers (2023-08-05T23:08:08Z) - Non-Invasive Fairness in Learning through the Lens of Data Drift [88.37640805363317]
We show how to improve the fairness of Machine Learning models without altering the data or the learning algorithm.
We use a simple but key insight: the divergence of trends between different populations, and, consecutively, between a learned model and minority populations, is analogous to data drift.
We explore two strategies (model-splitting and reweighing) to resolve this drift, aiming to improve the overall conformance of models to the underlying data.
arXiv Detail & Related papers (2023-03-30T17:30:42Z) - Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z) - Compound Batch Normalization for Long-tailed Image Classification [77.42829178064807]
We propose a compound batch normalization method based on a Gaussian mixture.
It can model the feature space more comprehensively and reduce the dominance of head classes.
The proposed method outperforms existing methods on long-tailed image classification.
arXiv Detail & Related papers (2022-12-02T07:31:39Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z) - Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.