Regression Augmentation With Data-Driven Segmentation
- URL: http://arxiv.org/abs/2508.01455v1
- Date: Sat, 02 Aug 2025 18:12:11 GMT
- Title: Regression Augmentation With Data-Driven Segmentation
- Authors: Shayan Alahyari, Shiva Mehdipour Ghobadlou, Mike Domaratzki,
- Abstract summary: Imbalanced regression arises when the target distribution is skewed, causing models to focus on dense regions and struggle with underrepresented (minority) samples.<n>We propose a fully data-driven GAN-based augmentation framework that uses Mahalanobis-Gaussian Mixture Modeling (GMM) to automatically identify minority samples.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Imbalanced regression arises when the target distribution is skewed, causing models to focus on dense regions and struggle with underrepresented (minority) samples. Despite its relevance across many applications, few methods have been designed specifically for this challenge. Existing approaches often rely on fixed, ad hoc thresholds to label samples as rare or common, overlooking the continuous complexity of the joint feature-target space and fail to represent the true underlying rare regions. To address these limitations, we propose a fully data-driven GAN-based augmentation framework that uses Mahalanobis-Gaussian Mixture Modeling (GMM) to automatically identify minority samples and employs deterministic nearest-neighbour matching to enrich sparse regions. Rather than preset thresholds, our method lets the data determine which observations are truly rare. Evaluation on 32 benchmark imbalanced regression datasets demonstrates that our approach consistently outperforms state-of-the-art data augmentation methods.
Related papers
- Generate Aligned Anomaly: Region-Guided Few-Shot Anomaly Image-Mask Pair Synthesis for Industrial Inspection [53.137651284042434]
Anomaly inspection plays a vital role in industrial manufacturing, but the scarcity of anomaly samples limits the effectiveness of existing methods.<n>We propose Generate grained Anomaly (GAA), a region-guided, few-shot anomaly image-mask pair generation framework.<n>GAA generates realistic, diverse, and semantically aligned anomalies using only a small number of samples.
arXiv Detail & Related papers (2025-07-13T12:56:59Z) - CART-based Synthetic Tabular Data Generation for Imbalanced Regression [1.342834401139078]
We propose adapting an existing CART-based synthetic data generation method, tailoring it for imbalanced regression.<n>The new method integrates relevance and density-based mechanisms to guide sampling in sparse regions of the target space.<n>Our experimental study focuses on the prediction of extreme target values across benchmark datasets.
arXiv Detail & Related papers (2025-06-03T12:42:20Z) - Local distribution-based adaptive oversampling for imbalanced regression [0.0]
Imbalanced regression occurs when continuous target variables have skewed distributions, creating sparse regions.<n>We propose LDAO (Local Distribution-based Adaptive Oversampling), a novel data-level approach that avoids categorizing individual samples as rare or frequent.<n> LDAO achieves a balanced representation across the entire target range while preserving the inherent statistical structure within each local distribution.
arXiv Detail & Related papers (2025-04-19T14:36:41Z) - Unsupervised Domain Adaptation for Action Recognition via Self-Ensembling and Conditional Embedding Alignment [2.06242362470764]
We propose a novel joint optimization architecture comprised of three functions: consistency regularizer, temporal ensemble and conditional distribution alignment.
$mu$DAR results in a range of $approx$ 4-12% average macro-F1 score improvement over six state-of-the-art UDA methods in four benchmark wHAR datasets.
arXiv Detail & Related papers (2024-10-23T00:59:27Z) - Theory on Score-Mismatched Diffusion Models and Zero-Shot Conditional Samplers [49.97755400231656]
We present the first performance guarantee with explicit dimensional dependencies for general score-mismatched diffusion samplers.<n>We show that score mismatches result in an distributional bias between the target and sampling distributions, proportional to the accumulated mismatch between the target and training distributions.<n>This result can be directly applied to zero-shot conditional samplers for any conditional model, irrespective of measurement noise.
arXiv Detail & Related papers (2024-10-17T16:42:12Z) - Aggregation Weighting of Federated Learning via Generalization Bound
Estimation [65.8630966842025]
Federated Learning (FL) typically aggregates client model parameters using a weighting approach determined by sample proportions.
We replace the aforementioned weighting method with a new strategy that considers the generalization bounds of each local model.
arXiv Detail & Related papers (2023-11-10T08:50:28Z) - Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - The Decaying Missing-at-Random Framework: Model Doubly Robust Causal Inference with Partially Labeled Data [8.916614661563893]
We introduce a missing-at-random (decaying MAR) framework and associated approaches for doubly robust causal inference.<n>This simultaneously addresses selection bias in the labeling mechanism and the extreme imbalance between labeled and unlabeled groups.<n>To ensure robust causal conclusions, we propose a bias-reduced SS estimator for the average treatment effect.
arXiv Detail & Related papers (2023-05-22T07:37:12Z) - Predicting Out-of-Domain Generalization with Neighborhood Invariance [59.05399533508682]
We propose a measure of a classifier's output invariance in a local transformation neighborhood.
Our measure is simple to calculate, does not depend on the test point's true label, and can be applied even in out-of-domain (OOD) settings.
In experiments on benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our measure and actual OOD generalization.
arXiv Detail & Related papers (2022-07-05T14:55:16Z) - Semi-supervised Long-tailed Recognition using Alternate Sampling [95.93760490301395]
Main challenges in long-tailed recognition come from the imbalanced data distribution and sample scarcity in its tail classes.
We propose a new recognition setting, namely semi-supervised long-tailed recognition.
We demonstrate significant accuracy improvements over other competitive methods on two datasets.
arXiv Detail & Related papers (2021-05-01T00:43:38Z) - Model-based clustering of partial records [11.193504036335503]
We develop clustering methodology through a model-based approach using the marginal density for the observed values.
We compare our algorithm to the corresponding full expectation-maximization (EM) approach that considers the missing values in the incomplete data set.
Simulation studies demonstrate that our approach has favorable recovery of the true cluster partition compared to case deletion and imputation.
arXiv Detail & Related papers (2021-03-30T13:30:59Z) - Targeted stochastic gradient Markov chain Monte Carlo for hidden Markov models with rare latent states [48.705095800341944]
Markov chain Monte Carlo (MCMC) algorithms for hidden Markov models often rely on the forward-backward sampler.
This makes them computationally slow as the length of the time series increases, motivating the development of sub-sampling-based approaches.
We propose a targeted sub-sampling approach that over-samples observations corresponding to rare latent states when calculating the gradient of parameters associated with them.
arXiv Detail & Related papers (2018-10-31T17:44:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.