SMOGAN: Synthetic Minority Oversampling with GAN Refinement for Imbalanced Regression
- URL: http://arxiv.org/abs/2504.21152v1
- Date: Tue, 29 Apr 2025 20:15:25 GMT
- Title: SMOGAN: Synthetic Minority Oversampling with GAN Refinement for Imbalanced Regression
- Authors: Shayan Alahyari, Mike Domaratzki
- Abstract summary: Imbalanced regression refers to prediction tasks where the target variable is skewed. This skewness hinders machine learning models, especially neural networks, which concentrate on dense regions. We propose SMOGAN, a two-step oversampling framework for imbalanced regression.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Imbalanced regression refers to prediction tasks where the target variable is skewed. This skewness hinders machine learning models, especially neural networks, which concentrate on dense regions and therefore perform poorly on underrepresented (minority) samples. Despite the importance of this problem, only a few methods have been proposed for imbalanced regression. Many of the available solutions for imbalanced regression adapt techniques from the class imbalance domain, such as linear interpolation and the addition of Gaussian noise, to create synthetic data in sparse regions. However, in many cases, the underlying distribution of the data is complex and non-linear. Consequently, these approaches generate synthetic samples that do not accurately represent the true feature-target relationship. To overcome these limitations, we propose SMOGAN, a two-step oversampling framework for imbalanced regression. In Stage 1, an existing oversampler generates initial synthetic samples in sparse target regions. In Stage 2, we introduce DistGAN, a distribution-aware GAN that serves as SMOGAN's filtering layer and refines these samples via adversarial loss augmented with a Maximum Mean Discrepancy objective, aligning them with the true joint feature-target distribution. Extensive experiments on 23 imbalanced datasets show that SMOGAN consistently outperforms the default oversampling method without the DistGAN filtering layer.
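The abstract's Stage 2 combines an adversarial loss with a Maximum Mean Discrepancy (MMD) objective. The sketch below is a minimal, hypothetical PyTorch rendering of such a combined generator loss; the RBF kernel mixture, the bandwidths, the weighting `lambda_mmd`, and the non-saturating adversarial form are illustrative assumptions, not the paper's exact DistGAN configuration.

```python
import torch

def rbf_mmd2(x, y, bandwidths=(0.5, 1.0, 2.0)):
    """Biased (V-statistic) estimate of squared MMD with a mixture of RBF kernels.

    x: (n, d) refined synthetic samples (features and target concatenated)
    y: (m, d) real samples drawn from the sparse target region
    """
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)  # pairwise squared Euclidean distances
        return sum(torch.exp(-d2 / (2.0 * bw ** 2)) for bw in bandwidths)

    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

def generator_loss(discriminator, refined, real_minority, lambda_mmd=1.0):
    # Non-saturating adversarial term (assumes the discriminator returns raw
    # logits): push refined samples toward being scored as real.
    adv = -torch.log(torch.sigmoid(discriminator(refined)) + 1e-8).mean()
    # MMD term aligning refined samples with the real joint feature-target
    # distribution, as the abstract describes.
    return adv + lambda_mmd * rbf_mmd2(refined, real_minority)
```

Because the MMD is computed on concatenated (features, target) vectors, minimizing it pulls the refined samples toward the joint feature-target distribution rather than matching features and targets separately.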
Related papers
- Local distribution-based adaptive oversampling for imbalanced regression
Imbalanced regression occurs when continuous target variables have skewed distributions, creating sparse regions.
We propose LDAO (Local Distribution-based Adaptive Oversampling), a novel data-level approach that avoids categorizing individual samples as rare or frequent.
LDAO achieves a balanced representation across the entire target range while preserving the inherent statistical structure within each local distribution.
arXiv Detail & Related papers (2025-04-19T14:36:41Z)
- Histogram Approaches for Imbalanced Data Streams Regression
Imbalanced domains pose a significant challenge in real-world predictive analytics, particularly in the context of regression.
This study introduces histogram-based sampling strategies to overcome this challenge.
Comprehensive experiments on synthetic and real-world benchmarks demonstrate that HistUS and HistOS substantially improve rare-case prediction accuracy.
arXiv Detail & Related papers (2025-01-29T11:03:02Z)
- On the Wasserstein Convergence and Straightness of Rectified Flow
Rectified Flow (RF) is a generative model that aims to learn straight flow trajectories from noise to data.
We provide a theoretical analysis of the Wasserstein distance between the sampling distribution of RF and the target distribution.
We present general conditions guaranteeing uniqueness and straightness of 1-RF, which is in line with previous empirical findings.
arXiv Detail & Related papers (2024-10-19T02:36:11Z)
- Theory on Score-Mismatched Diffusion Models and Zero-Shot Conditional Samplers
We present the first performance guarantee with explicit dimensional dependencies for general score-mismatched diffusion samplers.
We show that score mismatches result in a distributional bias between the target and sampling distributions, proportional to the accumulated mismatch between the target and training distributions.
This result can be directly applied to zero-shot conditional samplers for any conditional model, irrespective of measurement noise.
arXiv Detail & Related papers (2024-10-17T16:42:12Z)
- Self-Supervised Dataset Distillation for Transfer Learning
We propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL).
We first prove that a gradient of synthetic samples with respect to an SSL objective in naive bilevel optimization is biased due to randomness originating from data augmentations or masking.
We empirically validate the effectiveness of our method on various applications involving transfer learning.
arXiv Detail & Related papers (2023-10-10T10:48:52Z)
- INGB: Informed Nonlinear Granular Ball Oversampling Framework for Noisy Imbalanced Classification
In classification problems, datasets are often imbalanced, noisy, or complex.
This paper proposes INGB, an informed nonlinear oversampling framework based on granular balls, as a new direction for oversampling.
arXiv Detail & Related papers (2023-07-03T01:55:20Z)
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
- Imbalanced Class Data Performance Evaluation and Improvement using Novel Generative Adversarial Network-based Approach: SSG and GBO
This study proposes two novel techniques: GAN-based Oversampling (GBO) and Support Vector Machine-SMOTE-GAN (SSG).
Preliminary computational results show that SSG and GBO performed better than the original SMOTE on eight expanded imbalanced benchmark datasets.
arXiv Detail & Related papers (2022-10-23T22:17:54Z)
- Instability and Local Minima in GAN Training with Kernel Discriminators
Generative Adversarial Networks (GANs) are a widely-used tool for generative modeling of complex data.
Despite their empirical success, the training of GANs is not fully understood due to the min-max optimization of the generator and discriminator.
This paper analyzes these joint dynamics when the true samples, as well as the generated samples, are discrete, finite sets, and the discriminator is kernel-based.
arXiv Detail & Related papers (2022-08-21T18:03:06Z)
- On the Double Descent of Random Features Models Trained with SGD
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds for RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
- Robust Finite Mixture Regression for Heterogeneous Targets
We propose a finite mixture regression (FMR) model that finds sample clusters and jointly models multiple incomplete mixed-type targets.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.