GK-SMOTE: A Hyperparameter-free Noise-Resilient Gaussian KDE-Based Oversampling Approach
- URL: http://arxiv.org/abs/2509.11163v1
- Date: Sun, 14 Sep 2025 08:50:30 GMT
- Title: GK-SMOTE: A Hyperparameter-free Noise-Resilient Gaussian KDE-Based Oversampling Approach
- Authors: Mahabubur Rahman Miraj, Hongyu Huang, Ting Yang, Jinxue Zhao, Nankun Mu, Xinyu Lei
- Abstract summary: Imbalanced classification is a significant challenge in machine learning, especially in critical applications like medical diagnosis, fraud detection, and cybersecurity. Traditional oversampling techniques, such as SMOTE, often fail to handle label noise and complex data distributions, leading to reduced classification accuracy. We propose GK-SMOTE, a noise-resilient extension of SMOTE, built on Gaussian Kernel Density Estimation (KDE). GK-SMOTE enhances class separability by generating synthetic samples in high-density minority regions, while effectively avoiding noisy or ambiguous areas.
- Score: 5.681470105992214
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imbalanced classification is a significant challenge in machine learning, especially in critical applications like medical diagnosis, fraud detection, and cybersecurity. Traditional oversampling techniques, such as SMOTE, often fail to handle label noise and complex data distributions, leading to reduced classification accuracy. In this paper, we propose GK-SMOTE, a hyperparameter-free, noise-resilient extension of SMOTE, built on Gaussian Kernel Density Estimation (KDE). GK-SMOTE enhances class separability by generating synthetic samples in high-density minority regions, while effectively avoiding noisy or ambiguous areas. This self-adaptive approach uses Gaussian KDE to differentiate between safe and noisy regions, ensuring more accurate sample generation without requiring extensive parameter tuning. Our extensive experiments on diverse binary classification datasets demonstrate that GK-SMOTE outperforms existing state-of-the-art oversampling techniques across key evaluation metrics, including MCC, Balanced Accuracy, and AUPRC. The proposed method offers a robust, efficient solution for imbalanced classification tasks, especially in noisy data environments, making it an attractive choice for real-world applications.
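The abstract's core mechanism — fitting Gaussian KDEs to distinguish safe, high-density minority regions from noisy ones before SMOTE-style interpolation — can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `density_quantile` threshold, the "minority density exceeds majority density" safety rule, and the random-pair interpolation are all assumptions made for the sketch.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_safe_oversample(X_min, X_maj, n_new, density_quantile=0.5, rng=None):
    """Sketch of KDE-guided oversampling (illustrative, hypothetical).

    Fits Gaussian KDEs (default Scott's-rule bandwidth) to both classes,
    labels minority points as 'safe' when their minority-class density is
    both above a quantile threshold and higher than the majority-class
    density at the same point, then interpolates new samples between
    random pairs of safe points, SMOTE-style.
    """
    rng = np.random.default_rng(rng)
    # scipy's gaussian_kde expects data with shape (n_features, n_samples)
    kde_min = gaussian_kde(X_min.T)
    kde_maj = gaussian_kde(X_maj.T)
    d_min = kde_min(X_min.T)  # minority-class density at each minority point
    d_maj = kde_maj(X_min.T)  # majority-class density at each minority point
    thresh = np.quantile(d_min, density_quantile)
    safe = X_min[(d_min >= thresh) & (d_min > d_maj)]
    if len(safe) < 2:
        safe = X_min  # fall back if the filter is too strict
    # Interpolate between random pairs of safe points
    i = rng.integers(0, len(safe), n_new)
    j = rng.integers(0, len(safe), n_new)
    lam = rng.random((n_new, 1))
    return safe[i] + lam * (safe[j] - safe[i])
```

Because every synthetic point is a convex combination of two safe minority points, the generated samples stay inside the minority class's bounding region and away from low-density (potentially noisy) areas, which is the behavior the abstract attributes to GK-SMOTE.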
Related papers
- Sharpness-aware Dynamic Anchor Selection for Generalized Category Discovery [61.694524826522205]
Given some labeled data of known classes, GCD aims to cluster unlabeled data that contain both known and unknown classes. Large pre-trained models have a preference for specific visual patterns, resulting in the encoding of spurious correlations for unlabeled data. We propose a novel method that contains two modules: Loss Sharpness Penalty (LSP) and Dynamic Anchor Selection (DAS).
arXiv Detail & Related papers (2025-12-15T02:24:06Z)
- Adaptive kernel-density approach for imbalanced binary classification [2.3126620160776845]
Class imbalance is a common challenge in real-world binary classification tasks. We propose a novel approach called Kernel-density-Oriented Threshold Adjustment with Regional Optimization (KOTARO). In KOTARO, the bandwidth of Gaussian basis functions is dynamically tuned based on the estimated density around each sample.
arXiv Detail & Related papers (2025-10-05T05:45:00Z)
- Optimal Hyperspectral Undersampling Strategy for Satellite Imaging [0.0]
We propose a novel Iterative Wavelet-based Gradient Sampling (IWGS) method for hyperspectral image classification. IWGS incrementally selects the most informative spectral bands by analyzing gradients within the wavelet-transformed domain. We conduct comprehensive experiments on two widely used benchmark HSI datasets: Houston 2013 and Indian Pines. IWGS consistently outperforms state-of-the-art band selection and classification techniques in terms of both accuracy and computational efficiency.
arXiv Detail & Related papers (2025-04-27T15:33:33Z)
- GHOST: Gaussian Hypothesis Open-Set Technique [10.426399605773083]
Evaluations of large-scale recognition methods typically focus on overall performance. Addressing fairness in Open-Set Recognition (OSR), we demonstrate that per-class performance can vary dramatically. We apply Z-score normalization to logits to mitigate the impact of feature magnitudes that deviate from the model's expectations.
arXiv Detail & Related papers (2025-02-05T16:56:14Z)
- Noise-Adaptive Conformal Classification with Marginal Coverage [53.74125453366155]
We introduce an adaptive conformal inference method capable of efficiently handling deviations from exchangeability caused by random label noise. We validate our method through extensive numerical experiments demonstrating its effectiveness on synthetic and real data sets.
arXiv Detail & Related papers (2025-01-29T23:55:23Z)
- A Hybrid Framework for Statistical Feature Selection and Image-Based Noise-Defect Detection [55.2480439325792]
This paper presents a hybrid framework that integrates statistical feature selection and classification techniques to improve defect detection accuracy. We present around 55 distinct features extracted from industrial images, which are then analyzed using statistical methods. By integrating these methods with flexible machine learning applications, the proposed framework improves detection accuracy and reduces false positives and misclassifications.
arXiv Detail & Related papers (2024-12-11T22:12:21Z)
- Robust Gaussian Processes via Relevance Pursuit [17.39376866275623]
We propose and study a GP model that achieves robustness against sparse outliers by inferring data-point-specific noise levels. We show, surprisingly, that the model can be parameterized such that the associated log marginal likelihood is strongly concave in the data-point-specific noise variances.
arXiv Detail & Related papers (2024-10-31T17:59:56Z)
- On the Privacy of Selection Mechanisms with Gaussian Noise [44.577599546904736]
We revisit the analysis of Report Noisy Max and Above Threshold with Gaussian noise.
We find that it is possible to provide pure ex-ante DP bounds for Report Noisy Max and pure ex-post DP bounds for Above Threshold.
arXiv Detail & Related papers (2024-02-09T02:11:25Z)
- Risk-Sensitive Diffusion: Robustly Optimizing Diffusion Models with Noisy Samples [58.68233326265417]
Non-image data are prevalent in real applications and tend to be noisy.
A risk-sensitive SDE is a type of stochastic differential equation (SDE) parameterized by the risk vector.
We conduct systematic studies for both Gaussian and non-Gaussian noise distributions.
arXiv Detail & Related papers (2024-02-03T08:41:51Z)
- Decision Forest Based EMG Signal Classification with Low Volume Dataset Augmented with Random Variance Gaussian Noise [51.76329821186873]
We produce a model that can classify six different hand gestures with a limited number of samples that generalizes well to a wider audience.
We appeal to a set of more elementary methods, such as the use of random bounds on a signal, and aim to show the power these methods can carry in an online setting.
arXiv Detail & Related papers (2022-06-29T23:22:18Z)
- Hybrid Random Features [60.116392415715275]
We propose a new class of random feature methods for linearizing softmax and Gaussian kernels, called hybrid random features (HRFs).
HRFs automatically adapt the quality of kernel estimation to provide most accurate approximation in the defined regions of interest.
arXiv Detail & Related papers (2021-10-08T20:22:59Z)
- SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection [63.253850875265115]
Outlier detection (OD) is a key machine learning (ML) task for identifying abnormal objects from general samples.
We propose a modular acceleration system, called SUOD, to address this challenge.
arXiv Detail & Related papers (2020-03-11T00:22:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.