An empirical evaluation of imbalanced data strategies from a
practitioner's point of view
- URL: http://arxiv.org/abs/1810.07168v2
- Date: Fri, 10 Nov 2023 15:54:40 GMT
- Title: An empirical evaluation of imbalanced data strategies from a
practitioner's point of view
- Authors: Jacques Wainer
- Abstract summary: This paper evaluates six strategies for mitigating imbalanced data: oversampling, undersampling, ensemble methods, specialized algorithms, class weight adjustments, and a no-mitigation approach.
These strategies were tested on 58 real-life binary imbalanced datasets with imbalance rates ranging from 3 to 120.
- Score: 1.9580473532948401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper evaluates six strategies for mitigating imbalanced data:
oversampling, undersampling, ensemble methods, specialized algorithms, class
weight adjustments, and a no-mitigation approach referred to as the baseline.
These strategies were tested on 58 real-life binary imbalanced datasets with
imbalance rates ranging from 3 to 120. We conducted a comparative analysis of
10 undersampling algorithms, 5 oversampling algorithms, 2 ensemble methods,
and 3 specialized algorithms across eight different performance metrics:
accuracy, area under the ROC curve (AUC), balanced accuracy, F1-measure,
G-mean, Matthews correlation coefficient, precision, and recall. Additionally,
we assessed the six strategies on altered datasets, derived from real-life
data, with both low (3) and high (100 or 300) imbalance ratios (IR).
The principal finding indicates that the effectiveness of each strategy
significantly varies depending on the metric used. The paper also examines a
selection of newer algorithms within the categories of specialized algorithms,
oversampling, and ensemble methods. The findings suggest that the current
hierarchy of best-performing strategies for each metric is unlikely to change
with the introduction of newer algorithms.
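Two of the simpler strategies evaluated above, random oversampling and class-weight adjustment, can be sketched in a few lines. This is a minimal illustration under assumed conventions (binary labels 0/1, list-of-lists features), not the exact algorithms benchmarked in the paper:

```python
import random

def imbalance_ratio(labels):
    """IR = (# majority examples) / (# minority examples)."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    return max(pos, neg) / min(pos, neg)

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class examples until both
    classes have equal size (the simplest oversampling strategy)."""
    rng = random.Random(seed)
    minority = 1 if sum(y) < len(y) / 2 else 0
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    deficit = len(y) - 2 * len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(deficit)]
    return X + [X[i] for i in extra], y + [y[i] for i in extra]

def balanced_class_weights(y):
    """Weights inversely proportional to class frequency:
    w_c = n_samples / (n_classes * n_c)."""
    n = len(y)
    counts = {c: y.count(c) for c in set(y)}
    return {c: n / (len(counts) * n_c) for c, n_c in counts.items()}
```

For a dataset with 9 negatives and 1 positive, `imbalance_ratio` returns 9.0 (within the 3-120 range studied), oversampling yields 9 examples per class, and the minority class receives weight 5.0.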
Related papers
- Quantized Hierarchical Federated Learning: A Robust Approach to
Statistical Heterogeneity [3.8798345704175534]
We present a novel hierarchical federated learning algorithm that incorporates quantization for communication-efficiency.
We offer a comprehensive analytical framework to evaluate its optimality gap and convergence rate.
Our findings reveal that our algorithm consistently achieves high learning accuracy over a range of parameters.
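The communication-efficiency idea, in which agents send quantized updates rather than full-precision values, can be illustrated with a simple uniform quantizer. This is an illustrative sketch only, not the paper's actual scheme; the bit-width and rounding rule are assumptions:

```python
def uniform_quantize(values, bits=4):
    """Map each float to one of 2**bits levels spanning [min, max],
    so an update can be transmitted as small integers plus two floats."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Recover approximate values; error is bounded by scale / 2."""
    return [lo + c * scale for c in codes]
```

Rounding to the nearest level bounds the per-coordinate reconstruction error by half a quantization step, which is the kind of bounded-error assumption quantized-FL analyses typically build on.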
arXiv Detail & Related papers (2024-03-03T15:40:24Z)
- From Variability to Stability: Advancing RecSys Benchmarking Practices [3.458464808497421]
This paper introduces a novel benchmarking methodology to facilitate a fair and robust comparison of RecSys algorithms.
By utilizing a diverse set of $30$ open datasets, including two introduced in this work, we critically examine the influence of dataset characteristics on algorithm performance.
arXiv Detail & Related papers (2024-02-15T07:35:52Z)
- DBGSA: A Novel Data Adaptive Bregman Clustering Algorithm [2.0232038310495435]
We present a clustering algorithm that addresses the high sensitivity of standard clustering methods to the initial centroid selection and improves robustness across datasets.
Extensive experiments are conducted on four simulated datasets and six real datasets.
Results demonstrate that our algorithm improves the accuracy of various algorithms by an average of 63.8%.
arXiv Detail & Related papers (2023-07-25T16:37:09Z)
- An Empirical Analysis of the Efficacy of Different Sampling Techniques for Imbalanced Classification [0.0]
The prevalence of imbalance in real-world datasets has led to the creation of a multitude of strategies for the class imbalance issue.
Standard classification algorithms tend to perform poorly when trained on imbalanced data.
We present a comprehensive analysis of 26 popular sampling techniques to understand their effectiveness in dealing with imbalanced data.
arXiv Detail & Related papers (2022-08-25T03:45:34Z)
- Regularization Penalty Optimization for Addressing Data Quality Variance in OoD Algorithms [45.02465532852302]
We theoretically reveal the relationship between training data quality and algorithm performance.
A novel algorithm is proposed to alleviate the influence of low-quality data at both the sample level and the domain level.
arXiv Detail & Related papers (2022-06-12T14:36:04Z)
- A Priori Denoising Strategies for Sparse Identification of Nonlinear Dynamical Systems: A Comparative Study [68.8204255655161]
We investigate and compare the performance of several local and global smoothing techniques to a priori denoise the state measurements.
We show that, in general, global methods, which use the entire measurement data set, outperform local methods, which employ a neighboring data subset around a local point.
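The local-versus-global distinction can be made concrete: a moving average smooths each point using only a neighboring subset, while a least-squares line fit uses the entire measurement set. A toy sketch with invented names (the paper compares more sophisticated smoothers):

```python
def moving_average(ys, half_window=1):
    """Local smoother: average over a neighborhood around each point."""
    n = len(ys)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        out.append(sum(ys[lo:hi]) / (hi - lo))
    return out

def global_line_fit(xs, ys):
    """Global smoother: ordinary least-squares line using all data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [intercept + slope * x for x in xs]
```

The global fit pools information from every sample, which is the structural reason such methods can outperform purely local neighborhoods when the underlying signal is well modeled.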
arXiv Detail & Related papers (2022-01-29T23:31:25Z)
- Amortized Implicit Differentiation for Stochastic Bilevel Optimization [53.12363770169761]
We study a class of algorithms for solving bilevel optimization problems in both deterministic and stochastic settings.
We exploit a warm-start strategy to amortize the estimation of the exact gradient.
By using this framework, our analysis shows these algorithms to match the computational complexity of methods that have access to an unbiased estimate of the gradient.
arXiv Detail & Related papers (2021-11-29T15:10:09Z)
- Riemannian classification of EEG signals with missing values [67.90148548467762]
This paper proposes two strategies to handle missing data for the classification of electroencephalograms.
The first approach estimates the covariance from imputed data with the $k$-nearest neighbors algorithm; the second relies on the observed data by leveraging the observed-data likelihood within an expectation-maximization algorithm.
As the results show, the proposed strategies perform better than classification based on observed data alone and maintain high accuracy even as the missing-data ratio increases.
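The first strategy, imputing missing entries from the $k$ nearest complete observations before estimating the covariance, can be sketched as follows. The distance and averaging choices here are assumptions for illustration, not the paper's exact procedure:

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries in each row with the per-feature mean of the
    k nearest complete rows (distance computed on observed features)."""
    complete = [r for r in rows if None not in r]

    def dist(a, b):
        # Euclidean distance restricted to the features observed in `a`.
        return math.sqrt(sum((x - y) ** 2
                             for x, y in zip(a, b) if x is not None))

    imputed = []
    for r in rows:
        if None not in r:
            imputed.append(list(r))
            continue
        neighbors = sorted(complete, key=lambda c: dist(r, c))[:k]
        imputed.append([
            x if x is not None
            else sum(c[j] for c in neighbors) / len(neighbors)
            for j, x in enumerate(r)
        ])
    return imputed
```

A covariance matrix estimated from the imputed rows can then feed the Riemannian classifier as usual.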
arXiv Detail & Related papers (2021-10-19T14:24:50Z)
- Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
arXiv Detail & Related papers (2021-05-23T19:21:55Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Dynamic Federated Learning [57.14673504239551]
Federated learning has emerged as an umbrella term for centralized coordination strategies in multi-agent environments.
We consider a federated learning model where at every iteration, a random subset of available agents perform local updates based on their data.
Under a non-stationary random walk model on the true minimizer for the aggregate optimization problem, we establish that the performance of the architecture is determined by three factors, namely, the data variability at each agent, the model variability across all agents, and a tracking term that is inversely proportional to the learning rate of the algorithm.
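The setup described, where at each iteration a random subset of agents performs local updates that the server then aggregates, can be sketched for a scalar least-squares problem. The agent losses, step size, and subset size below are invented for illustration:

```python
import random

def federated_step(w, agent_targets, subset_size, lr, rng):
    """One round: sample a subset of agents, each takes one local
    gradient step on its own loss (w - target)**2 / 2, then the
    server averages the resulting local models."""
    active = rng.sample(range(len(agent_targets)), subset_size)
    local = [w - lr * (w - agent_targets[i]) for i in active]
    return sum(local) / len(local)

def run(agent_targets, rounds=200, subset_size=2, lr=0.5, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(rounds):
        w = federated_step(w, agent_targets, subset_size, lr, rng)
    return w
```

Because only a random subset participates each round, the iterate fluctuates around the aggregate minimizer rather than converging exactly, which is precisely the data-variability and model-variability behavior the analysis quantifies.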
arXiv Detail & Related papers (2020-02-20T15:00:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.