An empirical evaluation of imbalanced data strategies from a
practitioner's point of view
- URL: http://arxiv.org/abs/1810.07168v2
- Date: Fri, 10 Nov 2023 15:54:40 GMT
- Title: An empirical evaluation of imbalanced data strategies from a
practitioner's point of view
- Authors: Jacques Wainer
- Abstract summary: This paper evaluates six strategies for mitigating imbalanced data: oversampling, undersampling, ensemble methods, specialized algorithms, class weight adjustments, and a no-mitigation approach.
These strategies were tested on 58 real-life binary imbalanced datasets with imbalance rates ranging from 3 to 120.
- Score: 1.9580473532948401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper evaluates six strategies for mitigating imbalanced data:
oversampling, undersampling, ensemble methods, specialized algorithms, class
weight adjustments, and a no-mitigation approach referred to as the baseline.
These strategies were tested on 58 real-life binary imbalanced datasets with
imbalance rates ranging from 3 to 120. We conducted a comparative analysis of
10 undersampling algorithms, 5 oversampling algorithms, 2 ensemble methods,
and 3 specialized algorithms across eight different performance metrics:
accuracy, area under the ROC curve (AUC), balanced accuracy, F1-measure,
G-mean, Matthews correlation coefficient, precision, and recall. Additionally,
we assessed the six strategies on altered datasets, derived from real-life
data, with both low (3) and high (100 or 300) imbalance ratios (IR).
The principal finding indicates that the effectiveness of each strategy
significantly varies depending on the metric used. The paper also examines a
selection of newer algorithms within the categories of specialized algorithms,
oversampling, and ensemble methods. The findings suggest that the current
hierarchy of best-performing strategies for each metric is unlikely to change
with the introduction of newer algorithms.
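Two of the simpler strategies evaluated above, random oversampling and class-weight adjustment, can be sketched in a few lines. This is a minimal illustration under assumed conventions (binary labels 0/1, list-of-lists features), not the exact algorithms benchmarked in the paper:

```python
import random

def imbalance_ratio(labels):
    """IR = (# majority examples) / (# minority examples)."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    return max(pos, neg) / min(pos, neg)

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class examples until both
    classes have equal size (the simplest oversampling strategy)."""
    rng = random.Random(seed)
    minority = 1 if sum(y) < len(y) / 2 else 0
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    deficit = len(y) - 2 * len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(deficit)]
    return X + [X[i] for i in extra], y + [y[i] for i in extra]

def balanced_class_weights(y):
    """Weights inversely proportional to class frequency:
    w_c = n_samples / (n_classes * n_c)."""
    n = len(y)
    counts = {c: y.count(c) for c in set(y)}
    return {c: n / (len(counts) * n_c) for c, n_c in counts.items()}
```

For a dataset with 9 negatives and 1 positive, `imbalance_ratio` returns 9.0 (within the 3-120 range studied), oversampling yields 9 examples per class, and the minority class receives weight 5.0.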
Related papers
- Quantized Hierarchical Federated Learning: A Robust Approach to
Statistical Heterogeneity [3.8798345704175534]
We present a novel hierarchical federated learning algorithm that incorporates quantization for communication-efficiency.
We offer a comprehensive analytical framework to evaluate its optimality gap and convergence rate.
Our findings reveal that our algorithm consistently achieves high learning accuracy over a range of parameters.
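The communication-efficiency idea, in which agents send quantized updates rather than full-precision values, can be illustrated with a simple uniform quantizer. This is an illustrative sketch only, not the paper's actual scheme; the bit-width and rounding rule are assumptions:

```python
def uniform_quantize(values, bits=4):
    """Map each float to one of 2**bits levels spanning [min, max],
    so an update can be transmitted as small integers plus two floats."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Recover approximate values; error is bounded by scale / 2."""
    return [lo + c * scale for c in codes]
```

Rounding to the nearest level bounds the per-coordinate reconstruction error by half a quantization step, which is the kind of bounded-error assumption quantized-FL analyses typically build on.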
arXiv Detail & Related papers (2024-03-03T15:40:24Z)
- From Variability to Stability: Advancing RecSys Benchmarking Practices [3.458464808497421]
This paper introduces a novel benchmarking methodology to facilitate a fair and robust comparison of RecSys algorithms.
By utilizing a diverse set of $30$ open datasets, including two introduced in this work, we critically examine the influence of dataset characteristics on algorithm performance.
arXiv Detail & Related papers (2024-02-15T07:35:52Z)
- DBGSA: A Novel Data Adaptive Bregman Clustering Algorithm [2.0232038310495435]
We present a clustering algorithm that addresses the high sensitivity of standard clustering methods to the initial centroid selection and improves robustness across datasets.
Extensive experiments are conducted on four simulated datasets and six real datasets.
Results demonstrate that our algorithm improves the accuracy of various algorithms by an average of 63.8%.
arXiv Detail & Related papers (2023-07-25T16:37:09Z)
- An Empirical Analysis of the Efficacy of Different Sampling Techniques for Imbalanced Classification [0.0]
The prevalence of imbalance in real-world datasets has led to the creation of a multitude of strategies for the class imbalance issue.
Standard classification algorithms tend to perform poorly when trained on imbalanced data.
We present a comprehensive analysis of 26 popular sampling techniques to understand their effectiveness in dealing with imbalanced data.
arXiv Detail & Related papers (2022-08-25T03:45:34Z)
- Regularization Penalty Optimization for Addressing Data Quality Variance in OoD Algorithms [45.02465532852302]
We theoretically reveal the relationship between training data quality and algorithm performance.
A novel algorithm is proposed to alleviate the influence of low-quality data at both the sample level and the domain level.
arXiv Detail & Related papers (2022-06-12T14:36:04Z)
- A Priori Denoising Strategies for Sparse Identification of Nonlinear Dynamical Systems: A Comparative Study [68.8204255655161]
We investigate and compare the performance of several local and global smoothing techniques to a priori denoise the state measurements.
We show that, in general, global methods, which use the entire measurement data set, outperform local methods, which employ a neighboring data subset around a local point.
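The local-versus-global distinction can be made concrete: a moving average smooths each point using only a neighboring subset, while a least-squares line fit uses the entire measurement set. A toy sketch with invented names (the paper compares more sophisticated smoothers):

```python
def moving_average(ys, half_window=1):
    """Local smoother: average over a neighborhood around each point."""
    n = len(ys)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        out.append(sum(ys[lo:hi]) / (hi - lo))
    return out

def global_line_fit(xs, ys):
    """Global smoother: ordinary least-squares line using all data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [intercept + slope * x for x in xs]
```

The global fit pools information from every sample, which is the structural reason such methods can outperform purely local neighborhoods when the underlying signal is well modeled.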
arXiv Detail & Related papers (2022-01-29T23:31:25Z)
- Amortized Implicit Differentiation for Stochastic Bilevel Optimization [53.12363770169761]
We study a class of algorithms for solving bilevel optimization problems in both deterministic and stochastic settings.
We exploit a warm-start strategy to amortize the estimation of the exact gradient.
By using this framework, our analysis shows these algorithms to match the computational complexity of methods that have access to an unbiased estimate of the gradient.
arXiv Detail & Related papers (2021-11-29T15:10:09Z)
- Riemannian classification of EEG signals with missing values [67.90148548467762]
This paper proposes two strategies to handle missing data for the classification of electroencephalograms.
The first approach estimates the covariance from imputed data with the $k$-nearest neighbors algorithm; the second relies on the observed data by leveraging the observed-data likelihood within an expectation-maximization algorithm.
As the results show, the proposed strategies perform better than classification based on observed data alone and maintain high accuracy even as the missing-data ratio increases.
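The first strategy, imputing missing entries from the $k$ nearest complete observations before estimating the covariance, can be sketched as follows. The distance and averaging choices here are assumptions for illustration, not the paper's exact procedure:

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries in each row with the per-feature mean of the
    k nearest complete rows (distance computed on observed features)."""
    complete = [r for r in rows if None not in r]

    def dist(a, b):
        # Euclidean distance restricted to the features observed in `a`.
        return math.sqrt(sum((x - y) ** 2
                             for x, y in zip(a, b) if x is not None))

    imputed = []
    for r in rows:
        if None not in r:
            imputed.append(list(r))
            continue
        neighbors = sorted(complete, key=lambda c: dist(r, c))[:k]
        imputed.append([
            x if x is not None
            else sum(c[j] for c in neighbors) / len(neighbors)
            for j, x in enumerate(r)
        ])
    return imputed
```

A covariance matrix estimated from the imputed rows can then feed the Riemannian classifier as usual.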
arXiv Detail & Related papers (2021-10-19T14:24:50Z)
- Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
arXiv Detail & Related papers (2021-05-23T19:21:55Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Dynamic Federated Learning [57.14673504239551]
Federated learning has emerged as an umbrella term for centralized coordination strategies in multi-agent environments.
We consider a federated learning model where at every iteration, a random subset of available agents perform local updates based on their data.
Under a non-stationary random walk model on the true minimizer for the aggregate optimization problem, we establish that the performance of the architecture is determined by three factors, namely, the data variability at each agent, the model variability across all agents, and a tracking term that is inversely proportional to the learning rate of the algorithm.
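The setup described, where at each iteration a random subset of agents performs local updates that the server then aggregates, can be sketched for a scalar least-squares problem. The agent losses, step size, and subset size below are invented for illustration:

```python
import random

def federated_step(w, agent_targets, subset_size, lr, rng):
    """One round: sample a subset of agents, each takes one local
    gradient step on its own loss (w - target)**2 / 2, then the
    server averages the resulting local models."""
    active = rng.sample(range(len(agent_targets)), subset_size)
    local = [w - lr * (w - agent_targets[i]) for i in active]
    return sum(local) / len(local)

def run(agent_targets, rounds=200, subset_size=2, lr=0.5, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(rounds):
        w = federated_step(w, agent_targets, subset_size, lr, rng)
    return w
```

Because only a random subset participates each round, the iterate fluctuates around the aggregate minimizer rather than converging exactly, which is precisely the data-variability and model-variability behavior the analysis quantifies.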
arXiv Detail & Related papers (2020-02-20T15:00:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.