The Effect of Balancing Methods on Model Behavior in Imbalanced
Classification Problems
- URL: http://arxiv.org/abs/2307.00157v1
- Date: Fri, 30 Jun 2023 22:25:01 GMT
- Authors: Adrian Stando, Mustafa Cavus, Przemysław Biecek
- Score: 4.370097023410272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imbalanced data poses a significant challenge in classification as model
performance is affected by insufficient learning from minority classes.
Balancing methods are often used to address this problem. However, such
techniques can lead to problems such as overfitting or loss of information.
This study addresses a more challenging aspect of balancing methods - their
impact on model behavior. To capture these changes, Explainable Artificial
Intelligence tools are used to compare models trained on datasets before and
after balancing. In addition to the variable importance method, this study uses
the partial dependence profile and accumulated local effects techniques. Real
and simulated datasets are tested, and an open-source Python package edgaro is
developed to facilitate this analysis. The results show significant changes in
model behavior due to balancing methods, which can lead to models biased toward
the balanced distribution. These findings confirm that the analysis of
balancing should go beyond model performance comparisons to achieve higher
reliability of machine learning models. Therefore, we propose a new method, the
performance gain plot, for an informed data balancing strategy: it supports an
optimal selection of the balancing method by weighing the measure of change in
model behavior against the performance gain.
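The comparison described in the abstract can be sketched in a few lines. The following is a minimal illustration using scikit-learn rather than the paper's edgaro package (whose API is not shown here); random oversampling stands in for the balancing methods studied, and the RMS distance between partial dependence profiles is an illustrative "behavior change" measure, not the paper's metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

# Imbalanced toy data: roughly 90% negatives, 10% positives.
X, y = make_classification(n_samples=1000, n_features=5,
                           weights=[0.9, 0.1], random_state=0)

# Model trained on the original, imbalanced data.
model_raw = RandomForestClassifier(random_state=0).fit(X, y)

# Random oversampling of the minority class (a stand-in for the
# balancing methods compared in the paper).
rng = np.random.default_rng(0)
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=(y == 0).sum() - minority.size,
                   replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
model_bal = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)

# Partial dependence of the positive-class probability on feature 0,
# evaluated on the *original* data so both profiles share one grid.
pd_raw = partial_dependence(model_raw, X, features=[0], kind="average")
pd_bal = partial_dependence(model_bal, X, features=[0], kind="average")

# Illustrative "behavior change" measure: RMS distance between profiles.
behavior_change = float(np.sqrt(np.mean(
    (pd_raw["average"][0] - pd_bal["average"][0]) ** 2)))
print(f"RMS change in the partial dependence profile: {behavior_change:.3f}")
```

Pairing such a behavior-change value with the performance gain of each balancing method yields the kind of trade-off the proposed performance gain plot visualizes.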
Related papers
- Rethinking the Bias of Foundation Model under Long-tailed Distribution [18.80942166783087]
We find that the imbalance biases inherited by foundation models manifest in downstream tasks as parameter imbalance and data imbalance.
During fine-tuning, we observe that parameter imbalance plays a more critical role, while data imbalance can be mitigated using existing re-balancing strategies.
We propose a novel backdoor adjustment method that learns the true causal effect between input samples and labels.
arXiv Detail & Related papers (2025-01-27T11:00:19Z)
- Investigating the Impact of Balancing, Filtering, and Complexity on Predictive Multiplicity: A Data-Centric Perspective [5.524804393257921]
Rashomon effect occurs when multiple models achieve similar performance on a dataset but produce different predictions, resulting in predictive multiplicity.
Data-centric AI approaches can mitigate these problems by prioritizing data optimization, particularly through preprocessing techniques.
This paper investigates how data preprocessing techniques like balancing and filtering methods impact predictive multiplicity and model stability, considering the complexity of the data.
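The predictive multiplicity mentioned in this entry can be quantified directly. The sketch below trains a small "Rashomon set" of similarly accurate models (same learner, different seeds, an illustrative construction) and measures ambiguity as the fraction of test points on which any model disagrees with the first; the definition is one common choice, not necessarily the one used in the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A small "Rashomon set": the same learner under different random seeds.
models = [
    RandomForestClassifier(n_estimators=50, random_state=s).fit(X_tr, y_tr)
    for s in range(5)
]
preds = np.array([m.predict(X_te) for m in models])  # (n_models, n_test)

# All models score similarly, yet disagree on some individual points.
accs = [float((p == y_te).mean()) for p in preds]
ambiguity = float((preds != preds[0]).any(axis=0).mean())
print(f"accuracies: {np.round(accs, 3)}, ambiguity: {ambiguity:.3f}")
```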
arXiv Detail & Related papers (2024-12-12T20:14:45Z)
- Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration [74.09687562334682]
We introduce a novel training data attribution method called Debias and Denoise Attribution (DDA).
Our method significantly outperforms existing approaches, achieving an averaged AUC of 91.64%.
DDA exhibits strong generality and scalability across various sources and different-scale models like LLaMA2, QWEN2, and Mistral.
arXiv Detail & Related papers (2024-10-02T07:14:26Z)
- Explainability of Machine Learning Models under Missing Data [3.0485328005356136]
Missing data is a prevalent issue that can significantly impair model performance and explainability.
This paper briefly summarizes the development of the field of missing data and investigates the effects of various imputation methods on SHAP.
arXiv Detail & Related papers (2024-06-29T11:31:09Z)
- Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z)
- An Experimental Study on the Rashomon Effect of Balancing Methods in Imbalanced Classification [0.0]
This paper examines the impact of balancing methods on predictive multiplicity using the Rashomon effect.
This matters because, in data-centric AI, blindly selecting a model from a set of approximately equally accurate models is risky.
arXiv Detail & Related papers (2024-03-22T13:08:22Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- How robust are pre-trained models to distribution shift? [82.08946007821184]
We show how spurious correlations affect the performance of popular self-supervised learning (SSL) and auto-encoder based (AE) models.
We develop a novel evaluation scheme with the linear head trained on out-of-distribution (OOD) data, to isolate the performance of the pre-trained models from a potential bias of the linear head used for evaluation.
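The evaluation scheme in this entry, a linear head trained on out-of-distribution data over frozen features, can be sketched compactly. Below, PCA stands in for a pretrained encoder and a constant mean offset stands in for the distribution shift; both are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Frozen "pretrained" feature extractor (PCA as a stand-in encoder).
encoder = PCA(n_components=5, random_state=0).fit(X)

# Simulated out-of-distribution split: a constant shift of the inputs.
X_ood, y_ood = X[:300] + 0.5, y[:300]

# Linear head trained on OOD data, evaluated on in-distribution data,
# isolating the quality of the frozen features from head bias.
head = LogisticRegression(max_iter=1000).fit(encoder.transform(X_ood), y_ood)
acc = head.score(encoder.transform(X[300:]), y[300:])
print(f"in-distribution accuracy of the OOD-trained linear head: {acc:.3f}")
```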
arXiv Detail & Related papers (2022-06-17T16:18:28Z)
- Analyzing the Effects of Handling Data Imbalance on Learned Features from Medical Images by Looking Into the Models [50.537859423741644]
Training a model on an imbalanced dataset can introduce unique challenges to the learning problem.
We look deeper into the internal units of neural networks to observe how handling data imbalance affects the learned features.
arXiv Detail & Related papers (2022-04-04T09:38:38Z)
- Sampling To Improve Predictions For Underrepresented Observations In Imbalanced Data [0.0]
Data imbalance negatively impacts the predictive performance of models on underrepresented observations.
We propose sampling to adjust for this imbalance with the goal of improving the performance of models trained on historical production data.
We apply our methods on a large biopharmaceutical manufacturing data set from an advanced simulation of penicillin production.
arXiv Detail & Related papers (2021-11-17T12:16:54Z)
- How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as the $\rho$-gap.
We show how the $\rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.