The Effect of Balancing Methods on Model Behavior in Imbalanced
Classification Problems
- URL: http://arxiv.org/abs/2307.00157v1
- Date: Fri, 30 Jun 2023 22:25:01 GMT
- Authors: Adrian Stando, Mustafa Cavus, Przemysław Biecek
- Score: 4.370097023410272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imbalanced data poses a significant challenge in classification as model
performance is affected by insufficient learning from minority classes.
Balancing methods are often used to address this problem. However, such
techniques can lead to problems such as overfitting or loss of information.
This study addresses a more challenging aspect of balancing methods - their
impact on model behavior. To capture these changes, Explainable Artificial
Intelligence tools are used to compare models trained on datasets before and
after balancing. In addition to the variable importance method, this study uses
the partial dependence profile and accumulated local effects techniques. Real
and simulated datasets are tested, and an open-source Python package edgaro is
developed to facilitate this analysis. The results show significant changes in
model behavior due to balancing methods, which can lead to models biased toward
the balanced distribution. These findings confirm that the analysis of
balancing should go beyond model performance comparisons to achieve higher
reliability of machine learning models. Therefore, we propose a new method, the
performance gain plot, for an informed data balancing strategy: it supports an
optimal selection of the balancing method by weighing the measure of change in
model behavior against the performance gain.
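The comparison described in the abstract can be sketched in a few lines. The following is a minimal illustration using scikit-learn rather than the paper's edgaro package (whose API is not shown here); random oversampling stands in for the balancing methods studied, and the RMS distance between partial dependence profiles is an illustrative "behavior change" measure, not the paper's metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

# Imbalanced toy data: roughly 90% negatives, 10% positives.
X, y = make_classification(n_samples=1000, n_features=5,
                           weights=[0.9, 0.1], random_state=0)

# Model trained on the original, imbalanced data.
model_raw = RandomForestClassifier(random_state=0).fit(X, y)

# Random oversampling of the minority class (a stand-in for the
# balancing methods compared in the paper).
rng = np.random.default_rng(0)
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=(y == 0).sum() - minority.size,
                   replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
model_bal = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)

# Partial dependence of the positive-class probability on feature 0,
# evaluated on the *original* data so both profiles share one grid.
pd_raw = partial_dependence(model_raw, X, features=[0], kind="average")
pd_bal = partial_dependence(model_bal, X, features=[0], kind="average")

# Illustrative "behavior change" measure: RMS distance between profiles.
behavior_change = float(np.sqrt(np.mean(
    (pd_raw["average"][0] - pd_bal["average"][0]) ** 2)))
print(f"RMS change in the partial dependence profile: {behavior_change:.3f}")
```

Pairing such a behavior-change value with the performance gain of each balancing method yields the kind of trade-off the proposed performance gain plot visualizes.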
Related papers
- Rethinking the Bias of Foundation Model under Long-tailed Distribution [18.80942166783087]
We find that the imbalance biases inherited by foundation models manifest in downstream tasks as parameter imbalance and data imbalance.
During fine-tuning, we observe that parameter imbalance plays a more critical role, while data imbalance can be mitigated using existing re-balancing strategies.
We propose a novel backdoor adjustment method that learns the true causal effect between input samples and labels.
arXiv Detail & Related papers (2025-01-27T11:00:19Z)
- Investigating the Impact of Balancing, Filtering, and Complexity on Predictive Multiplicity: A Data-Centric Perspective [5.524804393257921]
Rashomon effect occurs when multiple models achieve similar performance on a dataset but produce different predictions, resulting in predictive multiplicity.
Data-centric AI approaches can mitigate these problems by prioritizing data optimization, particularly through preprocessing techniques.
This paper investigates how data preprocessing techniques like balancing and filtering methods impact predictive multiplicity and model stability, considering the complexity of the data.
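The predictive multiplicity mentioned in this entry can be quantified directly. The sketch below trains a small "Rashomon set" of similarly accurate models (same learner, different seeds, an illustrative construction) and measures ambiguity as the fraction of test points on which any model disagrees with the first; the definition is one common choice, not necessarily the one used in the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A small "Rashomon set": the same learner under different random seeds.
models = [
    RandomForestClassifier(n_estimators=50, random_state=s).fit(X_tr, y_tr)
    for s in range(5)
]
preds = np.array([m.predict(X_te) for m in models])  # (n_models, n_test)

# All models score similarly, yet disagree on some individual points.
accs = [float((p == y_te).mean()) for p in preds]
ambiguity = float((preds != preds[0]).any(axis=0).mean())
print(f"accuracies: {np.round(accs, 3)}, ambiguity: {ambiguity:.3f}")
```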
arXiv Detail & Related papers (2024-12-12T20:14:45Z)
- Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration [74.09687562334682]
We introduce a novel training data attribution method called Debias and Denoise Attribution (DDA).
Our method significantly outperforms existing approaches, achieving an averaged AUC of 91.64%.
DDA exhibits strong generality and scalability across various sources and different-scale models like LLaMA2, QWEN2, and Mistral.
arXiv Detail & Related papers (2024-10-02T07:14:26Z)
- Explainability of Machine Learning Models under Missing Data [3.0485328005356136]
Missing data is a prevalent issue that can significantly impair model performance and explainability.
This paper briefly summarizes the development of the field of missing data and investigates the effects of various imputation methods on SHAP.
arXiv Detail & Related papers (2024-06-29T11:31:09Z)
- Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z)
- An Experimental Study on the Rashomon Effect of Balancing Methods in Imbalanced Classification [0.0]
This paper examines the impact of balancing methods on predictive multiplicity using the Rashomon effect.
This matters because, in data-centric AI, blindly selecting a model from a set of approximately equally accurate models is risky.
arXiv Detail & Related papers (2024-03-22T13:08:22Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- How robust are pre-trained models to distribution shift? [82.08946007821184]
We show how spurious correlations affect the performance of popular self-supervised learning (SSL) and auto-encoder based (AE) models.
We develop a novel evaluation scheme with the linear head trained on out-of-distribution (OOD) data, to isolate the performance of the pre-trained models from a potential bias of the linear head used for evaluation.
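The evaluation scheme in this entry, a linear head trained on out-of-distribution data over frozen features, can be sketched compactly. Below, PCA stands in for a pretrained encoder and a constant mean offset stands in for the distribution shift; both are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Frozen "pretrained" feature extractor (PCA as a stand-in encoder).
encoder = PCA(n_components=5, random_state=0).fit(X)

# Simulated out-of-distribution split: a constant shift of the inputs.
X_ood, y_ood = X[:300] + 0.5, y[:300]

# Linear head trained on OOD data, evaluated on in-distribution data,
# isolating the quality of the frozen features from head bias.
head = LogisticRegression(max_iter=1000).fit(encoder.transform(X_ood), y_ood)
acc = head.score(encoder.transform(X[300:]), y[300:])
print(f"in-distribution accuracy of the OOD-trained linear head: {acc:.3f}")
```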
arXiv Detail & Related papers (2022-06-17T16:18:28Z)
- Analyzing the Effects of Handling Data Imbalance on Learned Features from Medical Images by Looking Into the Models [50.537859423741644]
Training a model on an imbalanced dataset can introduce unique challenges to the learning problem.
We look deeper into the internal units of neural networks to observe how handling data imbalance affects the learned features.
arXiv Detail & Related papers (2022-04-04T09:38:38Z)
- Sampling To Improve Predictions For Underrepresented Observations In Imbalanced Data [0.0]
Data imbalance negatively impacts the predictive performance of models on underrepresented observations.
We propose sampling to adjust for this imbalance with the goal of improving the performance of models trained on historical production data.
We apply our methods on a large biopharmaceutical manufacturing data set from an advanced simulation of penicillin production.
arXiv Detail & Related papers (2021-11-17T12:16:54Z)
- How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as the $\rho$-gap.
We show how the $\rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.