Related papers: The Hidden Influence of Latent Feature Magnitude When Learning with Imbalanced Data

URL: http://arxiv.org/abs/2407.10165v1
Date: Sun, 14 Jul 2024 11:20:50 GMT
Title: The Hidden Influence of Latent Feature Magnitude When Learning with Imbalanced Data
Authors: Damien A. Dablain, Nitesh V. Chawla
Abstract summary: We show that one of the central causes of impaired generalization when learning with imbalanced data is the inherent manner in which ML models perform inference. We demonstrate that even with aggressive data augmentation, which generally improves minority class prediction accuracy, parametric ML models still associate a class label with a limited number of feature combinations.
Score: 22.521678971526253
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Machine learning (ML) models have difficulty generalizing when the number of training instances per class is imbalanced. The problem of generalization in the face of data imbalance has largely been attributed to the lack of training data for under-represented classes and to feature overlap. The typical remedy is to implement data augmentation for classes with fewer instances, to assign a higher cost to minority class prediction errors, or to undersample the prevalent class. However, we show that one of the central causes of impaired generalization when learning with imbalanced data is the inherent manner in which ML models perform inference. These models have difficulty generalizing due to their heavy reliance on the magnitude of encoded signals. During inference, the models predict classes based on a combination of encoded signal magnitudes that linearly sum to the largest scalar. We demonstrate that even with aggressive data augmentation, which generally improves minority class prediction accuracy, parametric ML models still associate a class label with a limited number of feature combinations that sum to a prediction, which can affect generalization.
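
The inference step described in the abstract (encoded signal magnitudes that linearly sum to the largest scalar) can be illustrated with a small sketch. This is not the authors' code: it assumes a model whose final layer is linear over a latent feature vector, and the names `logit_contributions`, `top_features`, `W`, `b`, and `z` are hypothetical, with random values standing in for a trained model.

```python
import numpy as np


def logit_contributions(W, b, z):
    """Split each class logit into per-feature terms W[k, j] * z[j]."""
    contrib = W * z                    # shape (num_classes, num_features)
    logits = contrib.sum(axis=1) + b   # identical to W @ z + b
    return contrib, logits


def top_features(contrib, predicted, n=3):
    """Indices of the n latent features that add most to the predicted logit."""
    order = np.argsort(contrib[predicted])[::-1]
    return order[:n]


# Hypothetical stand-ins for a trained classification head and one encoded input.
rng = np.random.default_rng(0)
num_classes, num_features = 3, 8
W = rng.normal(size=(num_classes, num_features))
b = np.zeros(num_classes)
z = np.maximum(rng.normal(size=num_features), 0.0)  # non-negative, ReLU-style features

contrib, logits = logit_contributions(W, b, z)
predicted = int(np.argmax(logits))     # the class whose terms sum to the largest scalar
top = top_features(contrib, predicted)
print("predicted class:", predicted)
print("largest contributing features:", top)
```

Under these assumptions, inspecting `contrib[predicted]` shows how few latent features carry most of the winning logit, which is the kind of reliance on feature magnitude the abstract refers to.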

Related papers

Compute-Optimal LLMs Provably Generalize Better With Scale [102.29926217670926]
We develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime. We introduce a novel, fully empirical Freedman-type martingale concentration that tightens existing bounds by accounting for the variance of the loss function. We produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.
arXiv Detail & Related papers (2025-04-21T16:26:56Z)
A statistical theory of overfitting for imbalanced classification [0.6144680854063939]
We develop a statistical theory for high-dimensional imbalanced classification. We find that dimensionality induces truncation or skewing effects on the logit distribution. This phenomenon explains why the minority class is more severely affected by overfitting.
arXiv Detail & Related papers (2025-02-17T00:21:33Z)
Synthetic Feature Augmentation Improves Generalization Performance of Language Models [8.463273762997398]
Training and fine-tuning deep learning models on limited and imbalanced datasets poses substantial challenges. We propose augmenting features in the embedding space by generating synthetic samples using a range of techniques. We validate the effectiveness of this approach across multiple open-source text classification benchmarks.
arXiv Detail & Related papers (2025-01-11T04:31:18Z)
Non-Vacuous Generalization Bounds for Large Language Models [78.42762571499061]
We provide the first non-vacuous generalization bounds for pretrained large language models. We show that larger models have better generalization bounds and are more compressible than smaller models.
arXiv Detail & Related papers (2023-12-28T17:58:42Z)
Uncertainty-guided Boundary Learning for Imbalanced Social Event Detection [64.4350027428928]
We propose a novel uncertainty-guided class imbalance learning framework for imbalanced social event detection tasks. Our model significantly improves social event representation and classification tasks in almost all classes, especially those uncertain ones.
arXiv Detail & Related papers (2023-10-30T03:32:04Z)
When Noisy Labels Meet Long Tail Dilemmas: A Representation Calibration Method [40.25499257944916]
Real-world datasets are both noisily labeled and class-imbalanced. We propose a representation calibration method RCAL. We derive theoretical results to discuss the effectiveness of our representation calibration.
arXiv Detail & Related papers (2022-11-20T11:36:48Z)
On how to avoid exacerbating spurious correlations when models are overparameterized [33.315813572333745]
We show that VS-loss learns a model that is fair towards minorities even when spurious features are strong. Compared to previous works, our bounds hold for more general models, they are non-asymptotic, and they apply even in scenarios of extreme imbalance.
arXiv Detail & Related papers (2022-06-25T21:53:44Z)
GAN based Data Augmentation to Resolve Class Imbalance [0.0]
In many related tasks, the datasets have a very small number of observed fraud cases. This imbalance may affect a learning model's behavior, causing it to predict all labels as the majority class. We trained a Generative Adversarial Network (GAN) to generate a large number of convincing (and reliable) synthetic examples of the minority class.
arXiv Detail & Related papers (2022-06-12T21:21:55Z)
Throwing Away Data Improves Worst-Class Error in Imbalanced Classification [36.91428748713018]
Class imbalances pervade classification problems, yet their treatment differs in theory and practice. We take on the challenge of developing learning theory able to describe the worst-class error of classifiers over linearly-separable data.
arXiv Detail & Related papers (2022-05-23T23:43:18Z)
CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance. Sample re-weighting methods are popularly used to alleviate this data bias issue. We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
arXiv Detail & Related papers (2022-02-11T13:49:51Z)
Information-Theoretic Generalization Bounds for Iterative Semi-Supervised Learning [81.1071978288003]
In particular, we seek to understand the behaviour of the generalization error of iterative SSL algorithms using information-theoretic principles. Our theoretical results suggest that when the class conditional variances are not too large, the upper bound on the generalization error decreases monotonically with the number of iterations, but quickly saturates.
arXiv Detail & Related papers (2021-10-03T05:38:49Z)
Class-Wise Difficulty-Balanced Loss for Solving Class-Imbalance [6.875312133832079]
We propose a novel loss function named Class-wise Difficulty-Balanced loss. It dynamically distributes weights to each sample according to the difficulty of the class that the sample belongs to. The results show that CDB loss consistently outperforms the recently proposed loss functions on class-imbalanced datasets.
arXiv Detail & Related papers (2020-10-05T07:19:19Z)
Deducing neighborhoods of classes from a fitted model [68.8204255655161]
In this article a new kind of interpretable machine learning method is presented. It can help to understand the partitioning of the feature space into predicted classes in a classification model using quantile shifts. Basically, real data points (or specific points of interest) are used and the changes of the prediction after slightly raising or decreasing specific features are observed.
arXiv Detail & Related papers (2020-09-11T16:35:53Z)
An Investigation of Why Overparameterization Exacerbates Spurious Correlations [98.3066727301239]
We identify two key properties of the training data that drive this behavior. We show how the inductive bias of models towards "memorizing" fewer examples can cause overparameterization to hurt.
arXiv Detail & Related papers (2020-05-09T01:59:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.