Comparing Importance Sampling Based Methods for Mitigating the Effect of
Class Imbalance
- URL: http://arxiv.org/abs/2402.18742v1
- Date: Wed, 28 Feb 2024 22:52:27 GMT
- Title: Comparing Importance Sampling Based Methods for Mitigating the Effect of
Class Imbalance
- Authors: Indu Panigrahi and Richard Zhu
- Abstract summary: We compare three techniques that derive from importance sampling: loss reweighting, undersampling, and oversampling.
We find that up-weighting the loss and undersampling have a negligible effect on the performance of underrepresented classes.
Our findings also indicate that there may be some redundancy in the data of the Planet dataset.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most state-of-the-art computer vision models heavily depend on data. However,
many datasets exhibit extreme class imbalance which has been shown to
negatively impact model performance. Among the training-time and
data-generation solutions that have been explored, one subset that leverages
existing data is importance sampling. A good deal of this work focuses
primarily on the CIFAR-10 and CIFAR-100 datasets, which fail to be
representative of the scale, composition, and complexity of current
state-of-the-art datasets. In this work, we explore and compare three
techniques that derive from importance sampling: loss reweighting,
undersampling, and oversampling. Specifically, we compare the effect of these
techniques on the performance of two encoders on an impactful satellite imagery
dataset, Planet's Amazon Rainforest dataset, in preparation for another work.
Furthermore, we perform supplemental experimentation on a scene classification
dataset, ADE20K, to test on a contrasting domain and clarify our results.
Across both types of encoders, we find that up-weighting the loss and
undersampling have a negligible effect on the performance of underrepresented
classes. Additionally, our results suggest oversampling generally improves
performance for the same underrepresented classes. Interestingly, our findings
also indicate that there may exist some redundancy in the data of the Planet
dataset. Our work aims to provide a foundation for further work on the Planet
dataset and similar domain-specific datasets. We open-source our code at
https://github.com/RichardZhu123/514-class-imbalance for future work on other
satellite imagery datasets as well.
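For readers who want to try the three techniques, the sketch below shows one common way to set up loss reweighting, undersampling, and oversampling in PyTorch for a generic single-label classification task. This is a minimal illustration under stated assumptions, not the authors' implementation: the class counts, the helper name `make_sampler`, and the sampler settings are placeholders, and the repository linked above is the reference for the paper's actual setup.

```python
# Minimal PyTorch sketch of the three importance-sampling variants compared
# in the paper. NOT the authors' code: class counts, num_samples, and the
# helper name `make_sampler` are illustrative assumptions.
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

# Illustrative per-class example counts for an imbalanced dataset.
class_counts = torch.tensor([9000.0, 700.0, 300.0])

# (1) Loss reweighting: weight each class's loss by its inverse frequency,
# so errors on rare classes contribute more to the gradient.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=class_weights)

# (2)/(3) Undersampling and oversampling: draw training examples with
# probability proportional to inverse class frequency. Setting num_samples
# below the dataset size undersamples majority classes; setting it at or
# above the dataset size (with replacement) oversamples minority classes.
def make_sampler(labels: torch.Tensor, num_samples: int) -> WeightedRandomSampler:
    counts = torch.bincount(labels).float()
    per_sample_weight = (1.0 / counts)[labels]
    return WeightedRandomSampler(per_sample_weight,
                                 num_samples=num_samples,
                                 replacement=True)

# Example usage (dataset and labels assumed to exist):
#   sampler = make_sampler(labels, num_samples=len(labels))
#   loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```

Note that the Planet Amazon labels are multi-label in practice, so the paper's setting would pair per-class weights with a binary objective instead; the sketch keeps a single-label setup for brevity.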
Related papers
- UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591] (2024-03-22)
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
- Class Imbalance in Object Detection: An Experimental Diagnosis and Study of Mitigation Strategies [0.5439020425818999] (2024-03-11)
This study introduces a benchmarking framework utilizing the YOLOv5 single-stage detector to address the problem of foreground-foreground class imbalance.
We scrutinize three established techniques: sampling, loss weighting, and data augmentation.
Our comparative analysis reveals that sampling and loss reweighting methods, though beneficial in two-stage detector settings, are less effective at improving YOLOv5's performance.
- Feedback-guided Data Synthesis for Imbalanced Classification [10.836265321046561] (2023-09-29)
We introduce a framework for augmenting static datasets with useful synthetic samples.
We find that the samples must be close to the support of the real data of the task at hand and be sufficiently diverse.
On ImageNet-LT, we achieve state-of-the-art results, with over 4 percent improvement on underrepresented classes.
- DatasetEquity: Are All Samples Created Equal? In The Quest For Equity Within Datasets [4.833815605196965] (2023-08-19)
This paper presents a novel method for addressing data imbalance in machine learning.
It computes sample likelihoods based on image appearance using deep perceptual embeddings and clustering.
It then uses these likelihoods to weight samples differently during training with a proposed Generalized Focal Loss function.
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244] (2022-08-16)
This work proposes a synthetic data generation pipeline to address the difficulties and domain gaps present in simulated datasets.
We show that, using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
- A Data-Based Perspective on Transfer Learning [76.30206800557411] (2022-07-12)
We take a closer look at the role of the source dataset's composition in transfer learning.
Our framework gives rise to new capabilities such as pinpointing transfer learning brittleness.
- Free Lunch for Co-Saliency Detection: Context Adjustment [14.688461235328306] (2021-08-04)
We propose a "cost-free" group-cut-paste (GCP) procedure to leverage images from off-the-shelf saliency detection datasets and synthesize new samples.
We collect a novel dataset called Context Adjustment Training. Its two variants, CAT and CAT+, consist of 16,750 and 33,500 images, respectively.
- Comparing Test Sets with Item Response Theory [53.755064720563] (2021-06-01)
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
- Salient Objects in Clutter [130.63976772770368] (2021-05-07)
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
- Negative Data Augmentation [127.28042046152954] (2021-02-09)
We show that negative data augmentation (NDA) samples provide information on the support of the data distribution.
We introduce a new GAN training objective that uses NDA as an additional source of synthetic data for the discriminator.
Empirically, models trained with our method achieve improved conditional/unconditional image generation along with improved anomaly detection capabilities.