BASIL: Balanced Active Semi-supervised Learning for Class Imbalanced
Datasets
- URL: http://arxiv.org/abs/2203.05651v1
- Date: Thu, 10 Mar 2022 21:34:08 GMT
- Title: BASIL: Balanced Active Semi-supervised Learning for Class Imbalanced
Datasets
- Authors: Suraj Kothawade, Pavan Kumar Reddy, Ganesh Ramakrishnan, Rishabh Iyer
- Abstract summary: Current semi-supervised learning (SSL) methods assume a balance between the number of data points available for each class in both the labeled and the unlabeled data sets.
We propose BASIL, a novel algorithm that optimizes the submodular mutual information (SMI) functions in a per-class fashion to gradually select a balanced dataset in an active learning loop.
- Score: 14.739359755029353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current semi-supervised learning (SSL) methods assume a balance between the
number of data points available for each class in both the labeled and the
unlabeled data sets. However, there naturally exists a class imbalance in most
real-world datasets. It is known that training models on such imbalanced
datasets leads to biased models, which in turn lead to biased predictions
towards the more frequent classes. This issue is further pronounced in SSL
methods, as they would use this biased model to obtain pseudo-labels (on the
unlabeled data) during training. In this paper, we tackle this problem by
attempting to select a balanced labeled dataset for SSL that would result in an
unbiased model. Unfortunately, acquiring a balanced labeled dataset from a
class imbalanced distribution in one shot is challenging. We propose BASIL
(Balanced Active Semi-supervIsed Learning), a novel algorithm that optimizes
the submodular mutual information (SMI) functions in a per-class fashion to
gradually select a balanced dataset in an active learning loop. Importantly,
our technique can be efficiently used to improve the performance of any SSL
method. Our experiments on Path-MNIST and Organ-MNIST medical datasets for a
wide array of SSL methods show the effectiveness of BASIL. Furthermore, we
observe that BASIL outperforms state-of-the-art diversity- and uncertainty-based
active learning methods, since the SMI functions select a more balanced
dataset.
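To make the per-class selection loop concrete, here is a minimal Python sketch. It is a reconstruction under stated assumptions, not the authors' implementation: it stands in a facility-location-style SMI surrogate over cosine similarities of feature embeddings, clips negative similarities to keep the objective monotone, and shows a single acquisition round; all function names and parameters are illustrative.
```python
import numpy as np

def cosine_sim(X, Y):
    """Pairwise cosine similarity between rows of X and rows of Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

def greedy_smi_select(pool_emb, query_emb, budget):
    """Greedily maximize the facility-location SMI surrogate
    I(A; Q) = sum_q max_{a in A} sim(a, q) under a cardinality budget.
    (An illustrative stand-in for the SMI functions used in the paper.)"""
    sim = np.clip(cosine_sim(pool_emb, query_emb), 0.0, None)
    covered = np.zeros(sim.shape[1])        # best coverage of each query so far
    selected = []
    for _ in range(budget):
        gains = np.clip(sim - covered, 0.0, None).sum(axis=1)  # marginal gains
        if selected:
            gains[selected] = -1.0          # never re-pick a point
        idx = int(np.argmax(gains))
        selected.append(idx)
        covered = np.maximum(covered, sim[idx])
    return selected

def basil_round(pool_emb, labeled_emb, labeled_y, n_classes, per_class_budget):
    """One BASIL-style acquisition round: run the per-class SMI selection for
    every class so the queried batch is (approximately) class balanced.
    For brevity the pool is not shrunk between classes; a real loop would
    drop already-selected points before moving to the next class."""
    batch = []
    for c in range(n_classes):
        query = labeled_emb[labeled_y == c]  # labeled exemplars of class c
        batch.extend(greedy_smi_select(pool_emb, query, per_class_budget))
    return sorted(set(batch))                # pool indices to label next
```
In the full active learning loop, one would retrain the underlying SSL model on the newly labeled, more balanced set and refresh the embeddings before the next round; that outer loop is omitted here.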
Related papers
- A Closer Look at Benchmarking Self-Supervised Pre-training with Image Classification [51.35500308126506]
Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels.
We study how classification-based evaluation protocols for SSL correlate and how well they predict downstream performance on different dataset types.
arXiv Detail & Related papers (2024-07-16T23:17:36Z)
- Learning Label Refinement and Threshold Adjustment for Imbalanced Semi-Supervised Learning [6.904448748214652]
Semi-supervised learning algorithms struggle to perform well when exposed to imbalanced training data.
We introduce SEmi-supervised learning with pseudo-label optimization based on VALidation data (SEVAL).
SEVAL adapts to specific tasks with improved pseudo-label accuracy and ensures pseudo-label correctness on a per-class basis; a generic sketch of the thresholding idea follows.
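Read generically, per-class thresholds can be fit from a held-out validation split by taking, for each class, the smallest confidence cutoff whose pseudo-labels reach a target precision. The sketch below illustrates only that recipe; SEVAL's actual refinement and threshold learning differ, and the grid and precision target are assumptions.
```python
import numpy as np

def per_class_thresholds(val_probs, val_labels, n_classes,
                         grid=np.linspace(0.50, 0.99, 50),
                         target_precision=0.95):
    """For each class, pick the smallest confidence cutoff whose pseudo-labels
    reach `target_precision` on validation data. Generic sketch, not SEVAL."""
    preds = val_probs.argmax(axis=1)
    conf = val_probs.max(axis=1)
    thresholds = np.full(n_classes, grid[-1])  # fall back to strictest cutoff
    for c in range(n_classes):
        in_class = preds == c
        for t in grid:
            keep = in_class & (conf >= t)
            if not keep.any():
                continue
            precision = float((val_labels[keep] == c).mean())
            if precision >= target_precision:
                thresholds[c] = t
                break
    return thresholds
```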
arXiv Detail & Related papers (2024-07-07T13:46:22Z)
- BEM: Balanced and Entropy-based Mix for Long-Tailed Semi-Supervised Learning [21.53320689054414]
This paper introduces the Balanced and Entropy-based Mix (BEM), a pioneering mixing approach that re-balances the class distribution in terms of both data quantity and uncertainty.
Experimental results show that BEM significantly enhances various LTSSL frameworks and achieves state-of-the-art performances across multiple benchmarks.
arXiv Detail & Related papers (2024-04-01T15:31:04Z)
- On Pseudo-Labeling for Class-Mismatch Semi-Supervised Learning [50.48888534815361]
In this paper, we empirically analyze Pseudo-Labeling (PL) in class-mismatched SSL.
PL is a simple and representative SSL method that transforms SSL problems into supervised learning by creating pseudo-labels for unlabeled data.
We propose to improve PL in class-mismatched SSL with two components -- Re-balanced Pseudo-Labeling (RPL) and Semantic Exploration Clustering (SEC); the vanilla PL baseline is sketched below for reference.
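For reference, the vanilla PL baseline that RPL and SEC improve on fits in a few lines; the 0.95 threshold is a common convention, not a value taken from the paper.
```python
import numpy as np

def pseudo_label(unlabeled_probs, threshold=0.95):
    """Vanilla confidence-thresholded pseudo-labeling: trust the model's own
    prediction on an unlabeled point whenever its top-class probability clears
    the threshold. Under class imbalance, head classes clear the bar far more
    often, which is exactly the bias RPL-style re-balancing targets."""
    conf = unlabeled_probs.max(axis=1)
    labels = unlabeled_probs.argmax(axis=1)
    keep = conf >= threshold
    return np.flatnonzero(keep), labels[keep]  # (pool indices, pseudo-labels)
```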
arXiv Detail & Related papers (2023-01-15T03:21:59Z)
- An Embarrassingly Simple Baseline for Imbalanced Semi-Supervised Learning [103.65758569417702]
Semi-supervised learning (SSL) has shown great promise in leveraging unlabeled data to improve model performance.
We consider a more realistic and challenging setting called imbalanced SSL, where imbalanced class distributions occur in both labeled and unlabeled data.
We study a simple yet overlooked baseline -- SimiS -- which tackles data imbalance by simply supplementing labeled data with pseudo-labels.
arXiv Detail & Related papers (2022-11-20T21:18:41Z)
- CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are popularly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data; a classic hand-designed baseline is sketched below for contrast.
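For contrast with the learned weighting, here is one classic hand-designed re-weighting rule, the "effective number of samples" scheme of Cui et al. (2019); it is a sketch of the kind of fixed mapping CMW-Net replaces, not CMW-Net itself.
```python
import numpy as np

def class_balanced_weights(labels, n_classes, beta=0.999):
    """Per-sample weights from the effective-number-of-samples scheme
    (Cui et al., 2019): E_n = (1 - beta^n) / (1 - beta), weight ~ 1 / E_n.
    A fixed, hand-designed rule; CMW-Net instead learns the mapping."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    effective = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    per_class = 1.0 / np.maximum(effective, 1e-12)
    per_class *= n_classes / per_class.sum()   # normalize to mean weight 1
    return per_class[labels]                   # one weight per training sample
```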
arXiv Detail & Related papers (2022-02-11T13:49:51Z)
- ABC: Auxiliary Balanced Classifier for Class-imbalanced Semi-supervised Learning [6.866717993664787]
Existing semi-supervised learning (SSL) algorithms assume class-balanced datasets.
We propose a scalable class-imbalanced SSL algorithm that can effectively use unlabeled data.
The proposed algorithm achieves state-of-the-art performance in various class-imbalanced SSL experiments using four benchmark datasets.
arXiv Detail & Related papers (2021-10-20T04:07:48Z)
- Self-supervised Learning is More Robust to Dataset Imbalance [65.84339596595383]
We investigate self-supervised learning under dataset imbalance.
Off-the-shelf self-supervised representations are already more robust to class imbalance than supervised representations.
We devise a re-weighted regularization technique that consistently improves the SSL representation quality on imbalanced datasets.
arXiv Detail & Related papers (2021-10-11T06:29:56Z)
- Distribution Aligning Refinery of Pseudo-label for Imbalanced Semi-supervised Learning [126.31716228319902]
We develop the Distribution Aligning Refinery of Pseudo-label (DARP) algorithm.
We show that DARP is provably and efficiently compatible with state-of-the-art SSL schemes; the underlying alignment idea is sketched below.
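DARP itself refines pseudo-labels by solving a convex optimization; the simpler distribution-alignment trick below (cf. ReMixMatch) conveys the same intuition of pulling the pseudo-label marginal toward a target class distribution, and is only a loose sketch.
```python
import numpy as np

def align_pseudo_labels(probs, target_dist):
    """Distribution alignment: re-scale each pseudo-label distribution by the
    ratio of target to model class marginals, then renormalize. A lightweight
    relative of DARP's convex-optimization refinement; sketch only."""
    model_marginal = probs.mean(axis=0)                  # model's class usage
    scaled = probs * (target_dist / np.maximum(model_marginal, 1e-12))
    return scaled / scaled.sum(axis=1, keepdims=True)    # rows sum to 1 again
```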
arXiv Detail & Related papers (2020-07-17T09:16:05Z)
- Class-Imbalanced Semi-Supervised Learning [33.94685366079589]
Semi-Supervised Learning (SSL) has achieved great success in overcoming the difficulties of labeling and making full use of unlabeled data.
We introduce a task of class-imbalanced semi-supervised learning (CISSL), which refers to semi-supervised learning with class-imbalanced data.
Our method shows better performance than the conventional methods in the CISSL environment.
arXiv Detail & Related papers (2020-02-17T07:48:47Z)