Integrating Unsupervised Clustering and Label-specific Oversampling to
Tackle Imbalanced Multi-label Data
- URL: http://arxiv.org/abs/2109.12421v1
- Date: Sat, 25 Sep 2021 19:00:00 GMT
- Title: Integrating Unsupervised Clustering and Label-specific Oversampling to
Tackle Imbalanced Multi-label Data
- Authors: Payel Sadhukhan, Arjun Pakrashi, Sarbani Palit, Brian Mac Namee
- Abstract summary: Clustering is performed to identify the key distinct and locally connected regions of a multi-label dataset.
Only the minority points within a cluster are used to generate the synthetic minority points that are used for oversampling.
Experiments using 12 multi-label datasets and several multi-label algorithms show that the proposed method performs very well.
- Score: 13.888344214818733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is often a mixture of very frequent labels and very infrequent labels
in multi-label datasets. This variation in label frequency, a type of class
imbalance, creates a significant challenge for building efficient multi-label
classification algorithms. In this paper, we tackle this problem by proposing a
minority class oversampling scheme, UCLSO, which integrates Unsupervised
Clustering and Label-Specific data Oversampling. Clustering is performed to
identify the key distinct and locally connected regions of a multi-label
dataset (irrespective of the label information). Next, for each label, we
explore the distributions of minority points in the cluster sets. Only the
minority points within a cluster are used to generate the synthetic minority
points that are used for oversampling. Even though the cluster set is the same
across all labels, the distributions of the synthetic minority points will vary
across the labels. The training dataset is augmented with the set of
label-specific synthetic minority points, and classifiers are trained to
predict the relevance of each label independently. Experiments using 12
multi-label datasets and several multi-label algorithms show that the proposed
method performs very well compared to the other competing algorithms.
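
The abstract outlines a three-step pipeline: one label-agnostic clustering of the feature space, label-specific synthetic minority generation restricted to points that share a cluster, and an independent (binary relevance) classifier per label. The following is a minimal illustrative sketch of such a pipeline, not the authors' implementation: the use of KMeans, SMOTE-style linear interpolation, logistic regression, and all function names and parameters here are assumptions.

```python
# A minimal sketch of a UCLSO-style pipeline. KMeans, SMOTE-style linear
# interpolation, logistic regression, and every name/parameter below are
# illustrative assumptions, not the paper's published implementation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def uclso_fit(X, Y, n_clusters=8, seed=0):
    """X: (n, d) feature matrix; Y: (n, q) binary label matrix.
    Returns one binary-relevance classifier per label, each trained on
    label-specific, cluster-aware oversampled data."""
    rng = np.random.default_rng(seed)
    # Step 1: a single, label-agnostic clustering of the feature space.
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(X)
    models = []
    for j in range(Y.shape[1]):
        y = Y[:, j]
        n_min, n_maj = int((y == 1).sum()), int((y == 0).sum())
        n_new = max(n_maj - n_min, 0)   # synthesize until classes balance
        X_syn = []
        # Step 2: for this label, interpolate only between minority
        # (positive) points that fall inside the same cluster.
        for c in range(n_clusters):
            local = X[(clusters == c) & (y == 1)]
            if len(local) < 2:
                continue                # too few local minority points
            # This cluster's share of synthetic points is proportional
            # to its share of the label's minority mass.
            for _ in range(int(round(n_new * len(local) / max(n_min, 1)))):
                a, b = local[rng.choice(len(local), size=2, replace=False)]
                X_syn.append(a + rng.random() * (b - a))  # SMOTE-style point
        if X_syn:
            X_aug = np.vstack([X, np.asarray(X_syn)])
            y_aug = np.concatenate([y, np.ones(len(X_syn), dtype=int)])
        else:
            X_aug, y_aug = X, y
        # Step 3: binary relevance -- an independent classifier per label.
        models.append(LogisticRegression(max_iter=1000).fit(X_aug, y_aug))
    return models

# Toy usage: 300 points, 3 labels with positive rates near 5%, 20%, 50%.
X = np.random.default_rng(1).normal(size=(300, 5))
Y = (np.random.default_rng(2).random((300, 3)) < [0.05, 0.2, 0.5]).astype(int)
scores = np.column_stack([m.predict_proba(X)[:, 1] for m in uclso_fit(X, Y)])
```

Because the clustering in step 1 is shared, its cost is paid once, while step 2 tailors the synthetic points to each label's own minority distribution; this mirrors the abstract's observation that the cluster set is fixed across labels but the synthetic minority distributions vary per label.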
Related papers
- Label Cluster Chains for Multi-Label Classification [2.072831155509228]
Multi-label classification is a type of supervised machine learning that can simultaneously assign multiple labels to an instance.
We propose a method to chain disjoint correlated label clusters obtained by applying a partition method in the label space.
Our proposal shows that learning and chaining disjoint correlated label clusters can better explore and learn label correlations.
arXiv Detail & Related papers (2024-11-01T11:16:37Z)
- Exploiting Conjugate Label Information for Multi-Instance Partial-Label Learning [61.00359941983515]
Multi-instance partial-label learning (MIPL) addresses scenarios where each training sample is represented as a multi-instance bag associated with a candidate label set containing one true label and several false positives.
ELIMIPL exploits the conjugate label information to improve the disambiguation performance.
arXiv Detail & Related papers (2024-08-26T15:49:31Z)
- Generalized Category Discovery with Clustering Assignment Consistency [56.92546133591019]
Generalized category discovery (GCD) is a recently proposed open-world task.
We propose a co-training-based framework that encourages clustering consistency.
Our method achieves state-of-the-art performance on three generic benchmarks and three fine-grained visual recognition datasets.
arXiv Detail & Related papers (2023-10-30T00:32:47Z)
- Disambiguated Attention Embedding for Multi-Instance Partial-Label Learning [68.56193228008466]
In many real-world tasks, the concerned objects can be represented as a multi-instance bag associated with a candidate label set.
Existing MIPL approaches follow the instance-space paradigm by assigning the augmented candidate label set of a bag to each of its instances and aggregating bag-level labels from instance-level labels.
We propose an intuitive algorithm named DEMIPL, i.e., Disambiguated attention Embedding for Multi-Instance Partial-Label learning.
arXiv Detail & Related papers (2023-05-26T13:25:17Z)
- Class-Distribution-Aware Pseudo Labeling for Semi-Supervised Multi-Label Learning [97.88458953075205]
Pseudo-labeling has emerged as a popular and effective approach for utilizing unlabeled data.
This paper proposes a novel solution called Class-Aware Pseudo-Labeling (CAP) that performs pseudo-labeling in a class-aware manner.
arXiv Detail & Related papers (2023-05-04T12:52:18Z)
- Bridging the Gap between Model Explanations in Partially Annotated Multi-label Classification [85.76130799062379]
We study how false negative labels affect the model's explanation.
We propose to boost the attribution scores of the model trained with partial labels to make its explanation resemble that of the model trained with full labels.
arXiv Detail & Related papers (2023-04-04T14:00:59Z)
- An Effective Approach for Multi-label Classification with Missing Labels [8.470008570115146]
We propose a pseudo-label based approach to reduce the cost of annotation without bringing additional complexity to the classification networks.
By designing a novel loss function, we are able to relax the requirement that each instance must contain at least one positive label.
We show that our method can handle the imbalance between positive labels and negative labels, while still outperforming existing missing-label learning approaches.
arXiv Detail & Related papers (2022-10-24T23:13:57Z)
- Evaluating Multi-label Classifiers with Noisy Labels [0.7868449549351487]
In the real world, it is more common to deal with noisy datasets than clean datasets.
We present a Context-Based Multi-Label-Classifier (CbMLC) that effectively handles noisy labels.
We show CbMLC yields substantial improvements over the previous methods in most cases.
arXiv Detail & Related papers (2021-02-16T19:50:52Z)
- Rank-Consistency Deep Hashing for Scalable Multi-Label Image Search [90.30623718137244]
We propose a novel deep hashing method for scalable multi-label image search.
A new rank-consistency objective is applied to align the similarity orders from two spaces.
A powerful loss function is designed to penalize the samples whose semantic similarity and hamming distance are mismatched.
arXiv Detail & Related papers (2021-02-02T13:46:58Z)
- Multi-Label Sampling based on Local Label Imbalance [7.355362369511579]
Class imbalance is an inherent characteristic of multi-label data that hinders most multi-label learning methods.
Existing multi-label sampling approaches alleviate the global imbalance of multi-label datasets.
However, it is actually the imbalance level within the local neighbourhood of minority class examples that plays a key role in performance degradation (a sketch of one way to measure this follows the list).
arXiv Detail & Related papers (2020-05-07T04:14:23Z)
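
The last entry above argues that local, neighbourhood-level imbalance matters more than global imbalance. As a hedged sketch (one plausible reading of that quantity, not the paper's code; the choice of k and all names are illustrative), the local imbalance around a label's minority points can be measured as the share of majority-class points among each minority point's k nearest neighbours:

```python
# Hedged sketch: per-instance local imbalance for one label column.
# The function name, the choice of k, and the use of NearestNeighbors
# are assumptions for illustration.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_imbalance(X, y, k=5):
    """X: (n, d) features; y: (n,) binary labels for one label.
    For each minority (positive) instance, return the fraction of its
    k nearest neighbours that carry the majority (negative) label."""
    # k+1 neighbours because each query point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X[y == 1])
    # Drop the self-neighbour, then average over the k true neighbours.
    return (y[idx[:, 1:]] == 0).mean(axis=1)
```

A value near 1 marks a positive example surrounded entirely by negatives (locally hard), while a value near 0 marks one inside a safe minority region; sampling methods in this line of work weight such scores when deciding where to oversample or undersample.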
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.