Improving Contrastive Learning on Imbalanced Seed Data via Open-World
Sampling
- URL: http://arxiv.org/abs/2111.01004v1
- Date: Mon, 1 Nov 2021 15:09:41 GMT
- Title: Improving Contrastive Learning on Imbalanced Seed Data via Open-World
Sampling
- Authors: Ziyu Jiang, Tianlong Chen, Ting Chen, Zhangyang Wang
- Abstract summary: We present an open-world unlabeled data sampling framework called Model-Aware K-center (MAK).
MAK follows three simple principles: tailness, proximity, and diversity.
We demonstrate that MAK can consistently improve both the overall representation quality and the class balancedness of the learned features.
- Score: 96.8742582581744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive learning approaches have achieved great success in learning
visual representations with few labels of the target classes. That implies a
tantalizing possibility of scaling them up beyond a curated "seed" benchmark,
to incorporate more unlabeled images from internet-scale external sources to
enhance their performance. However, in practice, a larger amount of unlabeled
data requires more computing resources due to the bigger model size and the
longer training needed. Moreover, open-world unlabeled data usually follows an
implicit long-tail class or attribute distribution, much of which also does not
belong to the target classes. Blindly leveraging all unlabeled data can hence
lead to data imbalance as well as distraction issues. This motivates us to
seek a principled approach to strategically select unlabeled data from an
external source, in order to learn generalizable, balanced and diverse
representations for relevant classes. In this work, we present an open-world
unlabeled data sampling framework called Model-Aware K-center (MAK), which
follows three simple principles: (1) tailness, which encourages sampling of
examples from tail classes, by sorting the empirical contrastive loss
expectation (ECLE) of samples over random data augmentations; (2) proximity,
which rejects the out-of-distribution outliers that may distract training; and
(3) diversity, which ensures diversity in the set of sampled examples.
Empirically, using ImageNet-100-LT (without labels) as the seed dataset and two
"noisy" external data sources, we demonstrate that MAK can consistently improve
both the overall representation quality and the class balancedness of the
learned features, as evaluated via linear classifier evaluation on full-shot
and few-shot settings. The code is available at:
https://github.com/VITA-Group/MAK
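For intuition, the three principles can be combined into a single sampling routine roughly as follows. This is a minimal sketch, assuming a feature-extractor `model`, a batch-wise augmentation callable `augment`, and pools small enough to score in one pass; the function names, the proximity quantile, and the 2k tailness shortlist are illustrative choices and not necessarily how the released implementation at the URL above works.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def ecle_scores(model, images, augment, num_trials=4, temperature=0.2):
    """Tailness proxy: empirical contrastive loss expectation (ECLE), i.e. the
    average SimCLR-style loss over several random augmentation pairs."""
    n = images.size(0)
    losses = torch.zeros(n, device=images.device)
    for _ in range(num_trials):
        z1 = F.normalize(model(augment(images)), dim=1)        # view 1: (N, D)
        z2 = F.normalize(model(augment(images)), dim=1)        # view 2: (N, D)
        logits = z1 @ torch.cat([z2, z1]).t() / temperature    # (N, 2N)
        idx = torch.arange(n, device=images.device)
        logits[idx, n + idx] = float("-inf")                   # mask self-similarity
        losses += F.cross_entropy(logits, idx, reduction="none")  # positive = other view
    return losses / num_trials


@torch.no_grad()
def mak_select(model, seed_images, pool_images, augment, k, proximity_q=0.9):
    """Pick k external samples: proximity filter, tailness ranking,
    then greedy K-center selection for diversity."""
    seed_feats = F.normalize(model(seed_images), dim=1)
    pool_feats = F.normalize(model(pool_images), dim=1)

    # (2) proximity: reject candidates far from the seed feature distribution
    dist_to_seed = torch.cdist(pool_feats, seed_feats).min(dim=1).values
    keep = dist_to_seed <= torch.quantile(dist_to_seed, proximity_q)
    candidates = keep.nonzero(as_tuple=True)[0]

    # (1) tailness: shortlist the candidates with the highest ECLE
    scores = ecle_scores(model, pool_images[candidates], augment)
    candidates = candidates[scores.argsort(descending=True)[: 2 * k]]

    # (3) diversity: greedy K-center, repeatedly adding the candidate farthest
    # from the current centers (the seed features serve as initial centers)
    centers, selected = seed_feats, []
    for _ in range(k):
        if candidates.numel() == 0:
            break
        best = int(torch.cdist(pool_feats[candidates], centers).min(dim=1).values.argmax())
        chosen = candidates[best]
        selected.append(int(chosen))
        centers = torch.cat([centers, pool_feats[chosen].unsqueeze(0)])
        candidates = torch.cat([candidates[:best], candidates[best + 1:]])
    return selected  # indices into the external pool
```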
Related papers
- Continuous Contrastive Learning for Long-Tailed Semi-Supervised Recognition [50.61991746981703]
Current state-of-the-art LTSSL approaches rely on high-quality pseudo-labels for large-scale unlabeled data.
This paper introduces a novel probabilistic framework that unifies various recent proposals in long-tail learning.
We introduce a continuous contrastive learning method, CCL, extending our framework to unlabeled data using reliable and smoothed pseudo-labels.
arXiv Detail & Related papers (2024-10-08T15:06:10Z)
- Safe Semi-Supervised Contrastive Learning Using In-Distribution Data as Positive Examples [3.4546761246181696]
We propose a self-supervised contrastive learning approach to fully exploit a large amount of unlabeled data.
The results show that self-supervised contrastive learning significantly improves classification accuracy.
arXiv Detail & Related papers (2024-08-03T22:33:13Z)
- DatasetEquity: Are All Samples Created Equal? In The Quest For Equity Within Datasets [4.833815605196965]
This paper presents a novel method for addressing data imbalance in machine learning.
It computes sample likelihoods based on image appearance using deep perceptual embeddings and clustering.
It then uses these likelihoods to weight samples differently during training with a proposed Generalized Focal Loss function.
arXiv Detail & Related papers (2023-08-19T02:11:49Z)
- A soft nearest-neighbor framework for continual semi-supervised learning [35.957577587090604]
We propose an approach for continual semi-supervised learning where not all the data samples are labeled.
We leverage the power of nearest-neighbors to nonlinearly partition the feature space and flexibly model the underlying data distribution.
Our method works well on both low and high resolution images and scales seamlessly to more complex datasets.
arXiv Detail & Related papers (2022-12-09T20:03:59Z)
- Constructing Balance from Imbalance for Long-tailed Image Recognition [50.6210415377178]
The imbalance between majority (head) classes and minority (tail) classes severely skews data-driven deep neural networks.
Previous methods tackle data imbalance from the viewpoints of data distribution, feature space, and model design.
We propose a concise paradigm that progressively adjusts the label space and divides the head classes from the tail classes.
Our proposed model also provides a feature evaluation method and paves the way for long-tailed feature learning.
arXiv Detail & Related papers (2022-08-04T10:22:24Z)
- Few-Shot Non-Parametric Learning with Deep Latent Variable Model [50.746273235463754]
We propose Non-Parametric learning by Compression with Latent Variables (NPC-LV).
NPC-LV is a learning framework for any dataset with abundant unlabeled data but very few labeled examples.
We show that NPC-LV outperforms supervised methods on image classification on all three datasets in the low-data regime.
arXiv Detail & Related papers (2022-06-23T09:35:03Z)
- OpenMatch: Open-set Consistency Regularization for Semi-supervised Learning with Outliers [71.08167292329028]
We propose a novel Open-set Semi-Supervised Learning (OSSL) approach called OpenMatch.
OpenMatch unifies FixMatch with novelty detection based on one-vs-all (OVA) classifiers.
It achieves state-of-the-art performance on three datasets, and even outperforms a fully supervised model in detecting outliers unseen in unlabeled data on CIFAR10.
arXiv Detail & Related papers (2021-05-28T23:57:15Z)
- Out-distribution aware Self-training in an Open World Setting [62.19882458285749]
We leverage unlabeled data in an open world setting to further improve prediction performance.
We introduce out-distribution aware self-training, which includes a careful sample selection strategy.
Our classifiers are by design out-distribution aware and can thus distinguish task-related inputs from unrelated ones.
arXiv Detail & Related papers (2020-12-21T12:25:04Z)
- Instance Credibility Inference for Few-Shot Learning [45.577880041135785]
Few-shot learning aims to recognize new objects with extremely limited training data for each category.
This paper presents a simple statistical approach, dubbed Instance Credibility Inference (ICI) to exploit the distribution support of unlabeled instances for few-shot learning.
Our simple approach establishes new state-of-the-art results on four widely used few-shot learning benchmark datasets.
arXiv Detail & Related papers (2020-03-26T12:01:15Z)