Masked Unsupervised Self-training for Zero-shot Image Classification
- URL: http://arxiv.org/abs/2206.02967v1
- Date: Tue, 7 Jun 2022 02:03:06 GMT
- Title: Masked Unsupervised Self-training for Zero-shot Image Classification
- Authors: Junnan Li, Silvio Savarese, Steven C.H. Hoi
- Abstract summary: Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
- Score: 98.23094305347709
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art computer vision models are mostly trained with supervised
learning using human-labeled images, which limits their scalability due to the
expensive annotation cost. While self-supervised representation learning has
achieved impressive progress, it still requires a second stage of finetuning on
labeled data. On the other hand, models pre-trained with large-scale text-image
supervision (e.g., CLIP) have enabled zero-shot transfer to downstream image
classification tasks. However, the zero-shot performance of CLIP-like models
is often insufficient for real-world adoption. In this paper, we aim to
leverage the abundant unlabeled data to improve the performance of a
pre-trained zero-shot classifier on downstream tasks. We propose Masked
Unsupervised Self-Training (MUST), a new approach which leverages two different
and complementary sources of supervision: pseudo-labels and raw images. MUST
jointly optimizes three objectives to learn both class-level global features and
pixel-level local features, and enforces a regularization between the two. We
demonstrate the efficacy of MUST on 8 downstream tasks across a variety of
domains, where it improves upon CLIP by a large margin and narrows the
performance gap between unsupervised and supervised classification. For
instance, MUST achieves a zero-shot top-1 accuracy of 77.7% on ImageNet using
ViT-B, +9.4% higher than CLIP. Our code is available at
https://github.com/salesforce/MUST.
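To make the recipe described in the abstract concrete, below is a minimal, self-contained PyTorch sketch. It is not the released salesforce/MUST implementation: the tiny encoder, the linear classification head (standing in for CLIP's text-embedding classifier), the masking ratio, and the unit loss weights are all illustrative assumptions. It shows the three joint objectives: confident pseudo-labels from an EMA teacher (class-level, global), masked-patch reconstruction (pixel-level, local), and a global-local alignment regularizer.

```python
# Minimal sketch of a MUST-style training step; assumptions noted in the lead-in above.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH, DIM, NUM_CLASSES, MASK_RATIO = 16, 256, 10, 0.5

class TinyViT(nn.Module):
    """Toy patch encoder standing in for a CLIP-style ViT image encoder."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Conv2d(3, DIM, kernel_size=PATCH, stride=PATCH)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(DIM, 3 * PATCH * PATCH)  # pixel reconstruction head
        self.head = nn.Linear(DIM, NUM_CLASSES)           # class prediction head

    def forward(self, images, mask=None):
        tokens = self.embed(images).flatten(2).transpose(1, 2)  # (B, N, DIM)
        if mask is not None:                                    # zero out masked patches
            tokens = tokens * (~mask).unsqueeze(-1)
        local = self.blocks(tokens)                             # pixel-level local features
        return local.mean(dim=1), local                         # class-level global feature, locals

def patchify(images):
    b, c, h, w = images.shape
    x = images.reshape(b, c, h // PATCH, PATCH, w // PATCH, PATCH)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(b, -1, c * PATCH * PATCH)

student = TinyViT()
teacher = copy.deepcopy(student)                 # EMA teacher produces pseudo-labels
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

def train_step(images, conf_threshold=0.7, ema=0.999):
    num_patches = (images.size(-1) // PATCH) ** 2
    mask = torch.rand(images.size(0), num_patches) < MASK_RATIO

    with torch.no_grad():                        # teacher sees the clean image
        t_global, _ = teacher(images)
        conf, pseudo = teacher.head(t_global).softmax(-1).max(-1)

    s_global, s_local = student(images, mask)    # student sees the masked image
    zero = images.new_zeros(())

    # 1) Class-level objective: cross-entropy on confident pseudo-labels only.
    keep = conf > conf_threshold
    ce = F.cross_entropy(student.head(s_global)[keep], pseudo[keep]) if keep.any() else zero

    # 2) Pixel-level objective: reconstruct masked patches from local features.
    mim = F.mse_loss(student.decoder(s_local)[mask], patchify(images)[mask]) if mask.any() else zero

    # 3) Global-local regularizer: pull pooled local features toward the global feature.
    align = 1 - F.cosine_similarity(s_local.mean(1), s_global.detach(), dim=-1).mean()

    loss = ce + mim + align
    opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                        # momentum update of the teacher
        for tp, sp in zip(teacher.parameters(), student.parameters()):
            tp.mul_(ema).add_(sp, alpha=1 - ema)
    return float(loss)

print(train_step(torch.randn(4, 3, 64, 64)))     # smoke test on random images
```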
Related papers
- Online Zero-Shot Classification with CLIP [9.099027915077698]
We study a novel online zero-shot transfer scenario, where each image arrives in a random order for classification and is visited only once to obtain its prediction.
Compared with vanilla zero-shot classification, the proposed framework preserves its flexibility for online service.
Our online zero-shot transfer method (OnZeta) achieves 78.94% accuracy on ImageNet without accessing the entire data set.
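For context on this setting, here is a minimal sketch of the vanilla zero-shot baseline applied to a stream of images that are each seen exactly once and classified immediately. It uses the openai/CLIP package with placeholder class names; it is the baseline being improved upon, not OnZeta's online adaptation itself.

```python
# Streaming zero-shot classification with CLIP; class names are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["dog", "cat", "car"]  # placeholder label set
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

def classify_stream(image_paths):
    """Yield a prediction for each image as it arrives; no image is revisited."""
    for path in image_paths:
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(image)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            pred = (img_feat @ text_feat.T).argmax(dim=-1).item()
        yield path, class_names[pred]
```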
arXiv Detail & Related papers (2024-08-23T18:12:12Z)
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
- ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation [20.57370550156505]
ReCLIP is a source-free domain adaptation method for vision-language models.
We demonstrate ReCLIP reduces the average error rate of CLIP from 30.17% to 25.06% on 22 image classification benchmarks.
arXiv Detail & Related papers (2023-08-04T18:11:40Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the data-hungry nature of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
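As a rough illustration of what predicting masked codebook assignments can look like in general (not MOCA's exact formulation; the codebook size, temperature, and masking scheme are assumptions), the sketch below has a teacher assign patch features to a prototype codebook while a student predicts those assignments for masked patches.

```python
# Generic masked codebook-assignment prediction; all hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def codebook_assignments(features, prototypes, temperature=0.1):
    """Soft assignments of patch features over L2-normalized prototypes."""
    logits = F.normalize(features, dim=-1) @ F.normalize(prototypes, dim=-1).T
    return (logits / temperature).softmax(dim=-1)

B, N, D, K = 8, 16, 256, 64                    # batch, patches, dim, codebook size
prototypes = torch.randn(K, D)
teacher_feats = torch.randn(B, N, D)           # teacher sees the full image
student_feats = torch.randn(B, N, D, requires_grad=True)  # student sees a masked view
mask = torch.rand(B, N) < 0.6                  # patches the student must predict

with torch.no_grad():
    targets = codebook_assignments(teacher_feats, prototypes)
preds = codebook_assignments(student_feats, prototypes)

# Cross-entropy between teacher and student assignments on masked patches only.
loss = -(targets[mask] * torch.log(preds[mask] + 1e-8)).sum(-1).mean()
loss.backward()
print(float(loss))
```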
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention [31.84299688413136]
Contrastive Language-Image Pre-training has been shown to learn visual representations with great transferability.
Existing works propose additional learnable modules on top of CLIP and fine-tune them on few-shot training sets.
We introduce a free-lunch enhancement method, CALIP, to boost CLIP's zero-shot performance via a parameter-free attention module.
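To make "parameter-free attention" concrete, the sketch below computes cross-modal attention directly from feature similarity, with no learnable projections, and blends the attended features back into the zero-shot scoring. This is a generic illustration in the spirit of the summary, not CALIP's exact formulation; the tensor shapes and blending weights are assumptions.

```python
# Parameter-free cross-modal attention between patch and text features.
import torch
import torch.nn.functional as F

def parameter_free_attention(visual_feats, text_feats, scale=None):
    """
    visual_feats: (N, D) L2-normalized patch features from the image encoder
    text_feats:   (K, D) L2-normalized class text features
    Returns attention-updated visual and text features; no learnable parameters.
    """
    d = visual_feats.size(-1)
    scale = scale or d ** 0.5
    sim = visual_feats @ text_feats.T / scale          # (N, K) similarity logits
    v2t = sim.softmax(dim=-1) @ text_feats             # patches attend to classes
    t2v = sim.softmax(dim=0).T @ visual_feats          # classes attend to patches
    return v2t, t2v

# Usage: blend the updated features with the originals and re-score zero-shot logits.
N, K, D = 49, 10, 512
v = F.normalize(torch.randn(N, D), dim=-1)
t = F.normalize(torch.randn(K, D), dim=-1)
v2t, t2v = parameter_free_attention(v, t)
logits = F.normalize(v.mean(0) + 0.5 * v2t.mean(0), dim=-1) @ F.normalize(t + 0.5 * t2v, dim=-1).T
print(logits.shape)  # per-class scores, shape (K,)
```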
arXiv Detail & Related papers (2022-09-28T15:22:11Z)
- Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning [88.5382122413913]
We study whether language supervision can result in vision models with more transferable representations than traditional image-only methods.
We find that image-only methods do not match CLIP's transfer performance, even when they are trained with more image data.
Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
arXiv Detail & Related papers (2022-07-15T17:50:51Z)
- Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm [109.0573737034428]
Large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks.
This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate the need for such large-scale image-text data.
We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently.
arXiv Detail & Related papers (2021-10-11T12:17:32Z)
- Self-Supervised Classification Network [3.8073142980733]
A self-supervised end-to-end classification neural network learns labels and representations simultaneously.
It is the first unsupervised end-to-end classification network to perform well on the large-scale ImageNet dataset.
arXiv Detail & Related papers (2021-03-19T19:29:42Z)
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z)
- SCAN: Learning to Classify Images without Labels [73.69513783788622]
We advocate a two-step approach where feature learning and clustering are decoupled.
A self-supervised task from representation learning is employed to obtain semantically meaningful features.
We obtain promising results on ImageNet, and outperform several semi-supervised learning methods in the low-data regime.
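As a rough illustration of this decoupled two-step recipe, the sketch below finds nearest neighbors in a (here random) frozen self-supervised feature space, then trains a clustering head so that neighbors receive consistent soft assignments, with an entropy term to keep clusters balanced. The encoder, neighbor count, and entropy weight are placeholders rather than the paper's configuration.

```python
# Decoupled feature learning + clustering: neighbor-consistency objective sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

def nearest_neighbors(features, k=5):
    """features: (N, D) L2-normalized self-supervised embeddings."""
    sim = features @ features.T
    sim.fill_diagonal_(-1.0)                       # exclude self
    return sim.topk(k, dim=-1).indices             # (N, k) neighbor indices

def cluster_loss(anchor_probs, neighbor_probs, entropy_weight=5.0):
    """Neighbors should share cluster assignments; mean assignment should stay balanced."""
    consistency = -torch.log((anchor_probs * neighbor_probs).sum(-1) + 1e-8).mean()
    mean_probs = anchor_probs.mean(0)
    neg_entropy = (mean_probs * torch.log(mean_probs + 1e-8)).sum()
    return consistency + entropy_weight * neg_entropy

# Toy usage with random frozen features and a linear clustering head.
N, D, C = 256, 128, 10
feats = F.normalize(torch.randn(N, D), dim=-1)
nbrs = nearest_neighbors(feats)
head = nn.Linear(D, C)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

anchor_probs = head(feats).softmax(-1)
neighbor_idx = nbrs[torch.arange(N), torch.randint(0, nbrs.size(1), (N,))]
neighbor_probs = head(feats[neighbor_idx]).softmax(-1)
loss = cluster_loss(anchor_probs, neighbor_probs)
loss.backward(); opt.step()
print(float(loss))
```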
arXiv Detail & Related papers (2020-05-25T18:12:33Z)