Masked Unsupervised Self-training for Zero-shot Image Classification
- URL: http://arxiv.org/abs/2206.02967v1
- Date: Tue, 7 Jun 2022 02:03:06 GMT
- Title: Masked Unsupervised Self-training for Zero-shot Image Classification
- Authors: Junnan Li, Silvio Savarese, Steven C.H. Hoi
- Abstract summary: Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
- Score: 98.23094305347709
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art computer vision models are mostly trained with supervised
learning using human-labeled images, which limits their scalability due to the
expensive annotation cost. While self-supervised representation learning has
achieved impressive progress, it still requires a second stage of finetuning on
labeled data. On the other hand, models pre-trained with large-scale text-image
supervision (e.g., CLIP) have enabled zero-shot transfer to downstream image
classification tasks. However, the zero-shot performance of CLIP-like models
is often insufficient for real-world adoption. In this paper, we aim to
leverage the abundant unlabeled data to improve the performance of a
pre-trained zero-shot classifier on downstream tasks. We propose Masked
Unsupervised Self-Training (MUST), a new approach which leverages two different
and complementary sources of supervision: pseudo-labels and raw images. MUST
jointly optimizes three objectives to learn both class-level global features and
pixel-level local features, and enforces a regularization between the two. We
demonstrate the efficacy of MUST on 8 downstream tasks across a variety of
domains, where it improves upon CLIP by a large margin and narrows the
performance gap between unsupervised and supervised classification. For
instance, MUST achieves a zero-shot top-1 accuracy of 77.7% on ImageNet using
ViT-B, +9.4% higher than CLIP. Our code is available at
https://github.com/salesforce/MUST.
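To make the recipe described in the abstract concrete, below is a minimal, self-contained PyTorch sketch. It is not the released salesforce/MUST implementation: the tiny encoder, the linear classification head (standing in for CLIP's text-embedding classifier), the masking ratio, and the unit loss weights are all illustrative assumptions. It shows the three joint objectives: confident pseudo-labels from an EMA teacher (class-level, global), masked-patch reconstruction (pixel-level, local), and a global-local alignment regularizer.

```python
# Minimal sketch of a MUST-style training step; assumptions noted in the lead-in above.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH, DIM, NUM_CLASSES, MASK_RATIO = 16, 256, 10, 0.5

class TinyViT(nn.Module):
    """Toy patch encoder standing in for a CLIP-style ViT image encoder."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Conv2d(3, DIM, kernel_size=PATCH, stride=PATCH)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(DIM, 3 * PATCH * PATCH)  # pixel reconstruction head
        self.head = nn.Linear(DIM, NUM_CLASSES)           # class prediction head

    def forward(self, images, mask=None):
        tokens = self.embed(images).flatten(2).transpose(1, 2)  # (B, N, DIM)
        if mask is not None:                                    # zero out masked patches
            tokens = tokens * (~mask).unsqueeze(-1)
        local = self.blocks(tokens)                             # pixel-level local features
        return local.mean(dim=1), local                         # class-level global feature, locals

def patchify(images):
    b, c, h, w = images.shape
    x = images.reshape(b, c, h // PATCH, PATCH, w // PATCH, PATCH)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(b, -1, c * PATCH * PATCH)

student = TinyViT()
teacher = copy.deepcopy(student)                 # EMA teacher produces pseudo-labels
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

def train_step(images, conf_threshold=0.7, ema=0.999):
    num_patches = (images.size(-1) // PATCH) ** 2
    mask = torch.rand(images.size(0), num_patches) < MASK_RATIO

    with torch.no_grad():                        # teacher sees the clean image
        t_global, _ = teacher(images)
        conf, pseudo = teacher.head(t_global).softmax(-1).max(-1)

    s_global, s_local = student(images, mask)    # student sees the masked image
    zero = images.new_zeros(())

    # 1) Class-level objective: cross-entropy on confident pseudo-labels only.
    keep = conf > conf_threshold
    ce = F.cross_entropy(student.head(s_global)[keep], pseudo[keep]) if keep.any() else zero

    # 2) Pixel-level objective: reconstruct masked patches from local features.
    mim = F.mse_loss(student.decoder(s_local)[mask], patchify(images)[mask]) if mask.any() else zero

    # 3) Global-local regularizer: pull pooled local features toward the global feature.
    align = 1 - F.cosine_similarity(s_local.mean(1), s_global.detach(), dim=-1).mean()

    loss = ce + mim + align
    opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                        # momentum update of the teacher
        for tp, sp in zip(teacher.parameters(), student.parameters()):
            tp.mul_(ema).add_(sp, alpha=1 - ema)
    return float(loss)

print(train_step(torch.randn(4, 3, 64, 64)))     # smoke test on random images
```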
Related papers
- Online Zero-Shot Classification with CLIP [9.099027915077698]
We study a novel online zero-shot transfer scenario, where each image arrives in a random order for classification and is visited only once to obtain its prediction.
Compared with vanilla zero-shot classification, the proposed framework preserves its flexibility for online service.
Our online zero-shot transfer method (OnZeta) achieves 78.94% accuracy on ImageNet without accessing the entire data set.
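For context on this setting, here is a minimal sketch of the vanilla zero-shot baseline applied to a stream of images that are each seen exactly once and classified immediately. It uses the openai/CLIP package with placeholder class names; it is the baseline being improved upon, not OnZeta's online adaptation itself.

```python
# Streaming zero-shot classification with CLIP; class names are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["dog", "cat", "car"]  # placeholder label set
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

def classify_stream(image_paths):
    """Yield a prediction for each image as it arrives; no image is revisited."""
    for path in image_paths:
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(image)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            pred = (img_feat @ text_feat.T).argmax(dim=-1).item()
        yield path, class_names[pred]
```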
arXiv Detail & Related papers (2024-08-23T18:12:12Z)
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
- ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation [20.57370550156505]
ReCLIP is a source-free domain adaptation method for vision-language models.
We demonstrate ReCLIP reduces the average error rate of CLIP from 30.17% to 25.06% on 22 image classification benchmarks.
arXiv Detail & Related papers (2023-08-04T18:11:40Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the data-hungry nature of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
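As a rough illustration of what predicting masked codebook assignments can look like in general (not MOCA's exact formulation; the codebook size, temperature, and masking scheme are assumptions), the sketch below has a teacher assign patch features to a prototype codebook while a student predicts those assignments for masked patches.

```python
# Generic masked codebook-assignment prediction; all hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def codebook_assignments(features, prototypes, temperature=0.1):
    """Soft assignments of patch features over L2-normalized prototypes."""
    logits = F.normalize(features, dim=-1) @ F.normalize(prototypes, dim=-1).T
    return (logits / temperature).softmax(dim=-1)

B, N, D, K = 8, 16, 256, 64                    # batch, patches, dim, codebook size
prototypes = torch.randn(K, D)
teacher_feats = torch.randn(B, N, D)           # teacher sees the full image
student_feats = torch.randn(B, N, D, requires_grad=True)  # student sees a masked view
mask = torch.rand(B, N) < 0.6                  # patches the student must predict

with torch.no_grad():
    targets = codebook_assignments(teacher_feats, prototypes)
preds = codebook_assignments(student_feats, prototypes)

# Cross-entropy between teacher and student assignments on masked patches only.
loss = -(targets[mask] * torch.log(preds[mask] + 1e-8)).sum(-1).mean()
loss.backward()
print(float(loss))
```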
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention [31.84299688413136]
Contrastive Language-Image Pre-training has been shown to learn visual representations with great transferability.
Existing works propose additional learnable modules on top of CLIP and fine-tune them on few-shot training sets.
We introduce a free-lunch enhancement method, CALIP, to boost CLIP's zero-shot performance via a parameter-free attention module.
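To make "parameter-free attention" concrete, the sketch below computes cross-modal attention directly from feature similarity, with no learnable projections, and blends the attended features back into the zero-shot scoring. This is a generic illustration in the spirit of the summary, not CALIP's exact formulation; the tensor shapes and blending weights are assumptions.

```python
# Parameter-free cross-modal attention between patch and text features.
import torch
import torch.nn.functional as F

def parameter_free_attention(visual_feats, text_feats, scale=None):
    """
    visual_feats: (N, D) L2-normalized patch features from the image encoder
    text_feats:   (K, D) L2-normalized class text features
    Returns attention-updated visual and text features; no learnable parameters.
    """
    d = visual_feats.size(-1)
    scale = scale or d ** 0.5
    sim = visual_feats @ text_feats.T / scale          # (N, K) similarity logits
    v2t = sim.softmax(dim=-1) @ text_feats             # patches attend to classes
    t2v = sim.softmax(dim=0).T @ visual_feats          # classes attend to patches
    return v2t, t2v

# Usage: blend the updated features with the originals and re-score zero-shot logits.
N, K, D = 49, 10, 512
v = F.normalize(torch.randn(N, D), dim=-1)
t = F.normalize(torch.randn(K, D), dim=-1)
v2t, t2v = parameter_free_attention(v, t)
logits = F.normalize(v.mean(0) + 0.5 * v2t.mean(0), dim=-1) @ F.normalize(t + 0.5 * t2v, dim=-1).T
print(logits.shape)  # per-class scores, shape (K,)
```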
arXiv Detail & Related papers (2022-09-28T15:22:11Z)
- Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning [88.5382122413913]
We study whether language supervision can result in vision models with more transferable representations than traditional image-only methods.
We find that image-only methods do not match CLIP's transfer performance, even when they are trained with more image data.
Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
arXiv Detail & Related papers (2022-07-15T17:50:51Z)
- Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm [109.0573737034428]
Large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks.
This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate the need for such large-scale image-text data.
We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently.
arXiv Detail & Related papers (2021-10-11T12:17:32Z)
- Self-Supervised Classification Network [3.8073142980733]
A self-supervised end-to-end classification neural network learns labels and representations simultaneously.
It is the first unsupervised end-to-end classification network to perform well on the large-scale ImageNet dataset.
arXiv Detail & Related papers (2021-03-19T19:29:42Z)
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z)
- SCAN: Learning to Classify Images without Labels [73.69513783788622]
We advocate a two-step approach where feature learning and clustering are decoupled.
A self-supervised task from representation learning is employed to obtain semantically meaningful features.
We obtain promising results on ImageNet, and outperform several semi-supervised learning methods in the low-data regime.
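As a rough illustration of this decoupled two-step recipe, the sketch below finds nearest neighbors in a (here random) frozen self-supervised feature space, then trains a clustering head so that neighbors receive consistent soft assignments, with an entropy term to keep clusters balanced. The encoder, neighbor count, and entropy weight are placeholders rather than the paper's configuration.

```python
# Decoupled feature learning + clustering: neighbor-consistency objective sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

def nearest_neighbors(features, k=5):
    """features: (N, D) L2-normalized self-supervised embeddings."""
    sim = features @ features.T
    sim.fill_diagonal_(-1.0)                       # exclude self
    return sim.topk(k, dim=-1).indices             # (N, k) neighbor indices

def cluster_loss(anchor_probs, neighbor_probs, entropy_weight=5.0):
    """Neighbors should share cluster assignments; mean assignment should stay balanced."""
    consistency = -torch.log((anchor_probs * neighbor_probs).sum(-1) + 1e-8).mean()
    mean_probs = anchor_probs.mean(0)
    neg_entropy = (mean_probs * torch.log(mean_probs + 1e-8)).sum()
    return consistency + entropy_weight * neg_entropy

# Toy usage with random frozen features and a linear clustering head.
N, D, C = 256, 128, 10
feats = F.normalize(torch.randn(N, D), dim=-1)
nbrs = nearest_neighbors(feats)
head = nn.Linear(D, C)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

anchor_probs = head(feats).softmax(-1)
neighbor_idx = nbrs[torch.arange(N), torch.randint(0, nbrs.size(1), (N,))]
neighbor_probs = head(feats[neighbor_idx]).softmax(-1)
loss = cluster_loss(anchor_probs, neighbor_probs)
loss.backward(); opt.step()
print(float(loss))
```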
arXiv Detail & Related papers (2020-05-25T18:12:33Z)