Intra-Modal Proxy Learning for Zero-Shot Visual Categorization with CLIP
- URL: http://arxiv.org/abs/2310.19752v1
- Date: Mon, 30 Oct 2023 17:22:02 GMT
- Title: Intra-Modal Proxy Learning for Zero-Shot Visual Categorization with CLIP
- Authors: Qi Qian, Yuanhong Xu, Juhua Hu
- Abstract summary: InMaP can obtain the vision proxy within one minute on a single GPU while improving the zero-shot accuracy from $77.02\%$ to $80.21\%$ on ImageNet with ViT-L/14@336 pre-trained by CLIP.
- Score: 15.48717971754816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language pre-training methods, e.g., CLIP, demonstrate an impressive
zero-shot performance on visual categorizations with the class proxy from the
text embedding of the class name. However, the modality gap between the text
and vision space can result in a sub-optimal performance. We theoretically show
that the gap cannot be reduced sufficiently by minimizing the contrastive loss
in CLIP and the optimal proxy for vision tasks may reside only in the vision
space. Therefore, given unlabeled target vision data, we propose to learn the
vision proxy directly with the help from the text proxy for zero-shot transfer.
Moreover, according to our theoretical analysis, strategies are developed to
further refine the pseudo label obtained by the text proxy to facilitate the
intra-modal proxy learning (InMaP) for vision. Experiments on extensive
downstream tasks confirm the effectiveness and efficiency of our proposal.
Concretely, InMaP can obtain the vision proxy within one minute on a single GPU
while improving the zero-shot accuracy from $77.02\%$ to $80.21\%$ on ImageNet
with ViT-L/14@336 pre-trained by CLIP. Code is available at
\url{https://github.com/idstcv/InMaP}.
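The following is a minimal sketch of the intra-modal proxy learning idea, assuming pre-extracted, L2-normalized CLIP features; the soft-assignment temperature and the single weighted-average step are illustrative and do not reproduce the paper's exact pseudo-label refinement.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def intra_modal_proxies(image_feats, text_proxies, temperature=0.01):
    """Estimate vision-space class proxies from unlabeled image features.

    image_feats:  (N, d) L2-normalized CLIP image embeddings (unlabeled target data)
    text_proxies: (C, d) L2-normalized text embeddings of the class names
    Returns (C, d) vision proxies, one per class.
    """
    # Soft pseudo-labels obtained from the text proxies.
    logits = image_feats @ text_proxies.t() / temperature   # (N, C)
    pseudo = logits.softmax(dim=-1)                          # (N, C)

    # Each vision proxy is a pseudo-label-weighted average of image features,
    # i.e., a convex combination of image embeddings.
    vision_proxies = pseudo.t() @ image_feats                # (C, d)
    return F.normalize(vision_proxies, dim=-1)

@torch.no_grad()
def zero_shot_predict(image_feats, proxies):
    # Nearest-proxy classification with either text or vision proxies.
    return (image_feats @ proxies.t()).argmax(dim=-1)
```

The key point of the sketch is that the resulting proxies are built entirely from image embeddings, so they reside in the vision space rather than the text space.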
Related papers
- Online Zero-Shot Classification with CLIP [9.099027915077698]
We study a novel online zero-shot transfer scenario, where images arrive in a random order for classification and each image is visited only once to obtain its prediction.
Compared with vanilla zero-shot classification, the proposed framework preserves flexibility for online service.
Our online zero-shot transfer method (OnZeta) achieves $78.94\%$ accuracy on ImageNet without accessing the entire data set.
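A toy sketch of an online zero-shot loop under the constraint described above (each feature is seen once and predicted before any update); the incremental proxy update and learning rate are assumptions for illustration, not OnZeta's actual algorithm.

```python
import torch
import torch.nn.functional as F

def online_zero_shot(stream, text_proxies, lr=0.05, temperature=0.01):
    """Predict each incoming image feature once, then nudge vision-space proxies.

    stream:       iterable of (d,) L2-normalized CLIP image features
    text_proxies: (C, d) L2-normalized text embeddings of the class names
    """
    proxies = text_proxies.clone()                  # start from the text proxies
    preds = []
    for x in stream:
        logits = proxies @ x / temperature          # (C,)
        preds.append(int(logits.argmax()))          # prediction before any update

        # Soft pseudo-label drives a small online update, pulling each proxy
        # toward the incoming feature in proportion to its assignment weight.
        w = logits.softmax(dim=0).unsqueeze(1)      # (C, 1)
        proxies = F.normalize(proxies + lr * w * x.unsqueeze(0), dim=-1)
    return preds
```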
arXiv Detail & Related papers (2024-08-23T18:12:12Z) - Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation [82.95830628372845]
This paper introduces a collaborative vision-text optimizing mechanism within the Open-Vocabulary Segmentation (OVS) field.
To the best of our knowledge, we are the first to establish the collaborative vision-text optimizing mechanism within the OVS field.
In open-vocabulary semantic segmentation, our method outperforms the previous state-of-the-art approaches by +0.5, +2.3, +3.4, +0.4, and +1.1 mIoU across five benchmarks.
arXiv Detail & Related papers (2024-08-01T17:48:08Z) - SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference [11.453253140479166]
We enhance the potential of contrastive language-image pretraining for semantic segmentation.
By rethinking self-attention, we find that CLIP can adapt to dense prediction tasks.
We replace the traditional self-attention block in the last layer of the CLIP vision encoder with our Correlative Self-Attention (CSA) module.
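A sketch of the correlative self-attention idea, where attention weights come from query-query and key-key similarities instead of the usual query-key similarity; the exact CSA module in SCLIP may differ in details such as multi-head handling and normalization.

```python
import torch

def correlative_self_attention(x, w_q, w_k, w_v):
    """Sketch of a correlative self-attention block over visual tokens.

    x: (N, d) token features; w_q, w_k, w_v: (d, d) projection matrices.
    Each token attends to tokens with similar queries or similar keys,
    which tends to preserve spatial locality for dense prediction.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scale = q.shape[-1] ** -0.5
    attn = (q @ q.t() * scale).softmax(dim=-1) + (k @ k.t() * scale).softmax(dim=-1)
    return attn @ v
```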
arXiv Detail & Related papers (2023-12-04T03:18:46Z) - Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
We study transferable representation learning underlying CLIP and demonstrate how features from different modalities are aligned.
Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2023-10-02T06:41:30Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
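A rough sketch of a multi-task objective in the spirit of xCLIP, pairing the standard CLIP contrastive loss with a stand-in non-contrastive term in which each modality predicts the other's soft assignment over shared concepts; the assignment heads and loss weight are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img, txt, tau=0.07):
    """Standard CLIP InfoNCE over a batch of paired, L2-normalized embeddings."""
    logits = img @ txt.t() / tau
    target = torch.arange(img.shape[0], device=img.device)
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target))

def non_contrastive_loss(img_assign, txt_assign):
    """Stand-in non-contrastive term: each modality predicts the other modality's
    soft assignment over a set of shared 'concepts'; no negatives are used."""
    loss_i = F.cross_entropy(img_assign, txt_assign.detach().softmax(dim=-1))
    loss_t = F.cross_entropy(txt_assign, img_assign.detach().softmax(dim=-1))
    return 0.5 * (loss_i + loss_t)

def xclip_style_loss(img, txt, img_assign, txt_assign, lam=1.0):
    # Multi-task combination of the contrastive and non-contrastive objectives.
    return clip_contrastive_loss(img, txt) + lam * non_contrastive_loss(img_assign, txt_assign)
```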
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
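A sketch of masked self-distillation on the vision branch, assuming an EMA teacher on full images and a student on masked views; the projection heads, temperatures, and how this term couples with the contrastive loss are assumptions rather than MaskCLIP's exact design.

```python
import copy
import torch
import torch.nn.functional as F

class MaskedSelfDistillation(torch.nn.Module):
    """The EMA teacher sees the full image, the student sees a masked view, and
    the student is trained to match the teacher's output distribution. In
    training this would be combined with the image-text contrastive loss."""

    def __init__(self, student_encoder, momentum=0.999, tau_s=0.1, tau_t=0.04):
        super().__init__()
        self.student = student_encoder
        self.teacher = copy.deepcopy(student_encoder)
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.m, self.tau_s, self.tau_t = momentum, tau_s, tau_t

    @torch.no_grad()
    def update_teacher(self):
        # Exponential moving average of the student weights.
        for ps, pt in zip(self.student.parameters(), self.teacher.parameters()):
            pt.mul_(self.m).add_(ps, alpha=1.0 - self.m)

    def forward(self, full_images, masked_images):
        with torch.no_grad():
            target = (self.teacher(full_images) / self.tau_t).softmax(dim=-1)
        student_logits = self.student(masked_images) / self.tau_s
        return F.cross_entropy(student_logits, target)
```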
arXiv Detail & Related papers (2022-08-25T17:59:58Z) - ProtoCLIP: Prototypical Contrastive Language Image Pretraining [12.067061175987075]
Prototypical Contrastive Language Image Pretraining (ProtoCLIP) is introduced to enhance the grouping effect of contrastive representation learning.
ProtoCLIP sets up prototype-level discrimination between the image and text spaces, which efficiently transfers higher-level structural knowledge.
ProtoCLIP is trained with an online episodic training strategy, which allows it to scale to unlimited amounts of data.
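A stand-in sketch of prototype-level discrimination between the image and text spaces; how ProtoCLIP actually constructs and updates its prototypes (e.g., the episodic clustering) is not reproduced here.

```python
import torch
import torch.nn.functional as F

def prototype_level_loss(img, txt, img_protos, txt_protos, tau=0.1):
    """Each image is scored against image-space prototypes and trained to match
    the prototype assignment of its paired text (and vice versa), transferring
    group-level structure across modalities.

    img, txt:               (B, d) L2-normalized paired embeddings
    img_protos, txt_protos: (K, d) L2-normalized prototypes (e.g., from clustering)
    """
    img_logits = img @ img_protos.t() / tau            # (B, K)
    txt_logits = txt @ txt_protos.t() / tau            # (B, K)
    loss_i = F.cross_entropy(img_logits, txt_logits.detach().softmax(dim=-1))
    loss_t = F.cross_entropy(txt_logits, img_logits.detach().softmax(dim=-1))
    return 0.5 * (loss_i + loss_t)
```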
arXiv Detail & Related papers (2022-06-22T11:55:53Z) - Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach that leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
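A rough sketch of combining the two supervision sources named above, pseudo-labels from zero-shot CLIP and raw pixels via masked-patch reconstruction; the confidence threshold, loss weighting, and reconstruction target are assumptions, not MUST's exact recipe.

```python
import torch
import torch.nn.functional as F

def must_style_losses(logits, clip_pseudo_probs, pred_patches, target_patches, mask,
                      conf_thresh=0.7, lam=1.0):
    """Two sources of supervision on unlabeled images: (1) confident CLIP
    pseudo-labels supervise the classifier head, (2) masked patches of the raw
    image supervise a reconstruction head.

    logits:            (B, C) classifier outputs on unlabeled images
    clip_pseudo_probs: (B, C) zero-shot CLIP probabilities for the same images
    pred_patches:      (B, P, D) predictions for masked patches
    target_patches:    (B, P, D) patch targets taken from the raw image
    mask:              (B, P) boolean mask of which patches were hidden
    """
    conf, pseudo = clip_pseudo_probs.max(dim=-1)
    keep = conf >= conf_thresh
    cls_loss = (F.cross_entropy(logits, pseudo, reduction="none") * keep).sum() / keep.sum().clamp(min=1)
    rec_err = F.mse_loss(pred_patches, target_patches, reduction="none").mean(-1)
    rec_loss = (rec_err * mask).sum() / mask.sum().clamp(min=1)
    return cls_loss + lam * rec_loss
```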
arXiv Detail & Related papers (2022-06-07T02:03:06Z) - CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment [102.17010696898113]
We show that CLIP can be a strong vision-language few-shot learner by leveraging the power of language.
We propose a parameter-efficient fine-tuning strategy to boost the few-shot performance on the VQA task.
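A minimal illustration of one parameter-efficient choice, training only bias terms while freezing all other weights; this BitFit-style setup is used for illustration and may differ from the paper's actual strategy.

```python
import torch

def enable_bias_only_finetuning(model: torch.nn.Module):
    """Freeze all weights and mark only bias parameters as trainable.

    Returns the list of trainable parameters to pass to an optimizer."""
    for name, param in model.named_parameters():
        param.requires_grad_(name.endswith("bias"))
    return [p for p in model.parameters() if p.requires_grad]
```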
arXiv Detail & Related papers (2022-03-14T15:29:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.