Online Zero-Shot Classification with CLIP
- URL: http://arxiv.org/abs/2408.13320v1
- Date: Fri, 23 Aug 2024 18:12:12 GMT
- Title: Online Zero-Shot Classification with CLIP
- Authors: Qi Qian, Juhua Hu
- Abstract summary: We study a novel online zero-shot transfer scenario, where each image arrives in a random order for classification and is visited only once to obtain a prediction.
Compared with the vanilla zero-shot classification, the proposed framework preserves its flexibility for online service.
Our online zero-shot transfer method (OnZeta) achieves $78.94\%$ accuracy on ImageNet without accessing the entire data set.
- Score: 9.099027915077698
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language pre-training such as CLIP enables zero-shot transfer that can classify images according to the candidate class names. While CLIP demonstrates an impressive zero-shot performance on diverse downstream tasks, the distribution of the target data has not been leveraged sufficiently. In this work, we study a novel online zero-shot transfer scenario, where each image arrives in a random order for classification and is visited only once to obtain a prediction immediately without storing its representation. Compared with the vanilla zero-shot classification, the proposed framework preserves its flexibility for online service while considering the statistics of the arrived images as the side information to capture the distribution of target data, which can help improve the performance of real-world applications. To tackle the challenge of effective online optimization, we first develop online label learning to model the target data distribution. Then, the proxy of each class in the vision space is further optimized with the proposed online proxy learning method to mitigate the modality gap between images and text. The convergence of both online strategies can be theoretically guaranteed. By combining the predicted label from the online label learning and proxy learning, our online zero-shot transfer method (OnZeta) achieves $78.94\%$ accuracy on ImageNet without accessing the entire data set. Moreover, extensive experiments on 13 other downstream tasks with different vision encoders show a more than $3\%$ improvement on average, which demonstrates the effectiveness of our proposal. Code is available at https://github.com/idstcv/OnZeta.
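The single-pass setting described in the abstract can be illustrated with a small sketch. This is not the OnZeta algorithm: the class name `StreamingZeroShot`, the `mix` weight, the temperature value, and the update rule (a running soft-label-weighted sum of image embeddings used as a vision-space proxy) are all assumptions chosen only to show the constraint that each image embedding is seen once, used for an immediate prediction, and then discarded.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    e = [math.exp(z - m) for z in logits]
    s = sum(e)
    return [a / s for a in e]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class StreamingZeroShot:
    """Hypothetical single-pass classifier; an illustration, not OnZeta."""

    def __init__(self, text_proxies, temperature=0.01, mix=0.5):
        # Text proxies: one embedding per candidate class name.
        self.text = [normalize(w) for w in text_proxies]
        # Per-class accumulators for vision-space proxies,
        # seeded with the text proxies (an assumption of this sketch).
        self.sums = [list(w) for w in self.text]
        self.tau = temperature
        self.mix = mix

    def step(self, image_embedding):
        """Predict for one arriving image, then update the running statistics."""
        x = normalize(image_embedding)
        # Vanilla zero-shot prediction from the text proxies.
        p_text = softmax([dot(w, x) / self.tau for w in self.text])
        # Prediction from the current vision-space proxies.
        p_vis = softmax([dot(normalize(s), x) / self.tau for s in self.sums])
        probs = [self.mix * a + (1.0 - self.mix) * b
                 for a, b in zip(p_text, p_vis)]
        # Fold x into each accumulator, weighted by its text-based soft label.
        # Only the accumulators persist; x itself is not stored.
        for s, weight in zip(self.sums, p_text):
            for i, xi in enumerate(x):
                s[i] += weight * xi
        pred = max(range(len(probs)), key=probs.__getitem__)
        return pred, probs


# Toy usage with three classes and hand-made 3-D "embeddings".
clf = StreamingZeroShot([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
for emb in ([1.0, 0.1, 0.0], [0.0, 0.1, 1.0], [0.1, 1.0, 0.0]):
    label, probabilities = clf.step(emb)
```

In practice the embeddings would come from a CLIP image encoder and text encoder; the memory footprint stays at one accumulator per class regardless of how many images arrive.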
Related papers
- Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting [55.361337202198925]
Vision-language models, such as CLIP, have shown impressive generalization capacities when using appropriate text descriptions.
We propose a label-Free prompt distribution learning and bias correction framework, dubbed as **Frolic**, which boosts zero-shot performance without the need for labeled data.
(2024-10-25T04:00:45Z)
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in how to create a vast amount of high-quality and diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
(2024-04-27T02:04:36Z)
- Transductive Zero-Shot and Few-Shot CLIP [24.592841797020203]
This paper addresses the transductive zero-shot and few-shot CLIP classification challenge.
Inference is performed jointly across a mini-batch of unlabeled query samples, rather than treating each instance independently.
Our approach yields a nearly 20% improvement in ImageNet accuracy over CLIP's zero-shot performance.
(2024-04-08T12:44:31Z)
- Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts.
Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples.
We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
(2023-11-07T07:27:16Z)
- Intra-Modal Proxy Learning for Zero-Shot Visual Categorization with CLIP [15.48717971754816]
InMaP can obtain the vision proxy within one minute on a single GPU while improving the zero-shot accuracy from $77.02\%$ to $80.21\%$ on ImageNet with ViT-L/14@336 pre-trained by CLIP.
(2023-10-30T17:22:02Z)
- Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
We study transferable representation learning underlying CLIP and demonstrate how features from different modalities get aligned.
Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
(2023-10-02T06:41:30Z)
- CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding [86.79903269137971]
Unsupervised visual grounding has been developed to locate regions using pseudo-labels.
We propose CLIP-VG, a novel method that can conduct self-paced curriculum adapting of CLIP with pseudo-language labels.
Our method outperforms the current state-of-the-art unsupervised method by a significant margin on RefCOCO/+/g datasets.
(2023-05-15T14:42:02Z)
- CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention [31.84299688413136]
Contrastive Language-Image Pre-training has been shown to learn visual representations with great transferability.
Existing works propose additional learnable modules upon CLIP and fine-tune them by few-shot training sets.
We introduce a free-lunch enhancement method, CALIP, to boost CLIP's zero-shot performance via a parameter-free Attention module.
(2022-09-28T15:22:11Z)
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach that leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
(2022-06-07T02:03:06Z)