Unsupervised Prompt Learning for Vision-Language Models
- URL: http://arxiv.org/abs/2204.03649v1
- Date: Thu, 7 Apr 2022 17:59:57 GMT
- Title: Unsupervised Prompt Learning for Vision-Language Models
- Authors: Tony Huang, Jack Chu, Fangyun Wei
- Abstract summary: We propose an unsupervised prompt learning (UPL) framework to improve the zero-shot transfer of CLIP-like vision-language models.
An enhanced version of UPL is even on par with the 8-shot CoOp and the 8-shot TIP-Adapter on most datasets.
- Score: 12.259694415428026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive vision-language models like CLIP have shown great progress in
zero-shot transfer learning. This new paradigm uses large-scale image-text
pairs for training and aligns images and texts in a common embedding space. In
the inference stage, the proper text description, known as prompt, needs to be
carefully designed for zero-shot transfer. To avoid laborious prompt
engineering and simultaneously improve transfer performance, recent works such
as CoOp, CLIP-Adapter and Tip-Adapter propose to adapt vision-language models
for downstream image recognition tasks by either optimizing the continuous
prompt representations or training an additional adapter network on top of the
pre-trained vision-language models on a small set of labeled data. Though
promising improvements are achieved, using labeled images from target datasets
may violate the intention of zero-shot transfer of pre-trained vision-language
models. In this paper, we propose an unsupervised prompt learning (UPL)
framework, which does not require any annotations of the target dataset, to
improve the zero-shot transfer of CLIP-like vision-language models.
Experimentally, for zero-shot transfer, our UPL outperforms the original CLIP with
prompt engineering on ImageNet as well as 10 other datasets. An enhanced
version of UPL is even on par with the 8-shot CoOp and the 8-shot TIP-Adapter
on most datasets while our method does not need any labeled images for
training. Code and models are available at
https://github.com/tonyhuang2022/UPL.
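As a concrete illustration of the idea described above, the sketch below shows how CLIP-style zero-shot predictions could serve as pseudo-labels for optimizing learnable prompt vectors without any target-dataset annotations. It is a minimal PyTorch-style sketch under assumed interfaces (an `encode_image`/`encode_text`/`logit_scale` API as in the open-source CLIP package, a generic `text_encoder` callable over prompt embeddings, and illustrative hyper-parameters), not the authors' released implementation.

```python
# Minimal sketch of unsupervised prompt learning with CLIP pseudo-labels.
# Not the authors' code: interfaces and hyper-parameters below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnablePrompt(nn.Module):
    """Continuous context vectors shared across classes and prepended to each
    class-name token embedding (CoOp-style prompts)."""

    def __init__(self, n_ctx, ctx_dim, class_token_embeddings):
        super().__init__()
        # [n_ctx, ctx_dim] learnable context, randomly initialized.
        self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, ctx_dim))
        # [n_cls, seq_len, ctx_dim] frozen embeddings of the class names.
        self.register_buffer("cls_emb", class_token_embeddings)

    def forward(self):
        n_cls = self.cls_emb.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        return torch.cat([ctx, self.cls_emb], dim=1)  # per-class prompt tokens


@torch.no_grad()
def pseudo_label(clip_model, images, hand_crafted_prompts):
    """Zero-shot CLIP predictions on unlabeled target images; the confidences can
    be used to keep only the most confident samples per class (omitted here)."""
    img = F.normalize(clip_model.encode_image(images), dim=-1)
    txt = F.normalize(clip_model.encode_text(hand_crafted_prompts), dim=-1)
    probs = (clip_model.logit_scale.exp() * img @ txt.t()).softmax(dim=-1)
    conf, labels = probs.max(dim=-1)
    return labels, conf


def train_step(prompt_module, text_encoder, clip_model, images, pseudo_labels, optimizer):
    """Cross-entropy on pseudo-labels; only the prompt vectors receive updates."""
    img = F.normalize(clip_model.encode_image(images), dim=-1).detach()
    txt = F.normalize(text_encoder(prompt_module()), dim=-1)
    logits = clip_model.logit_scale.exp() * img @ txt.t()
    loss = F.cross_entropy(logits, pseudo_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The sketch assumes the usual prompt-learning setup in which both CLIP encoders stay frozen and only `prompt_module.ctx` is passed to the optimizer.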
Related papers
- Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning [22.93684323791136]
Generative vision-language models (VLMs) have shown impressive performance in zero-shot vision-language tasks like image captioning and visual question answering.
We introduce Image-Conditioned Caption Correction (ICCC), a novel pre-training task designed to enhance VLMs' zero-shot performance without the need for labeled task data.
Experimental results on BLIP-2 and InstructBLIP demonstrate significant improvements in zero-shot image-text generation-based tasks through ICCC instruction tuning.
arXiv Detail & Related papers (2024-04-01T04:28:01Z) - Contrastive Vision-Language Alignment Makes Efficient Instruction
Learner [31.281236193979165]
We study the task of extending the large language model (LLM) into a vision-language instruction-following model.
Existing methods typically train a visual adapter to align the representation between a pre-trained vision transformer (ViT) and the LLM by a generative image captioning loss.
We propose CG-VLM that applies Contrastive and Generative alignment objectives to effectively align the representation of ViT and LLM.
arXiv Detail & Related papers (2023-11-29T03:29:46Z) - CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language-image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z) - Improving CLIP Training with Language Rewrites [57.935517901210225]
We introduce Language augmented CLIP (LaCLIP) to enhance CLIP training through language rewrites.
We show that LaCLIP significantly improves the transfer performance without computation or memory overhead during training.
Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M.
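The text-augmentation step this implies is simple enough to sketch: during training, each image is paired either with its original caption or with one of several pre-generated rewrites chosen at random. The dataset wrapper below is a hypothetical illustration of that sampling step only (names and fields are invented for the sketch), not the LaCLIP code; producing the rewrites themselves happens offline.

```python
# Hypothetical illustration of caption-rewrite sampling as text augmentation
# during CLIP training (the LaCLIP idea); not the authors' implementation.
import random
from torch.utils.data import Dataset


class RewriteAugmentedPairs(Dataset):
    def __init__(self, images, captions, rewrites, transform=None):
        # rewrites[i] is a list of pre-generated rewrites of captions[i].
        self.images, self.captions, self.rewrites = images, captions, rewrites
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        image = self.images[i]
        if self.transform is not None:
            image = self.transform(image)
        # Sample uniformly from the original caption plus its rewrites, so the
        # text side is augmented with no extra cost at training time.
        text = random.choice([self.captions[i]] + list(self.rewrites[i]))
        return image, text
```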
arXiv Detail & Related papers (2023-05-31T17:59:04Z) - SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained
Models [9.017387427570538]
Vision-language models such as CLIP are pretrained on large volumes of internet sourced image and text pairs.
Due to their size, fine-tuning these models on new datasets can be prohibitively expensive, both in terms of the supervision and compute required.
We present a new approach called SVL-Adapter that combines the complementary strengths of both vision-language pretraining and self-supervised representation learning.
arXiv Detail & Related papers (2022-10-07T19:35:08Z) - Is a Caption Worth a Thousand Images? A Controlled Study for
Representation Learning [88.5382122413913]
We study whether language supervision can result in vision models with more transferable representations than traditional image-only methods.
We find that image-only methods do not match CLIP's transfer performance, even when they are trained with more image data.
Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
arXiv Detail & Related papers (2022-07-15T17:50:51Z) - Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach that leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z) - Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of their powerful generalization ability.
arXiv Detail & Related papers (2022-05-26T03:13:43Z) - Learning to Prompt for Vision-Language Models [82.25005817904027]
Vision-language pre-training has emerged as a promising alternative for representation learning.
It shifts from the tradition of using images and discrete labels for learning a fixed set of weights, seen as visual concepts, to aligning images and raw text for two separate encoders.
Such a paradigm benefits from a broader source of supervision and allows zero-shot transfer to downstream tasks.
arXiv Detail & Related papers (2021-09-02T17:57:31Z) - Learning Transferable Visual Models From Natural Language Supervision [13.866297967166089]
Learning directly from raw text about images is a promising alternative.
We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn.
SOTA image representations are learned from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
arXiv Detail & Related papers (2021-02-26T19:04:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.