CPL: Counterfactual Prompt Learning for Vision and Language Models
- URL: http://arxiv.org/abs/2210.10362v1
- Date: Wed, 19 Oct 2022 08:06:39 GMT
- Title: CPL: Counterfactual Prompt Learning for Vision and Language Models
- Authors: Xuehai He, Diji Yang, Weixi Feng, Tsu-Jui Fu, Arjun Akula, Varun
Jampani, Pradyumna Narayana, Sugato Basu, William Yang Wang, Xin Eric Wang
- Abstract summary: This paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models.
CPL simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework.
Experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks.
- Score: 76.18024920393245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompt tuning is a new few-shot transfer learning technique that only tunes
the learnable prompt for pre-trained vision and language models such as CLIP.
However, existing prompt tuning methods tend to learn spurious or entangled
representations, which leads to poor generalization to unseen concepts. Towards
non-spurious and efficient prompt learning from limited examples, this paper
presents a novel \underline{\textbf{C}}ounterfactual
\underline{\textbf{P}}rompt \underline{\textbf{L}}earning (CPL) method for
vision and language models, which simultaneously employs counterfactual
generation and contrastive learning in a joint optimization framework.
In particular, CPL constructs counterfactuals by identifying the minimal non-spurious
feature change between semantically similar positive and negative samples that
causes a concept change, and learns more generalizable prompt representations from
both factual and counterfactual examples via contrastive learning. Extensive
experiments demonstrate that CPL obtains few-shot performance superior to that of
previous prompt tuning methods on CLIP across different vision and language tasks.
On image classification, we achieve a 3.55\% average relative improvement on
unseen classes across seven datasets; on image-text retrieval and visual question
answering, we gain up to 4.09\% and 25.08\% relative improvements, respectively,
across three few-shot scenarios on unseen test sets.
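The abstract describes CPL only at a high level. Below is a minimal, self-contained PyTorch sketch of the general recipe as we read it, not the authors' implementation: a small set of learnable prompt vectors is tuned against a frozen CLIP-style text encoder, a counterfactual image feature is formed by minimally blending in a semantically similar negative sample, and a contrastive loss pushes the prompt-conditioned text feature toward the factual feature and away from the counterfactual one. The encoder stand-in, the `counterfactual_mix` helper, and the blend weight `lam` are illustrative assumptions.

```python
# Hedged sketch of counterfactual prompt learning over frozen CLIP-style features.
# NOT the authors' code: the encoder is a stand-in and the mixing rule is one
# plausible reading of "minimal non-spurious feature change".
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, n_ctx = 512, 4          # feature size, number of learnable prompt tokens


class FrozenTextEncoder(torch.nn.Module):
    """Stand-in for the frozen CLIP text encoder; only the prompt is tuned."""

    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(n_ctx * dim, dim, bias=False)
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, ctx):              # ctx: (n_ctx, dim) learnable prompt
        return self.proj(ctx.flatten())  # -> (dim,) text/class embedding


text_enc = FrozenTextEncoder()
prompt_ctx = torch.nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # only tuned weights
optimizer = torch.optim.SGD([prompt_ctx], lr=1e-2)

# Frozen image features: a factual sample and a semantically similar negative.
img_pos = F.normalize(torch.randn(dim), dim=0)
img_neg = F.normalize(torch.randn(dim), dim=0)


def counterfactual_mix(pos, neg, lam=0.2):
    """Assumed minimal edit: blend in just enough of the negative to flip the concept."""
    return F.normalize((1.0 - lam) * pos + lam * neg, dim=0)


for step in range(100):
    txt = F.normalize(text_enc(prompt_ctx), dim=0)
    cf = counterfactual_mix(img_pos, img_neg)
    # Contrastive objective: text feature close to the factual image feature,
    # far from its counterfactual.
    logits = torch.stack([txt @ img_pos, txt @ cf]) / 0.07  # CLIP-like temperature
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In the actual CPL framework, counterfactual generation and prompt learning are jointly optimized; this toy loop only illustrates the contrastive side of that objective.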
Related papers
- In-context Prompt Learning for Test-time Vision Recognition with Frozen Vision-language Model [13.983810804606264]
We propose In-Context Prompt Learning (InCPL) for test-time visual recognition tasks.
InCPL associates a new test sample with very few labeled examples as context information.
We introduce a context-aware unsupervised loss to optimize visual prompts tailored to test samples.
arXiv Detail & Related papers (2024-03-10T08:15:51Z)
- DPL: Decoupled Prompt Learning for Vision-Language Models [41.90997623029582]
We propose a new method, Decoupled Prompt Learning, which reformulates the attention in prompt learning to alleviate this problem.
Our approach is flexible for both visual and textual modalities, making it easily extendable to multi-modal prompt learning.
arXiv Detail & Related papers (2023-08-19T15:48:38Z)
- Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on harmonic-mean over eleven classification benchmarks.
arXiv Detail & Related papers (2023-03-30T06:02:40Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models [67.19124099815645]
We propose a novel Language-Aware Soft Prompting (LASP) learning method to alleviate base class overfitting.
LASP is inherently amenable to including, during training, virtual classes, i.e. class names for which no visual samples are available.
LASP matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets.
arXiv Detail & Related papers (2022-10-03T17:56:35Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either visual or language branch.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
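For contrast with prompt tuning, here is a minimal sketch of a CLIP-Adapter-style feature adapter as described in the last entry: a small bottleneck MLP applied to frozen CLIP features, blended back with the original feature via a residual ratio. The dimensions and the ratio `alpha` are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of a feature adapter on top of frozen CLIP features.
# Only the adapter is trained; the pre-trained encoders stay frozen.
import torch
import torch.nn.functional as F


class FeatureAdapter(torch.nn.Module):
    def __init__(self, dim=512, reduction=4, alpha=0.2):
        super().__init__()
        self.alpha = alpha
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, dim // reduction),
            torch.nn.ReLU(inplace=True),
            torch.nn.Linear(dim // reduction, dim),
            torch.nn.ReLU(inplace=True),
        )

    def forward(self, feat):          # feat: frozen CLIP feature, shape (B, dim)
        adapted = self.net(feat)
        # Residual blend keeps most of the pre-trained knowledge.
        return F.normalize(self.alpha * adapted + (1 - self.alpha) * feat, dim=-1)


# Usage with stand-in features in place of frozen CLIP image embeddings.
adapter = FeatureAdapter()
image_feat = torch.randn(8, 512)
print(adapter(image_feat).shape)      # torch.Size([8, 512])
```

Because only the adapter's small bottleneck is tuned, the parameter budget is comparable to that of prompt tuning, which is what makes the two approaches directly comparable in few-shot settings.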
This list is automatically generated from the titles and abstracts of the papers on this site.