Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models
- URL: http://arxiv.org/abs/2303.17169v1
- Date: Thu, 30 Mar 2023 06:02:40 GMT
- Title: Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models
- Authors: Sifan Long, Zhen Zhao, Junkun Yuan, Zichang Tan, Jiangjiang Liu,
Luping Zhou, Shengsheng Wang, Jingdong Wang
- Abstract summary: We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on harmonic mean over eleven classification benchmarks.
- Score: 52.3032592038514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompt learning has become one of the most efficient paradigms for adapting
large pre-trained vision-language models to downstream tasks. Current
state-of-the-art methods, like CoOp and ProDA, tend to adopt soft prompts to
learn an appropriate prompt for each specific task. The recent CoCoOp further improves base-to-new generalization via an image-conditional prompt. However, it fuses identical image semantics into the prompts of different labels, which, as our experiments show, significantly weakens the discrimination among classes. Motivated by this observation, we first propose a class-aware text prompt (CTP) to enrich the generated prompts with label-related image information. Unlike CoCoOp, CTP effectively incorporates image semantics without introducing extra ambiguity across the prompts of different classes. In addition, instead of preserving the complete image representations, we propose text-guided feature tuning (TFT) to make the image branch attend to class-related representations. A contrastive loss is employed to align such
augmented text and image representations on downstream tasks. In this way, the
image-to-text CTP and text-to-image TFT can be mutually promoted to enhance the
adaptation of VLMs for downstream tasks. Extensive experiments demonstrate that
our method outperforms the existing methods by a significant margin.
In particular, compared to CoCoOp, we achieve an average improvement of 4.03% on new classes and 3.19% on harmonic mean over eleven classification benchmarks.
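
The abstract gives no implementation details, but the objective it describes (a contrastive loss that pulls each text-guided image feature toward the class-aware prompt feature of its own label) can be sketched roughly as follows. This is a minimal, hypothetical PyTorch illustration; the function name, tensor shapes, and temperature value are assumptions, not taken from the paper.

```python
# Minimal sketch (not the authors' code): contrastive alignment between
# class-conditioned text features and image-branch features.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_feats, text_feats, labels, temperature=0.07):
    """CLIP-style cross-entropy over image-to-class-prompt similarities.

    image_feats: [batch, dim]   -- image features (e.g. after text-guided tuning)
    text_feats:  [classes, dim] -- one prompt-derived feature per class
    labels:      [batch]        -- ground-truth class indices
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature  # [batch, classes]
    return F.cross_entropy(logits, labels)

# Toy usage with random features, only to show the shape contract.
img = torch.randn(8, 512)
txt = torch.randn(100, 512)
y = torch.randint(0, 100, (8,))
loss = contrastive_alignment_loss(img, txt, y)
```

In the actual method, image_feats would come from the text-guided image branch and text_feats from the class-aware prompts; the toy tensors above only demonstrate the interface of such a loss.
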
Related papers
- CoAPT: Context Attribute words for Prompt Tuning [5.811993982861212]
We propose a novel prompt tuning method called CoAPT for few/zero-shot image classification.
The core motivation is that attributes are descriptive words with rich information about a given concept.
CoAPT integrates words as additional prompts within learnable prompt tuning and can be easily incorporated into various existing prompt tuning methods.
arXiv Detail & Related papers (2024-07-18T08:58:01Z)
- IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning [94.52149969720712]
IntCoOp learns to jointly align attribute-level inductive biases and class embeddings during prompt-tuning.
IntCoOp improves CoOp by 7.35% in average performance across 10 diverse datasets.
arXiv Detail & Related papers (2024-06-19T16:37:31Z)
- DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations [79.433122872973]
Multi-label image recognition in the low-label regime is a task of great challenge and practical significance.
We leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs.
We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++).
arXiv Detail & Related papers (2023-08-03T17:33:20Z)
- Texts as Images in Prompt Tuning for Multi-Label Image Recognition [70.9310322461598]
We advocate that image-text contrastive learning makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting.
Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning.
Our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks.
arXiv Detail & Related papers (2022-11-23T07:00:11Z)
- CPL: Counterfactual Prompt Learning for Vision and Language Models [76.18024920393245]
This paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models.
CPL simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework.
Experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks.
arXiv Detail & Related papers (2022-10-19T08:06:39Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- Learning to Compose Diversified Prompts for Image Emotion Classification [5.586293129420233]
Contrastive Language-Image Pre-training (CLIP) represents the latest incarnation of pre-trained vision-language models.
CLIP has recently shown its superior power on a wide range of downstream vision-language tasks like Visual Question Answering.
We propose a general framework that shows how CLIP can be effectively applied to Image Emotion Classification.
arXiv Detail & Related papers (2022-01-26T14:31:55Z)
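
The works listed above (CoOp, CoCoOp, CoAPT, MaPLe, and the present paper) share the same "soft prompt" idea: a small set of learnable context vectors is prepended to frozen class-name token embeddings before they enter the text encoder. The following is a generic, hypothetical sketch of that mechanism, not the released code of any of these papers; the dimensions and the use of a single token per class name are assumptions.

```python
# Generic sketch of CoOp-style soft prompt tuning: only the shared context
# vectors are trainable; class-name embeddings and the text encoder stay frozen.
import torch
import torch.nn as nn

class SoftPromptContext(nn.Module):
    def __init__(self, num_ctx=16, embed_dim=512, num_classes=100):
        super().__init__()
        # Learnable context shared across classes; these are the only trained parameters.
        self.ctx = nn.Parameter(torch.randn(num_ctx, embed_dim) * 0.02)
        # Stand-in for frozen class-name token embeddings (one token per class here).
        self.register_buffer("class_embeds", torch.randn(num_classes, 1, embed_dim))

    def forward(self):
        # Broadcast the shared context to every class: [classes, num_ctx + 1, dim].
        ctx = self.ctx.unsqueeze(0).expand(self.class_embeds.size(0), -1, -1)
        return torch.cat([ctx, self.class_embeds], dim=1)

prompts = SoftPromptContext()()   # per-class prompt token sequences
print(prompts.shape)              # torch.Size([100, 17, 512])
```

In practice the resulting [classes, tokens, dim] tensor would be passed through a frozen CLIP text encoder to produce one text feature per class, and only self.ctx (plus any method-specific modules) would receive gradients during adaptation.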