Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models
- URL: http://arxiv.org/abs/2303.17169v1
- Date: Thu, 30 Mar 2023 06:02:40 GMT
- Title: Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models
- Authors: Sifan Long, Zhen Zhao, Junkun Yuan, Zichang Tan, Jiangjiang Liu,
Luping Zhou, Shengsheng Wang, Jingdong Wang
- Abstract summary: We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on the harmonic mean over eleven classification benchmarks.
- Score: 52.3032592038514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompt learning has become one of the most efficient paradigms for adapting
large pre-trained vision-language models to downstream tasks. Current
state-of-the-art methods, like CoOp and ProDA, tend to adopt soft prompts to
learn an appropriate prompt for each specific task. The recent CoCoOp further
boosts base-to-new generalization performance via an image-conditional prompt.
However, it directly fuses identical image semantics into the prompts of
different labels, which, as our experiments show, significantly weakens the
discrimination among different classes. Motivated by this observation, we first
propose a class-aware text prompt (CTP) to enrich generated prompts with
label-related image information. Unlike CoCoOp, CTP can effectively involve
image semantics and avoid introducing extra ambiguities into different prompts.
On the other hand, instead of retaining the complete image representations, we
propose text-guided feature tuning (TFT) to make the image branch attend to
class-related representations. A contrastive loss is employed to align such
augmented text and image representations on downstream tasks. In this way, the
image-to-text CTP and text-to-image TFT can be mutually promoted to enhance the
adaptation of VLMs for downstream tasks. Extensive experiments demonstrate that
our method outperforms the existing methods by a significant margin.
In particular, compared to CoCoOp, we achieve an average improvement of 4.03% on
new classes and 3.19% on the harmonic mean over eleven classification benchmarks.
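
The abstract describes three coupled pieces: a class-aware text prompt (CTP) that injects label-related image information into the prompts, text-guided feature tuning (TFT) that steers the image branch toward class-related features, and a contrastive loss that aligns the two augmented representations. The paper provides no code here, so the fragment below is only a minimal PyTorch sketch of how such a mutual image-to-text and text-to-image scheme could sit on top of frozen encoder features. The module name MutualPromptTuner, the linear projections, the sigmoid gate, and all dimensions are illustrative assumptions, not the authors' implementation; in particular, the real CTP presumably conditions the prompt tokens fed through the text encoder rather than adding offsets to already-encoded features.

```python
# Hypothetical sketch only: CTP-like image-to-text enrichment plus TFT-like
# text-to-image gating, aligned with a contrastive objective. Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MutualPromptTuner(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # Image-to-text path: project image features into per-class prompt offsets.
        self.img_to_text = nn.Linear(dim, dim)
        # Text-to-image path: derive a gate over image features from class text features.
        self.text_to_img = nn.Linear(dim, dim)
        # Learnable temperature, initialised near CLIP's log(100) ~ 4.6.
        self.logit_scale = nn.Parameter(torch.tensor(4.6))

    def forward(self, img_feat: torch.Tensor, class_text_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, dim) features from a frozen image encoder
        # class_text_feat: (C, dim) features from a frozen text encoder, one per class
        # Image-conditioned, class-aware text features (each class keeps its own prompt).
        img_bias = self.img_to_text(img_feat)                                # (B, dim)
        text_aug = class_text_feat.unsqueeze(0) + img_bias.unsqueeze(1)      # (B, C, dim)
        # Text-guided gating of the image features toward class-related dimensions.
        gate = torch.sigmoid(self.text_to_img(class_text_feat).mean(dim=0))  # (dim,)
        img_aug = img_feat * gate                                            # (B, dim)
        # Contrastive-style alignment: cosine similarities scaled by temperature.
        img_aug = F.normalize(img_aug, dim=-1)
        text_aug = F.normalize(text_aug, dim=-1)
        logits = self.logit_scale.exp() * torch.einsum("bd,bcd->bc", img_aug, text_aug)
        return logits                                                        # (B, C)


# Usage sketch: align augmented image/text features with cross-entropy over classes.
model = MutualPromptTuner(dim=512)
img_feat = torch.randn(8, 512)           # stand-in for frozen CLIP image features
class_text_feat = torch.randn(100, 512)  # stand-in for frozen CLIP text features
labels = torch.randint(0, 100, (8,))
loss = F.cross_entropy(model(img_feat, class_text_feat), labels)
```

The point the sketch tries to capture is the mutual flow: the image feature perturbs every class prompt individually, so classes remain distinguishable rather than receiving one shared image-conditional shift, while the class text features in turn gate the image representation before the contrastive alignment.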
Related papers
- Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP [22.33658954569737]
We build a mutual guidance mechanism that introduces an Image-Guided-Text (IGT) component and a Text-Guided-Image (TGI) component.
Extensive experiments show that TIMO significantly outperforms the state-of-the-art (SOTA) training-free method.
We propose an enhanced variant, TIMO-S, which even surpasses the best training-required methods by 0.33% at roughly 100 times lower time cost.
arXiv Detail & Related papers (2024-12-16T02:03:45Z)
- Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training [30.071860810401933]
This paper advances contrastive language-image pre-training (CLIP) into one novel holistic paradigm.
We use image-to-text captioning to generate multi-texts for each image, from multiple perspectives, granularities, and hierarchies.
Our holistic CLIP significantly outperforms the existing CLIP across image-text retrieval, open-vocabulary classification, and dense visual tasks.
arXiv Detail & Related papers (2024-11-30T11:27:58Z)
- IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning [94.52149969720712]
IntCoOp learns to jointly align attribute-level inductive biases and class embeddings during prompt-tuning.
IntCoOp improves CoOp by 7.35% in average performance across 10 diverse datasets.
arXiv Detail & Related papers (2024-06-19T16:37:31Z)
- DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations [79.433122872973]
Multi-label image recognition in the low-label regime is a challenging task of great practical significance.
We leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs.
We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++)
arXiv Detail & Related papers (2023-08-03T17:33:20Z)
- Texts as Images in Prompt Tuning for Multi-Label Image Recognition [70.9310322461598]
We advocate that image-text contrastive learning makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting.
Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning.
Our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks.
arXiv Detail & Related papers (2022-11-23T07:00:11Z)
- CPL: Counterfactual Prompt Learning for Vision and Language Models [76.18024920393245]
This paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models.
CPL simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework.
Experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks.
arXiv Detail & Related papers (2022-10-19T08:06:39Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- Learning to Compose Diversified Prompts for Image Emotion Classification [5.586293129420233]
Contrastive Language-Image Pre-training (CLIP) represents the latest incarnation of pre-trained vision-language models.
CLIP has recently shown strong performance on a wide range of downstream vision-language tasks such as Visual Question Answering.
We propose a general framework that shows how CLIP can be effectively applied to Image Emotion Classification.
arXiv Detail & Related papers (2022-01-26T14:31:55Z)