COMMA: Co-Articulated Multi-Modal Learning
- URL: http://arxiv.org/abs/2401.00268v1
- Date: Sat, 30 Dec 2023 15:47:36 GMT
- Title: COMMA: Co-Articulated Multi-Modal Learning
- Authors: Lianyu Hu, Liqing Gao, Zekang Liu, Chi-Man Pun, Wei Feng
- Abstract summary: We propose Co-Articulated Multi-Modal Learning (COMMA) to handle the limitations of previous methods.
Our method generates the prompts of each branch by considering the prompts of both branches, enhancing the representation alignment between them.
We evaluate our method across three representative tasks: generalization to novel classes, new target datasets, and unseen domain shifts.
- Score: 39.778958624066185
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained large-scale vision-language models such as CLIP have demonstrated
excellent generalizability over a series of downstream tasks. However, they are
sensitive to the variation of input text prompts and need a selection of prompt
templates to achieve satisfactory performance. Recently, various methods have
been proposed to dynamically learn the prompts as the textual inputs to avoid
the requirement of laborious hand-crafted prompt engineering in the fine-tuning
process. We notice that these methods are suboptimal in two aspects. First, the
prompts of the vision and language branches in these methods are usually
separated or uni-directionally correlated. Thus, the prompts of both branches
are not fully correlated and may not provide enough guidance to align the
representations of both branches. Second, it is observed that most previous
methods achieve better performance on seen classes but suffer performance
degradation on unseen classes compared to CLIP. This is because
the essential generic knowledge learned in the pretraining stage is partly
forgotten in the fine-tuning process. In this paper, we propose Co-Articulated
Multi-Modal Learning (COMMA) to handle the above limitations. Specifically, our
method generates the prompts of each branch by considering the prompts of both
branches, enhancing the representation alignment between them. Besides, to alleviate forgetting
about the essential knowledge, we minimize the feature discrepancy between the
learned prompts and the embeddings of hand-crafted prompts in the pre-trained
CLIP in the late transformer layers. We evaluate our method across three
representative tasks: generalization to novel classes, new target datasets,
and unseen domain shifts. Experimental results demonstrate the superiority of
our method by exhibiting a favorable performance boost on all tasks with high
efficiency.
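The abstract describes two mechanisms: generating each branch's prompts from the prompts of both branches, and regularizing the learned prompts toward the frozen CLIP features of hand-crafted prompts in the late transformer layers. Below is a minimal, hypothetical PyTorch-style sketch of these two ideas; it is not the authors' implementation, and all module names, dimensions, and the choice of a cosine-based discrepancy term are assumptions.

```python
# Minimal sketch of the two ideas in the COMMA abstract (NOT the authors'
# code). Module names, dimensions, and the cosine-based discrepancy term
# are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoArticulatedPromptGenerator(nn.Module):
    """Generate the next-layer prompts of each branch from the prompts of
    BOTH branches, so the vision and text prompts stay correlated."""

    def __init__(self, n_prompts: int = 4, txt_dim: int = 512, vis_dim: int = 768):
        super().__init__()
        joint_dim = txt_dim + vis_dim
        self.to_text = nn.Linear(joint_dim, txt_dim)    # hypothetical heads
        self.to_vision = nn.Linear(joint_dim, vis_dim)

    def forward(self, txt_prompts, vis_prompts):
        # txt_prompts: (n_prompts, txt_dim); vis_prompts: (n_prompts, vis_dim)
        joint = torch.cat([txt_prompts, vis_prompts], dim=-1)
        return self.to_text(joint), self.to_vision(joint)


def discrepancy_loss(learned_feats, handcrafted_feats):
    """Cosine discrepancy between features produced with learned prompts and
    frozen CLIP features of hand-crafted prompts (e.g. "a photo of a {class}"),
    applied in the late transformer layers to limit forgetting."""
    learned = F.normalize(learned_feats, dim=-1)
    anchor = F.normalize(handcrafted_feats, dim=-1).detach()   # frozen CLIP
    return (1.0 - (learned * anchor).sum(dim=-1)).mean()


if __name__ == "__main__":
    gen = CoArticulatedPromptGenerator()
    txt_p, vis_p = torch.randn(4, 512), torch.randn(4, 768)
    next_txt_p, next_vis_p = gen(txt_p, vis_p)          # correlated prompts

    lambda_reg = 1.0                                    # illustrative weight
    task_loss = torch.tensor(0.0)                       # placeholder CE loss
    reg = discrepancy_loss(torch.randn(8, 512), torch.randn(8, 512))
    total = task_loss + lambda_reg * reg
    print(next_txt_p.shape, next_vis_p.shape, float(total))
```

In a full pipeline the generator would be applied layer by layer inside the CLIP encoders and the regularizer added to the task loss; those integration details are not specified in the abstract.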
Related papers
- Instructing Prompt-to-Prompt Generation for Zero-Shot Learning [116.33775552866476]
We propose a Prompt-to-Prompt generation methodology (P2P) to distill instructive visual prompts for transferable knowledge discovery.
The core of P2P is to mine semantic-related instruction from prompt-conditioned visual features and text instruction on modal-sharing semantic concepts.
arXiv Detail & Related papers (2024-06-05T07:59:48Z) - DPL: Decoupled Prompt Learning for Vision-Language Models [41.90997623029582]
We propose a new method, Decoupled Prompt Learning, which reformulates the attention in prompt learning to alleviate this problem.
Our approach is flexible for both visual and textual modalities, making it easily extendable to multi-modal prompt learning.
arXiv Detail & Related papers (2023-08-19T15:48:38Z) - Self-regulating Prompts: Foundational Model Adaptation without Forgetting [112.66832145320434]
We introduce a self-regularization framework for prompting called PromptSRC.
PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations.
arXiv Detail & Related papers (2023-07-13T17:59:35Z) - Multi-Prompt with Depth Partitioned Cross-Modal Learning [25.239388488952375]
Partitioned Multi-modal Prompt (PMPO) is a multi-modal prompting technique that extends the soft prompt from a single learnable prompt to multiple prompts.
Our method divides the visual encoder depths and connects learnable prompts to the separated visual depths, enabling different prompts to capture hierarchical contextual depths.
We evaluate the effectiveness of our approach on three challenging tasks: new class generalization, cross-dataset evaluation, and domain generalization.
arXiv Detail & Related papers (2023-05-10T14:54:29Z) - Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on harmonic-mean over eleven classification benchmarks.
arXiv Detail & Related papers (2023-03-30T06:02:40Z) - CPL: Counterfactual Prompt Learning for Vision and Language Models [76.18024920393245]
This paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models.
CPL simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework.
Experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks.
arXiv Detail & Related papers (2022-10-19T08:06:39Z) - MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z) - Prompt Learning with Optimal Transport for Vision-Language Models [25.928455328563402]
We learn multiple comprehensive prompts to describe diverse characteristics of categories such as intrinsic attributes or extrinsic contexts.
We propose to apply optimal transport to match the vision and text modalities.
In the inner loop, we optimize the optimal transport distance with the Sinkhorn algorithm to align visual features and prompts, while in the outer loop, we learn the prompts from the supervised data using this distance (a minimal sketch of this inner/outer scheme appears after this list).
arXiv Detail & Related papers (2022-10-03T22:21:07Z)
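The inner/outer structure described in the optimal-transport entry above can be illustrated with a short, self-contained sketch: a Sinkhorn fixed-point iteration computes a transport plan between a set of visual features and a set of prompt features, and the resulting OT distance is differentiable with respect to the prompts, so an outer loop can update them. This is a generic entropic-OT sketch under assumed shapes and hyperparameters, not the paper's implementation.

```python
# Generic entropic optimal-transport sketch for prompt/feature matching
# (illustrative only; shapes, eps, and iteration count are assumptions).
import torch
import torch.nn.functional as F


def sinkhorn(cost: torch.Tensor, eps: float = 0.1, n_iters: int = 50) -> torch.Tensor:
    """Entropic-regularized OT with uniform marginals.
    cost: (M, N) pairwise cost between M visual features and N prompts."""
    M, N = cost.shape
    mu = torch.full((M,), 1.0 / M)
    nu = torch.full((N,), 1.0 / N)
    K = torch.exp(-cost / eps)              # Gibbs kernel
    u = torch.ones(M)
    for _ in range(n_iters):                # inner loop: fixed-point updates
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    return torch.diag(u) @ K @ torch.diag(v)   # transport plan T


# Outer-loop sketch: the OT distance <T, cost> is differentiable w.r.t.
# the prompt features, so gradients from it can update the prompts.
vis_feats = torch.randn(49, 512)                         # e.g. patch features
prompt_feats = torch.randn(4, 512, requires_grad=True)   # learnable prompts
cost = 1.0 - (F.normalize(vis_feats, dim=-1)
              @ F.normalize(prompt_feats, dim=-1).t())
with torch.no_grad():                        # plan treated as a constant
    plan = sinkhorn(cost)
ot_distance = (plan * cost).sum()            # used as the matching loss
ot_distance.backward()                       # gradients flow to prompt_feats
```

Treating the plan as a constant while backpropagating through the cost mirrors the two-level scheme sketched in the entry: the inner loop solves the transport problem, the outer loop learns the prompts from supervision.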