Deeply Coupled Cross-Modal Prompt Learning
- URL: http://arxiv.org/abs/2305.17903v3
- Date: Wed, 6 Dec 2023 15:52:03 GMT
- Title: Deeply Coupled Cross-Modal Prompt Learning
- Authors: Xuejing Liu, Wei Tang, Jinghui Lu, Rui Zhao, Zhaojun Guo and Fei Tan
- Abstract summary: We propose a Deeply coupled Cross-modal Prompt learning (DCP) method based on CLIP.
DCP flexibly accommodates the interplay between vision and language with a Cross-Modal Prompt Attention (CMPA) mechanism.
We then conduct comprehensive few-shot learning experiments on 11 image classification datasets and analyze the robustness to domain shift as well.
- Score: 25.813769028565567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in multimodal foundation models (e.g., CLIP) have excelled in zero-shot generalization. Prompt tuning, which facilitates knowledge transfer from foundation models to downstream tasks, has recently gained significant attention. Existing prompt-tuning methods for cross-modal learning, however, either focus solely on the language branch or model vision-language interaction only through a shallow mechanism. In this context, we propose a Deeply coupled Cross-modal Prompt learning (DCP) method based on CLIP. DCP flexibly accommodates the interplay between vision and language with a Cross-Modal Prompt Attention (CMPA) mechanism, which progressively and strongly couples the two branches by exchanging their prompt representations through a well-connected multi-head attention module. We then conduct comprehensive few-shot learning experiments on 11 image classification datasets and analyze the robustness to domain shift as well. Thorough experimental analysis clearly demonstrates the superb few-shot generalization and compelling domain adaptation capacity of a well-executed DCP. The code can be found at https://github.com/GingL/CMPA.
Related papers
- CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training [17.27516384073838]
We propose CMAL, a Cross-Modal Associative Learning framework with anchor point detection and cross-modal associative learning.
CMAL achieves competitive performance against previous CMCL-based methods on four common downstream vision-and-language tasks.
arXiv Detail & Related papers (2024-10-16T14:12:26Z) - CP-Prompt: Composition-Based Cross-modal Prompting for Domain-Incremental Continual Learning [15.393734346359064]
The key challenge of cross-modal domain-incremental learning (DIL) is to enable the model to continuously learn from novel data.
We propose a simple yet effective framework, CP-Prompt, by training limited parameters to instruct a pre-trained model to learn new domains.
arXiv Detail & Related papers (2024-07-22T04:07:12Z) - Concept-Guided Prompt Learning for Generalization in Vision-Language
Models [33.361744437967126]
We propose Concept-Guided Prompt Learning for vision-language models.
We leverage the well-learned knowledge of Contrastive Language-Image Pretraining to create a visual concept cache.
In order to refine the text features, we develop a projector that transforms multi-level visual features into text features.
arXiv Detail & Related papers (2024-01-15T04:04:47Z) - APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z) - DPL: Decoupled Prompt Learning for Vision-Language Models [41.90997623029582]
We propose a new method, Decoupled Prompt Learning, which reformulates the attention in prompt learning to alleviate this problem.
Our approach is flexible for both visual and textual modalities, making it easily extendable to multi-modal prompt learning.
arXiv Detail & Related papers (2023-08-19T15:48:38Z) - Continual Vision-Language Representation Learning with Off-Diagonal
Information [112.39419069447902]
Multi-modal contrastive learning frameworks like CLIP typically require a large number of image-text samples for training.
This paper discusses the feasibility of continual CLIP training using streaming data.
arXiv Detail & Related papers (2023-05-11T08:04:46Z) - MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z) - Learning Visual Representation from Modality-Shared Contrastive
Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
Under the studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variants that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z) - mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal
Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z) - WenLan: Bridging Vision and Language by Large-Scale Multi-Modal
Pre-Training [71.37731379031487]
We propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework.
Unlike OpenAI's CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting MoCo to the cross-modal scenario.
By building a large queue-based dictionary, BriVL can incorporate more negative samples under limited GPU resources (a minimal sketch follows this list).
arXiv Detail & Related papers (2021-03-11T09:39:49Z)
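The queue-based dictionary that BriVL borrows from MoCo is the most concrete mechanism in the summaries above. A minimal sketch of such a queue of cross-modal negatives, under assumed embedding sizes, queue length, and temperature (none of which come from the paper), might look like this:

```python
# Minimal sketch of a MoCo-style queue of negatives for cross-modal
# contrastive learning, as described for BriVL above. Shapes, queue size,
# and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F


class NegativeQueue:
    def __init__(self, dim: int = 256, size: int = 16384):
        self.queue = F.normalize(torch.randn(size, dim), dim=1)  # stored keys
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys: torch.Tensor):
        # Replace the oldest entries with the newest keys
        # (in MoCo these come from a momentum encoder).
        n = keys.shape[0]
        idx = (self.ptr + torch.arange(n)) % self.queue.shape[0]
        self.queue[idx] = keys
        self.ptr = int((self.ptr + n) % self.queue.shape[0])


def image_to_text_loss(img_emb, txt_emb, queue: NegativeQueue, tau: float = 0.07):
    """InfoNCE loss: in-batch positives, queued text embeddings as extra negatives."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    pos = (img_emb * txt_emb).sum(dim=1, keepdim=True)       # (B, 1)
    neg = img_emb @ queue.queue.t()                          # (B, K)
    logits = torch.cat([pos, neg], dim=1) / tau
    labels = torch.zeros(logits.shape[0], dtype=torch.long)  # positive is index 0
    loss = F.cross_entropy(logits, labels)
    queue.enqueue(txt_emb.detach())                          # update after use
    return loss


# Example usage with random embeddings.
loss = image_to_text_loss(torch.randn(8, 256), torch.randn(8, 256), NegativeQueue())
```

The point of the queue is that the number of negatives is decoupled from the batch size, which is what allows many negative samples under limited GPU memory; in the full MoCo recipe the enqueued keys come from a slowly updated momentum encoder, which keeps the dictionary consistent as training progresses.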