Deeply Coupled Cross-Modal Prompt Learning
- URL: http://arxiv.org/abs/2305.17903v3
- Date: Wed, 6 Dec 2023 15:52:03 GMT
- Title: Deeply Coupled Cross-Modal Prompt Learning
- Authors: Xuejing Liu, Wei Tang, Jinghui Lu, Rui Zhao, Zhaojun Guo and Fei Tan
- Abstract summary: We propose a Deeply coupled Cross-modal Prompt learning (DCP) method based on CLIP.
DCP flexibly accommodates the interplay between vision and language with a Cross-Modal Prompt Attention (CMPA) mechanism.
We then conduct comprehensive few-shot learning experiments on 11 image classification datasets and analyze the robustness to domain shift as well.
- Score: 25.813769028565567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in multimodal foundation models (e.g., CLIP) have
excelled in zero-shot generalization. Prompt tuning involved in the knowledge
transfer from foundation models to downstream tasks has gained significant
attention recently. Existing prompt-tuning methods in cross-modal learning,
however, either focus solely on the language branch or learn vision-language
interaction through a shallow mechanism. In this context, we propose a Deeply
coupled Cross-modal Prompt learning (DCP) method based on CLIP. DCP flexibly
accommodates the interplay between vision and language with a Cross-Modal
Prompt Attention (CMPA) mechanism, which enables a progressive and strong
mutual exchange of the respective representations through a well-connected
multi-head attention module. We then conduct comprehensive few-shot learning
experiments on 11 image classification datasets and analyze the robustness to
domain shift as well. Thorough experimental analysis demonstrates the
superb few-shot generalization and compelling domain adaptation capacity of a
well-executed DCP. The code can be found at https://github.com/GingL/CMPA.
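
The mutual exchange described in the abstract can be illustrated with a rough sketch. The abstract does not give the details of CMPA, so everything below is an assumption made for illustration: text prompts act as queries over vision prompts and vice versa, both prompt sets share one embedding width, there are no learned projections, and each update is a plain residual addition. The function names (`attend`, `cross_modal_prompt_update`) are hypothetical and do not come from the authors' repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory, num_heads):
    """Multi-head scaled dot-product attention with no learned projections.

    Assumption: queries and memory are lists of vectors of the same width,
    and each head simply works on its own contiguous slice of that width.
    """
    dim = len(queries[0])
    if dim % num_heads != 0:
        raise ValueError("embedding width must be divisible by num_heads")
    head_dim = dim // num_heads
    scale = math.sqrt(head_dim)
    outputs = []
    for q in queries:
        row = []
        for h in range(num_heads):
            lo, hi = h * head_dim, (h + 1) * head_dim
            # Similarity of this query slice to every memory slice.
            scores = [
                sum(a * b for a, b in zip(q[lo:hi], m[lo:hi])) / scale
                for m in memory
            ]
            weights = softmax(scores)
            # Weighted average of the memory slices for this head.
            row.extend(
                sum(w * m[d] for w, m in zip(weights, memory))
                for d in range(lo, hi)
            )
        outputs.append(row)
    return outputs


def cross_modal_prompt_update(text_prompts, vision_prompts, num_heads=2):
    """One hypothetical exchange step: each modality's prompts attend to the
    other modality's prompts and add the result back as a residual."""
    text_ctx = attend(text_prompts, vision_prompts, num_heads)
    vision_ctx = attend(vision_prompts, text_prompts, num_heads)
    new_text = [
        [p + c for p, c in zip(prompt, ctx)]
        for prompt, ctx in zip(text_prompts, text_ctx)
    ]
    new_vision = [
        [p + c for p, c in zip(prompt, ctx)]
        for prompt, ctx in zip(vision_prompts, vision_ctx)
    ]
    return new_text, new_vision


# Toy usage: two text prompts and one vision prompt, width 4, two heads.
text = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
vision = [[0.5, 0.5, 0.5, 0.5]]
updated_text, updated_vision = cross_modal_prompt_update(text, vision)
print(updated_text)
print(updated_vision)
```

Applying such a step at several successive encoder layers is one way to read "progressively" in the abstract. A real CLIP-based implementation would also need learned projections, since CLIP's text and image encoders generally use different hidden widths.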
 
      
Related papers
- Open-Vocabulary HOI Detection with Interaction-aware Prompt and Concept Calibration [42.24582981160835]
 Open Human-Object Interaction (HOI) detection aims to detect interactions between humans and objects.
Current methods often rely on Vision and Language Models (VLMs) but face challenges due to suboptimal image encoders.
We propose INteraction-aware Prompting with Concept (INP-CC), an end-to-end open-vocabulary HOI detector.
 arXiv  Detail & Related papers  (2025-08-05T08:33:58Z)
- Multi-modal Mutual-Guidance Conditional Prompt Learning for Vision-Language Models [21.20658517302458]
 MuGCP (Multi-modal Mutual-Guidance Conditional Prompt Learning) is a novel paradigm designed for conditional prompt generation.
AMG module generates Visual Conditional Prompts (VCP), enhancing the model's performance in multi-modal tasks.
MPF mechanism integrates SCP and VCP with contextual prompts, ensuring seamless coordination.
 arXiv  Detail & Related papers  (2025-07-11T08:45:27Z)
- ChordPrompt: Orchestrating Cross-Modal Prompt Synergy for Multi-Domain Incremental Learning in CLIP [12.031278034659872]
 Continual learning empowers pre-trained vision-language models to adapt effectively to novel or previously underrepresented data distributions.
ChordPrompt introduces cross-modal prompts to leverage interactions between visual and textual information.
ChordPrompt outperforms state-of-the-art methods in zero-shot generalization and downstream task performance.
 arXiv  Detail & Related papers  (2025-06-24T13:22:06Z)
- CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training [17.27516384073838]
 We propose CMAL, a Cross-Modal Associative Learning framework with anchor points detection and cross-modal associative learning.
CMAL achieves competitive performance against previous CMCL-based methods on four common downstream vision-and-language tasks.
 arXiv  Detail & Related papers  (2024-10-16T14:12:26Z)
- CP-Prompt: Composition-Based Cross-modal Prompting for Domain-Incremental Continual Learning [15.393734346359064]
 The key challenge of cross-modal domain-incremental learning (DIL) is to enable the learning model to continuously learn from novel data.
We propose a simple yet effective framework, CP-Prompt, by training limited parameters to instruct a pre-trained model to learn new domains.
 arXiv  Detail & Related papers  (2024-07-22T04:07:12Z)
- Concept-Guided Prompt Learning for Generalization in Vision-Language Models [33.361744437967126]
 We propose Concept-Guided Prompt Learning for vision-language models.
We leverage the well-learned knowledge of Contrastive Language-Image Pretraining to create a visual concept cache.
In order to refine the text features, we develop a projector that transforms multi-level visual features into text features.
 arXiv  Detail & Related papers  (2024-01-15T04:04:47Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
 We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
 arXiv  Detail & Related papers  (2023-12-04T01:42:09Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
 We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
 arXiv  Detail & Related papers  (2023-09-22T06:55:41Z)
- DPL: Decoupled Prompt Learning for Vision-Language Models [41.90997623029582]
 We propose a new method, Decoupled Prompt Learning, which reformulates the attention in prompt learning to alleviate this problem.
Our approach is flexible for both visual and textual modalities, making it easily extendable to multi-modal prompt learning.
 arXiv  Detail & Related papers  (2023-08-19T15:48:38Z)
- Continual Vision-Language Representation Learning with Off-Diagonal Information [112.39419069447902]
 Multi-modal contrastive learning frameworks like CLIP typically require a large amount of image-text samples for training.
This paper discusses the feasibility of continual CLIP training using streaming data.
 arXiv  Detail & Related papers  (2023-05-11T08:04:46Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
 We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
 arXiv  Detail & Related papers  (2022-10-06T17:59:56Z)
- Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training [88.80694147730883]
 We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
 arXiv  Detail & Related papers  (2022-07-26T05:19:16Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
 mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
 arXiv  Detail & Related papers  (2022-05-24T11:52:06Z)
- WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training [71.37731379031487]
 We propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework.
Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo into the cross-modal scenario.
By building a large queue-based dictionary, our BriVL can incorporate more negative samples in limited GPU resources.
 arXiv  Detail & Related papers  (2021-03-11T09:39:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     