CoPL: Contextual Prompt Learning for Vision-Language Understanding
- URL: http://arxiv.org/abs/2307.00910v2
- Date: Tue, 12 Dec 2023 05:34:16 GMT
- Title: CoPL: Contextual Prompt Learning for Vision-Language Understanding
- Authors: Koustava Goswami, Srikrishna Karanam, Prateksha Udhayanan, K J Joseph
and Balaji Vasan Srinivasan
- Abstract summary: We propose a Contextual Prompt Learning (CoPL) framework, capable of aligning the prompts to the localized features of the image.
Our key innovations over earlier works include using local image features as part of the prompt learning process, and more crucially, learning to weight these prompts based on local features that are appropriate for the task at hand.
Our method produces substantially improved performance when compared to the current state of the art methods.
- Score: 21.709017504227823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in multimodal learning have resulted in powerful
vision-language models whose representations generalize across a variety of
downstream tasks. More recently, their generalization ability has been further
extended by incorporating trainable prompts, borrowed from the natural
language processing literature. While such prompt learning techniques have
shown impressive results, we identify that these prompts are trained on
global image features, which limits them in two ways: First, by using global
features, these prompts could focus less on the discriminative foreground of
the image, resulting in poor generalization to various out-of-distribution
test cases. Second, existing work weights all prompts equally, whereas
intuitively prompts should be reweighted according to the semantics of the
image. We address these issues with our proposed Contextual Prompt Learning
(CoPL) framework, which aligns the prompts to the localized features of the
image. Our key innovations over earlier works include using local image
features as part of the prompt learning process and, more crucially, learning
to weight these prompts based on local features that are appropriate for the
task at hand. This gives us dynamic prompts that are both aligned to local
image features and aware of local contextual relationships. Our extensive
experiments on a variety of standard and few-shot datasets show that our
method produces substantially improved performance compared to the current
state-of-the-art methods. We also demonstrate both few-shot and
out-of-distribution performance to establish the utility of learning dynamic
prompts that are aligned to local image features.
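The abstract's two key ideas (conditioning prompts on local, patch-level image features and learning per-prompt weights from those features) can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration rather than the authors' implementation: the module name ContextualPromptWeighting, the tensor shapes, and the mean-pooled similarity used to compute the weights are all assumptions.

```python
# Minimal sketch of contextual prompt weighting as described in the abstract.
# Assumptions: patch-level features come from a frozen image encoder (e.g. a
# CLIP ViT), prompts live in the same embedding dimension, and per-prompt
# weights are derived from prompt/patch similarity. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualPromptWeighting(nn.Module):
    def __init__(self, n_prompts: int = 16, dim: int = 512):
        super().__init__()
        # Learnable context prompts shared across all images (CoOp-style).
        self.prompts = nn.Parameter(0.02 * torch.randn(n_prompts, dim))
        # Projects local image features into the prompt embedding space.
        self.local_proj = nn.Linear(dim, dim)

    def forward(self, local_feats: torch.Tensor) -> torch.Tensor:
        # local_feats: (B, P, D) patch-level features for a batch of images.
        keys = self.local_proj(local_feats)                             # (B, P, D)
        scale = keys.size(-1) ** 0.5
        # Similarity of every prompt to every local patch feature.
        sim = torch.einsum("nd,bpd->bnp", self.prompts, keys) / scale   # (B, N, P)
        # One weight per prompt: how strongly the image's local content supports it.
        weights = F.softmax(sim.mean(dim=-1), dim=-1)                   # (B, N)
        # Dynamic, image-conditioned prompts: re-weighted shared prompts.
        return weights.unsqueeze(-1) * self.prompts.unsqueeze(0)        # (B, N, D)

# Shape check: 8 images, 49 patches of 512-d features -> 8 sets of 16 prompts.
# dynamic = ContextualPromptWeighting()(torch.randn(8, 49, 512))  # (8, 16, 512)
```

In a full pipeline the resulting dynamic prompts would be combined with class-name tokens before the text encoder; the sketch only covers the weighting step that conditions the prompts on local image content.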
Related papers
- Instructing Prompt-to-Prompt Generation for Zero-Shot Learning [116.33775552866476]
We propose a Prompt-to-Prompt generation methodology (P2P) to distill instructive visual prompts for transferable knowledge discovery.
The core of P2P is to mine semantic-related instruction from prompt-conditioned visual features and text instruction on modal-sharing semantic concepts.
arXiv Detail & Related papers (2024-06-05T07:59:48Z)
- ProGEO: Generating Prompts through Image-Text Contrastive Learning for Visual Geo-localization [0.0]
We propose a two-stage training method to enhance visual performance and use contrastive learning to mine challenging samples.
We validate the effectiveness of the proposed strategy on several large-scale visual geo-localization datasets.
arXiv Detail & Related papers (2024-06-04T02:28:51Z)
- mTREE: Multi-Level Text-Guided Representation End-to-End Learning for Whole Slide Image Analysis [16.472295458683696]
Multi-modal learning adeptly integrates visual and textual data, but its application to histopathology image and text analysis remains challenging.
We introduce Multi-Level Text-Guided Representation End-to-End Learning (mTREE)
This novel text-guided approach effectively captures multi-scale Whole Slide Images (WSIs) by utilizing accompanying textual pathology information.
arXiv Detail & Related papers (2024-05-28T04:47:44Z)
- Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control [73.6361029556484]
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs.
We consider pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts.
We show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
arXiv Detail & Related papers (2024-05-09T15:39:54Z)
- SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment [11.556516260190737]
Multimodal alignment between language and vision is the fundamental topic in current vision-language model research.
This paper proposes Contrastive Captioners (CoCa) to integrate Contrastive Language-Image Pretraining (CLIP) and Image Caption (IC) into a unified framework.
arXiv Detail & Related papers (2024-01-04T08:42:36Z)
- DPL: Decoupled Prompt Learning for Vision-Language Models [41.90997623029582]
We propose a new method, Decoupled Prompt Learning, which reformulates the attention in prompt learning to alleviate this problem.
Our approach is flexible for both visual and textual modalities, making it easily extendable to multi-modal prompt learning.
arXiv Detail & Related papers (2023-08-19T15:48:38Z)
- ECO: Ensembling Context Optimization for Vision-Language Models [22.32996522125523]
We show that learning diverse and possibly shorter contexts considerably and consistently improves results.
We report better few-shot capabilities with no additional cost at inference time.
arXiv Detail & Related papers (2023-07-26T09:31:06Z)
- Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on harmonic-mean over eleven classification benchmarks.
arXiv Detail & Related papers (2023-03-30T06:02:40Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer [61.34424171458634]
We study whether integrating visual knowledge into a language model can fill the gap.
Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings.
arXiv Detail & Related papers (2022-03-14T22:02:40Z)