Understanding the Multi-modal Prompts of the Pre-trained Vision-Language
Model
- URL: http://arxiv.org/abs/2312.11570v3
- Date: Tue, 12 Mar 2024 01:19:05 GMT
- Title: Understanding the Multi-modal Prompts of the Pre-trained Vision-Language
Model
- Authors: Shuailei Ma, Chen-Wei Xie, Ying Wei, Siyang Sun, Jiaqi Fan, Xiaoyi
Bao, Yuxin Guo, Yun Zheng
- Abstract summary: We conduct a direct analysis of the multi-modal prompts by asking the following questions.
$(i)$ How do the learned multi-modal prompts improve the recognition performance?
$(ii)$ What do the multi-modal prompts learn?
- Score: 15.828023370166411
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompt learning has emerged as an efficient alternative for fine-tuning
foundational models, such as CLIP, for various downstream tasks. However, there
is no work that provides a comprehensive explanation for the working mechanism
of the multi-modal prompts. In this paper, we conduct a direct analysis of the
multi-modal prompts by asking the following questions: $(i)$ How do the learned
multi-modal prompts improve the recognition performance? $(ii)$ What do the
multi-modal prompts learn? To answer these questions, we begin by isolating the
components of the self-attention formula that the prompts influence at each
layer, in two distinct ways: $(1)$ introducing prompt embeddings makes the
$[cls]$ token focus on foreground objects, and $(2)$ the prompts learn a bias
term during the update of token embeddings, allowing the model to adapt to the
target domain. Subsequently, we conduct extensive visualization and statistical
experiments on eleven diverse downstream recognition datasets. These experiments
reveal that the learned prompts improve performance mainly through the second
mechanism, acting as a dataset-specific bias that improves the recognition
performance of the pre-trained model on the corresponding dataset. We further
propose a bias-tuning approach to validate this finding. With a deeper
understanding of multi-modal prompts, we hope our work can inspire new and solid
research in this direction.
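The two effects described in the abstract can be read off directly from the
self-attention update once learnable prompt tokens are appended to the input
sequence. Writing $e_p$ for the prompt embeddings, the updated embedding of an
original token $x_i$ splits into a part attending over the original tokens and
an additive term contributed by the prompts,
$$\mathrm{Attn}(x_i) = \sum_{j} a_{ij} W_V x_j + \sum_{p} a_{ip} W_V e_p,$$
where the attention weights $a_{i\cdot}$ are computed jointly over tokens and
prompts, and the second sum behaves as a learned bias. The sketch below
illustrates this decomposition and the bias-tuning variant mentioned in the
abstract. It is a minimal, hypothetical re-implementation written for clarity,
not the authors' code; the single-head simplification, tensor shapes, and
function names are all assumptions.

```python
import torch
import torch.nn.functional as F

def attention_with_prompts(x, prompts, w_qkv, w_out):
    """x: (N, d) token embeddings with the [cls] token first; prompts: (P, d)
    learnable prompt embeddings; w_qkv: (d, 3*d); w_out: (d, d). Single head."""
    z = torch.cat([x, prompts], dim=0)                     # prompts join the token sequence
    q, k, v = (z @ w_qkv).chunk(3, dim=-1)                 # joint query/key/value projection
    attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    out = (attn @ v) @ w_out                               # updated embeddings for all positions
    n = x.shape[0]
    # Effect (1): attn[0, n:] is the attention mass the [cls] token places on the prompts,
    # which re-weights how it attends to the original (foreground) tokens.
    # Effect (2): the prompt values add an extra term to every updated token embedding,
    # i.e. a learned, dataset-specific bias:
    prompt_bias = (attn[:n, n:] @ v[n:]) @ w_out
    return out[:n], prompt_bias

def attention_with_bias_tuning(x, bias, w_qkv, w_out):
    """Bias-tuning variant: drop the prompt tokens and learn the additive term
    directly (bias of shape (1, d) or (N, d)) on top of the frozen attention."""
    q, k, v = (x @ w_qkv).chunk(3, dim=-1)
    attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return (attn @ v) @ w_out + bias
```

Note that appending prompts also changes the softmax normalization over the
original tokens (this is effect $(1)$, the re-weighting of where the $[cls]$
token looks), so the additive-bias view isolates only the second effect; the
bias-tuning variant learns that additive term directly, without prompt tokens.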
Related papers
- Generative Multi-Modal Knowledge Retrieval with Large Language Models [75.70313858231833]
We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases.
We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
arXiv Detail & Related papers (2024-01-16T08:44:29Z)
- On the Role of Attention in Prompt-tuning [90.97555030446563]
We study prompt-tuning for one-layer attention architectures and study contextual mixture-models.
We show that softmax-prompt-attention is provably more expressive than softmax-self-attention and linear-prompt-attention.
We also provide experiments that verify our theoretical insights on real datasets and demonstrate how prompt-tuning enables the model to attend to context-relevant information.
arXiv Detail & Related papers (2023-06-06T06:23:38Z)
- PLAR: Prompt Learning for Action Recognition [56.57236976757388]
We present a new general learning approach, Prompt Learning for Action Recognition (PLAR)
Our approach is designed to predict the action label by helping the models focus on the descriptions or instructions associated with actions in the input videos.
We observe a 3.1-7.2% accuracy improvement on the aerial multi-agent dataset Okutama and a 1.0-3.6% improvement on the ground-camera single-agent dataset Something-Something V2.
arXiv Detail & Related papers (2023-05-21T11:51:09Z)
- Multi-Prompt with Depth Partitioned Cross-Modal Learning [25.239388488952375]
Partitioned Multi-modal Prompt (PMPO) is a multi-modal prompting technique that extends the soft prompt from a single learnable prompt to multiple prompts.
Our method divides the visual encoder depths and connects learnable prompts to the separated visual depths, enabling different prompts to capture hierarchical contextual depths.
We evaluate the effectiveness of our approach on three challenging tasks: new class generalization, cross-dataset evaluation, and domain generalization.
arXiv Detail & Related papers (2023-05-10T14:54:29Z)
- Dynamic Prompting: A Unified Framework for Prompt Tuning [33.175097465669374]
We present a unified dynamic prompt (DP) tuning strategy that dynamically determines different factors of prompts based on specific tasks and instances.
Experimental results underscore the significant performance improvement achieved by dynamic prompt tuning across a wide range of tasks.
We establish the universal applicability of our approach under full-data, few-shot, and multitask scenarios.
arXiv Detail & Related papers (2023-03-06T06:04:46Z)
- Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models [61.97890177840515]
Humans use cross-modal information to learn new concepts efficiently.
We propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities.
arXiv Detail & Related papers (2023-01-16T05:40:42Z)
- Instance-aware Prompt Learning for Language Understanding and Generation [49.22899822734549]
We propose an instance-aware prompt learning method that learns a different prompt for each instance.
Our method achieves the state-of-the-art on the SuperGLUE few-shot learning benchmark.
arXiv Detail & Related papers (2022-01-18T17:03:25Z)
- Single-Modal Entropy based Active Learning for Visual Question Answering [75.1682163844354]
We address Active Learning in the multi-modal setting of Visual Question Answering (VQA).
In light of the multi-modal inputs, image and question, we propose a novel method for effective sample acquisition.
Our novel idea is simple to implement, cost-efficient, and readily adaptable to other multi-modal tasks.
arXiv Detail & Related papers (2021-10-21T05:38:45Z)