APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning
- URL: http://arxiv.org/abs/2401.06827v2
- Date: Tue, 23 Jan 2024 08:54:15 GMT
- Title: APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning
- Authors: Guiming Cao, Kaize Shi, Hong Fu, Huaiwen Zhang and Guandong Xu
- Abstract summary: We propose Token-wise Adaptive for Multi-modal Prompt Learning (APLe), which tunes the prompts of both modalities, vision and language, as tokens in a sequential manner.
APLe shows robustness and favourable performance in prompt-length experiments, with an absolute advantage in adopting V-L models.
- Score: 15.844451999840588
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained Vision-Language (V-L) models set the benchmark for generalization
to downstream tasks among the noteworthy contenders. Many characteristics of
the V-L model have been explored in existing research, including the challenges
of sensitivity to text input and of the tuning process across multi-modal
prompts. With the advanced utilization of V-L models such as CLIP, recent
approaches deploy learnable prompts instead of hand-crafted prompts to boost
generalization performance and address the aforementioned challenges. Inspired
by layer-wise training, which is widely used in image fusion, we note that
using a sequential training process to adapt the different modality branches of
CLIP efficiently facilitates improved generalization. To address the
multi-modal prompting challenge, we propose Token-wise Adaptive for Multi-modal
Prompt Learning (APLe), which tunes the prompts of both modalities, vision and
language, as tokens in a sequential manner. APLe addresses the challenges in
V-L models to promote prompt learning across both modalities, yielding
competitive generalization performance in line with the state of the art.
Notably, APLe shows robustness and favourable performance in prompt-length
experiments, with an absolute advantage in adopting V-L models.
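To make the sequential-adaptation idea concrete, below is a minimal sketch (not the authors' code; `ToyEncoder`, `PromptedCLIP`, and `train_sequentially` are hypothetical stand-ins for a frozen CLIP backbone and the paper's training loop): learnable prompt tokens are prepended to both branches and one modality's prompts are optimized at a time, which is the sequential adaptation the abstract describes; APLe's full token-wise schedule is not reproduced here.

```python
# Hedged sketch of sequential multi-modal prompt tuning on a frozen V-L model.
# Toy encoders stand in for CLIP's text/image branches; only the prompt tokens train.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Stand-in for one frozen CLIP transformer branch (text or image)."""
    def __init__(self, dim=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens):                           # tokens: (B, L, dim)
        x = self.backbone(tokens)
        return F.normalize(self.proj(x[:, 0]), dim=-1)   # pooled, unit-norm feature

class PromptedCLIP(nn.Module):
    def __init__(self, dim=512, n_prompts=4):
        super().__init__()
        self.text_enc, self.image_enc = ToyEncoder(dim), ToyEncoder(dim)
        for p in self.parameters():                      # freeze the pre-trained branches
            p.requires_grad_(False)
        # learnable prompt tokens, one set per modality (defined after the freeze)
        self.text_prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.image_prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.logit_scale = nn.Parameter(torch.tensor(4.6), requires_grad=False)

    def forward(self, text_tokens, image_tokens):
        B = text_tokens.size(0)
        t = torch.cat([self.text_prompts.expand(B, -1, -1), text_tokens], dim=1)
        v = torch.cat([self.image_prompts.expand(B, -1, -1), image_tokens], dim=1)
        t_feat, v_feat = self.text_enc(t), self.image_enc(v)
        return self.logit_scale.exp() * v_feat @ t_feat.t()   # (B, B) similarity logits

def train_sequentially(model, loader, epochs_per_modality=1):
    """Adapt one modality's prompt tokens at a time: language branch, then vision."""
    for prompts in (model.text_prompts, model.image_prompts):
        opt = torch.optim.SGD([prompts], lr=0.01)
        for _ in range(epochs_per_modality):
            for text_tokens, image_tokens in loader:      # paired batches of token embeddings
                logits = model(text_tokens, image_tokens)
                target = torch.arange(logits.size(0))     # matching pairs on the diagonal
                loss = F.cross_entropy(logits, target)
                opt.zero_grad(); loss.backward(); opt.step()
```

Here `loader` is assumed to yield paired batches of already-embedded text and image tokens; APLe itself works with CLIP's real tokenizer, encoders, and a token-wise adaptation schedule, all of which this toy omits.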
Related papers
- The Power of Adaptation: Boosting In-Context Learning through Adaptive Prompting [8.260097638532878]
Large Language Models (LLMs) have demonstrated exceptional abilities across a broad range of language-related tasks.
We propose Adaptive-Prompt, a novel method that adaptively selects exemplars by leveraging model feedback.
Experimental results show that Adaptive-Prompt significantly enhances LLM performance across a variety of reasoning tasks.
arXiv Detail & Related papers (2024-12-23T15:49:43Z) - Modality-Inconsistent Continual Learning of Multimodal Large Language Models [37.15220266767881]
We introduce Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for Multimodal Large Language Models (MLLMs).
Unlike existing vision-only or modality-incremental settings, MICL combines modality and task type shifts, both of which drive catastrophic forgetting.
We propose MoInCL, which employs a Pseudo Targets Generation Module to mitigate forgetting caused by task type shifts in previously seen modalities.
arXiv Detail & Related papers (2024-12-17T16:13:56Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, reflecting their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - MuAP: Multi-step Adaptive Prompt Learning for Vision-Language Model with Missing Modality [11.03329286331929]
We present the first comprehensive investigation into prompt learning behavior when modalities are incomplete.
We propose a novel Multi-step Adaptive Prompt Learning framework, aiming to generate multimodal prompts and perform multi-step prompt tuning.
arXiv Detail & Related papers (2024-09-07T03:33:46Z) - Advancing Prompt Learning through an External Layer [24.77977865016954]
We propose a paradigm called EnPrompt with a novel External Layer (EnLa).
The learnable external layer is built upon valid embeddings of pre-trained CLIP.
Four experiments demonstrate that our method outperforms existing prompt learning methods.
arXiv Detail & Related papers (2024-07-29T03:30:09Z) - Prompt Highlighter: Interactive Control for Multi-Modal LLMs [50.830448437285355]
This study targets a critical aspect of multi-modal LLM (LLMs & VLMs) inference: explicit controllable text generation.
We introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation.
We find that, during inference, guiding the models with highlighted tokens through the attention weights leads to more desired outputs.
arXiv Detail & Related papers (2023-12-07T13:53:29Z) - APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z) - Context-Aware Prompt Tuning for Vision-Language Model with Dual-Alignment [15.180715595425864]
We introduce a novel method to improve the prompt learning of vision-language models by incorporating pre-trained large language models (LLMs).
With DuAl-PT, we propose to learn more context-aware prompts, benefiting from both explicit and implicit context modeling.
Empirically, DuAl-PT achieves superior performance on 11 downstream datasets on few-shot recognition and base-to-new generalization.
arXiv Detail & Related papers (2023-09-08T06:51:15Z) - Self-regulating Prompts: Foundational Model Adaptation without Forgetting [112.66832145320434]
We introduce a self-regularization framework for prompting called PromptSRC.
PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations.
arXiv Detail & Related papers (2023-07-13T17:59:35Z) - Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT [58.70209492842953]
In this paper, we focus on adapting prompt design based on instruction tuning to a visual transformer model for image classification.
The key idea is to implement multi-modal prompts related to category information to guide the fine-tuning of the model.
Experiments on several image captioning tasks show improved performance and domain adaptability.
arXiv Detail & Related papers (2023-04-29T08:59:12Z) - MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.