Adapting Pre-trained Language Models to Vision-Language Tasks via
Dynamic Visual Prompting
- URL: http://arxiv.org/abs/2306.00409v2
- Date: Tue, 22 Aug 2023 07:45:09 GMT
- Title: Adapting Pre-trained Language Models to Vision-Language Tasks via
Dynamic Visual Prompting
- Authors: Shubin Huang, Qiong Wu, Yiyi Zhou, Weijie Chen, Rongsheng Zhang,
Xiaoshuai Sun, Rongrong Ji
- Abstract summary: Pre-trained language models (PLMs) have played an increasing role in multimedia research.
In this paper, we focus on exploring PLMs as a stand-alone model for vision-language reasoning tasks.
We propose a novel transfer learning approach for PLMs, termed Dynamic Visual Prompting (DVP).
- Score: 83.21164539349273
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models (PLMs) have played an increasing role in
multimedia research. In terms of vision-language (VL) tasks, they often serve
as a language encoder and still require an additional fusion network for VL
reasoning, resulting in excessive memory overhead. In this paper, we focus on
exploring PLMs as a stand-alone model for VL reasoning tasks. Inspired by the
recently popular prompt tuning, we first show that the processed visual
features can also be projected onto the semantic space of PLMs and act as
prompt tokens to bridge the gap between single- and multi-modal learning.
However, this solution exhibits obvious redundancy in visual information and
model inference, and the placement of prompt tokens also greatly affects the
final performance. Based on these observations, we further propose a novel
transfer learning approach for PLMs, termed Dynamic Visual Prompting (DVP).
Concretely, DVP first deploys a cross-attention module to obtain text-related
and compact visual prompt tokens, thereby greatly reducing the input length of
PLMs. To obtain the optimal placement, we also equip DVP with a
reinforcement-learning based search algorithm, which can automatically merge
DVP with PLMs for different VL tasks via a very short search process. In
addition, we also combine DVP with the recently popular adapter approach to
keep most of the PLM parameters intact when adapting to VL tasks, helping PLMs
achieve a quick shift between single- and multi-modal tasks. We apply DVP to
two representative PLMs, namely BERT and T5, and conduct extensive experiments
on a set of VL reasoning benchmarks including VQA2.0, GQA, and SNLI-VE. The
experimental results not only show the advantage of DVP on efficiency and
performance, but also confirm its superiority in adapting pre-trained language
models to VL tasks.
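To make the mechanism concrete, below is a minimal, hypothetical PyTorch sketch of the two ideas in the abstract: a cross-attention module that compresses visual features into a few text-related prompt tokens, and the insertion of those tokens at a chosen position in the PLM's input sequence (the position being what the reinforcement-learning-based search would select). The module names, dimensions, and the text-conditioning trick are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of DVP-style dynamic visual prompting (illustrative only).
# Assumes a frozen PLM that accepts precomputed input embeddings; all names,
# dimensions, and the text-conditioning of the queries are assumptions.
import torch
import torch.nn as nn


class DynamicVisualPrompt(nn.Module):
    """Compress a grid of visual features into k text-conditioned prompt tokens."""

    def __init__(self, vis_dim: int, plm_dim: int, num_prompts: int = 4, num_heads: int = 8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, plm_dim)  # project visual features into the PLM's semantic space
        self.queries = nn.Parameter(torch.randn(num_prompts, plm_dim))  # learnable prompt queries
        self.cross_attn = nn.MultiheadAttention(plm_dim, num_heads, batch_first=True)

    def forward(self, vis_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # vis_feats: (B, N_vis, vis_dim); text_embeds: (B, N_txt, plm_dim)
        vis = self.vis_proj(vis_feats)
        # Condition the queries on the text (here, crudely, via the mean text embedding),
        # so the resulting prompts are text-related rather than generic.
        q = self.queries.unsqueeze(0) + text_embeds.mean(dim=1, keepdim=True)
        prompts, _ = self.cross_attn(q, vis, vis)  # (B, num_prompts, plm_dim)
        return prompts


def insert_prompts(text_embeds: torch.Tensor, prompts: torch.Tensor, position: int) -> torch.Tensor:
    """Insert the compact prompt tokens at a given position of the text sequence.

    In DVP, choosing where to inject the prompts is the placement problem that
    the reinforcement-learning-based search is meant to solve automatically.
    """
    return torch.cat([text_embeds[:, :position], prompts, text_embeds[:, position:]], dim=1)
```

The resulting prompt-plus-text embedding sequence can then be fed to a frozen or adapter-equipped PLM (e.g., via the inputs_embeds argument of Hugging Face BERT and T5 implementations), so the visual branch adds only num_prompts tokens to the input length.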
Related papers
- ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by an average accuracy margin of 0.77% on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z) - Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm.
We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information.
We introduce a novel approach wherein visual prompts are merged into the weights of the feed-forward network (FFN) for visual knowledge injection (see the sketch after this list).
arXiv Detail & Related papers (2024-05-09T08:23:20Z) - Progressive Multi-modal Conditional Prompt Tuning [92.50645776024624]
Pre-trained vision-language models (VLMs) have shown remarkable generalization capabilities via prompting.
We propose a novel method, Progressive Multi-modal conditional Prompt Tuning (ProMPT)
ProMPT exploits a recurrent structure, iteratively using the image and the current encoding information to optimize and align vision-language features.
arXiv Detail & Related papers (2024-04-18T02:40:31Z) - MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning [28.254318215697527]
Vision-Language models (VLMs) come with high computational costs due to their large number of parameters.
Existing pruning techniques for VLMs are task-specific, and thus require pruning the network from scratch for each new task of interest.
We explore a new direction: Task-Agnostic Vision-Language Pruning (TA-VLP).
We propose Multimodal Flow Pruning (MULTIFLOW), a first gradient-free pruning framework for TA-VLP.
arXiv Detail & Related papers (2024-04-08T15:51:21Z) - VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z) - Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large
Language Models [77.2078051555533]
We propose a novel and affordable solution for the effective VL adaptation of large language models (LLMs), termed Mixture-of-Modality Adaptation (MMA).
Instead of using large neural networks to connect the image encoder and LLM, MMA adopts lightweight modules, i.e., adapters.
MMA is also equipped with a routing algorithm to help LLMs achieve an automatic shift between single- and multi-modal instructions.
arXiv Detail & Related papers (2023-05-24T11:06:15Z) - Towards Versatile and Efficient Visual Knowledge Integration into
Pre-trained Language Models with Cross-Modal Adapters [16.44174900423759]
We propose a new plug-and-play module, X-adapter, to leverage the aligned visual and textual knowledge learned in pre-trained vision-language models.
Our method can significantly improve the performance on object-color reasoning and natural language understanding tasks.
arXiv Detail & Related papers (2023-05-12T10:08:46Z) - PEVL: Position-enhanced Pre-training and Prompt Tuning for
Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding.
arXiv Detail & Related papers (2022-05-23T10:17:53Z)