Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning
- URL: http://arxiv.org/abs/2405.05615v1
- Date: Thu, 9 May 2024 08:23:20 GMT
- Title: Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning
- Authors: Shibo Jie, Yehui Tang, Ning Ding, Zhi-Hong Deng, Kai Han, Yunhe Wang,
- Abstract summary: Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm.
We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information.
We introduce a novel approach, wherein visual prompts are memoryd with the weights of FFN for visual knowledge injection.
- Score: 59.13366859237086
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm: projecting the output of pre-trained vision encoders to the input space of pre-trained language models as visual prompts; and then transferring the models to downstream VL tasks via end-to-end parameter-efficient fine-tuning (PEFT). However, this paradigm still exhibits inefficiency since it significantly increases the input length of the language models. In this paper, in contrast to integrating visual prompts into inputs, we regard visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information. Motivated by the finding that Feed-Forward Network (FFN) of language models acts as "key-value memory", we introduce a novel approach termed memory-space visual prompting (MemVP), wherein visual prompts are concatenated with the weights of FFN for visual knowledge injection. Experimental results across various VL tasks and language models reveal that MemVP significantly reduces the training time and inference latency of the finetuned VL models and surpasses the performance of previous PEFT methods. Code: https://github.com/JieShibo/MemVP
Related papers
- FiVL: A Framework for Improved Vision-Language Alignment [10.184567639685321]
We introduce FiVL, a novel method for constructing datasets designed to train LVLMs for enhanced visual grounding.
These datasets can be utilized for both training and assessing an LVLM's ability to use image content as substantive evidence.
To demonstrate the utility of our dataset, we introduce an innovative training task that outperforms baselines alongside a validation method and application for explainability.
arXiv Detail & Related papers (2024-12-19T09:24:10Z) - MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding [6.538592344967826]
We introduce MUSE-VL, a Unified Vision-Language Model Semantic through discrete -language for multimodal understanding and generation.
The proposed model significantly surpasses the previous state-of-the-art in various vision-language benchmarks and achieves better performance than dedicated understanding models.
arXiv Detail & Related papers (2024-11-26T03:33:52Z) - Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG)
arXiv Detail & Related papers (2024-11-21T16:33:30Z) - ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by an average accuracy of 0.77% on ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z) - A-VL: Adaptive Attention for Large Vision-Language Models [10.027871150748956]
Large Vision-Language Model (LVLM) integrates computer vision and natural language processing techniques, offering substantial application potential.
Current adaptive attention methods significantly reduce the memory requirements of Transformer-based language models.
We observe that LVLMs generate responses from both remote image tokens and local text tokens, and different modalities have different attention patterns.
We develop A-VL, a plug-and-play adaptive attention tailored for LVLM inference.
arXiv Detail & Related papers (2024-09-23T09:22:59Z) - Expedited Training of Visual Conditioned Language Generation via
Redundancy Reduction [61.16125290912494]
$textEVL_textGen$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z) - Adapting Pre-trained Language Models to Vision-Language Tasks via
Dynamic Visual Prompting [83.21164539349273]
Pre-trained language models (PLMs) have played an increasing role in multimedia research.
In this paper, we focus on exploring PLMs as a stand-alone model for vision-language reasoning tasks.
We propose a novel transfer learning approach for PLMs, termed Dynamic Visual Prompting (DVP)
arXiv Detail & Related papers (2023-06-01T07:19:28Z) - Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD can be faster adapted to downstream visual perception tasks using the proposed VPD.
arXiv Detail & Related papers (2023-03-03T18:59:47Z) - PEVL: Position-enhanced Pre-training and Prompt Tuning for
Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding.
arXiv Detail & Related papers (2022-05-23T10:17:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.