An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
- URL: http://arxiv.org/abs/2403.06764v3
- Date: Mon, 2 Sep 2024 05:48:54 GMT
- Title: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
- Authors: Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang,
- Abstract summary: We find out that the attention computation over visual tokens is of extreme inefficiency in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
- Score: 65.37846460916042
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we identify the inefficient attention phenomena in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find out that the attention computation over visual tokens is of extreme inefficiency in the deep layers of popular LVLMs, suggesting a need for a sparser approach compared to textual data handling. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45 reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance in a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV are highly customizable and pareto-efficient. It can compress the FLOPs of a 13B-parameter model to achieve a lower budget than that of a 7B-parameter model, while still maintaining superior performance. We believe FastV has practical values for deployment of LVLMs in edge devices and commercial models. Code is released at https://github.com/pkunlp-icler/FastV.
Related papers
- VisionZip: Longer is Better but Not Necessary in Vision Language Models [53.199716363090154]
Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens.
Visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy.
We introduce VisionZip, a method that selects a set of informative tokens for input to the language model.
arXiv Detail & Related papers (2024-12-05T18:59:53Z) - [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster [26.025260449905577]
Existing methods assess the importance of visual tokens based on the text-visual cross-attentions in large language models (LLMs)
We introduce FasterVLM, a training-free visual token pruning method that evaluates the importance of visual tokens more accurately.
FasterVLM can prune 95% of visual tokens while maintaining 90% of the performance of LLaVA-1.5-7B.
arXiv Detail & Related papers (2024-12-02T18:57:40Z) - Inference Optimal VLMs Need Only One Visual Token but Larger Models [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks.
VLMs are often constrained by high latency during inference due to substantial compute required to process the large number of input tokens.
We take some initial steps towards building approaches tailored for high token compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z) - ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification [29.163757099307553]
The efficiency of large vision-language models (LVLMs) is constrained by the computational bottleneck of the attention mechanism during the prefill phase.
We present ZipVL, an efficient inference framework designed for LVLMs through a dynamic ratio allocation strategy of important tokens.
arXiv Detail & Related papers (2024-10-11T07:24:21Z) - VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z) - AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation [29.34754905469359]
AVESFormer is the first real-time Audio-Visual Efficient transformer that achieves fast, efficient and light-weight simultaneously.
AVESFormer significantly enhances model performance, achieving 79.9% on S4, 57.9% on MS3 and 31.2% on AVSS.
arXiv Detail & Related papers (2024-08-03T08:25:26Z) - Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm.
We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information.
We introduce a novel approach, wherein visual prompts are memoryd with the weights of FFN for visual knowledge injection.
arXiv Detail & Related papers (2024-05-09T08:23:20Z) - PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning [78.23573511641548]
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications.
Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources.
This paper investigates a straight-forward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for video understanding.
arXiv Detail & Related papers (2024-04-25T19:29:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.