An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
- URL: http://arxiv.org/abs/2403.06764v3
- Date: Mon, 2 Sep 2024 05:48:54 GMT
- Title: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
- Authors: Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang
- Abstract summary: We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
- Score: 65.37846460916042
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we identify inefficient attention phenomena in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs, suggesting a need for a sparser approach than is used for textual data. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance in a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV is highly customizable and Pareto-efficient: it can compress the FLOPs of a 13B-parameter model below the budget of a 7B-parameter model while still maintaining superior performance. We believe FastV has practical value for deploying LVLMs on edge devices and in commercial models. Code is released at https://github.com/pkunlp-icler/FastV.
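The pruning step the abstract describes (score visual tokens by the attention they receive in an early layer, keep the top half, and run all deeper layers on the shortened sequence) is simple enough to sketch. The following is a minimal PyTorch illustration based only on the abstract; the function name, signature, and fixed 50% keep ratio are our assumptions, not the released FastV API:

```python
import torch

def prune_visual_tokens(hidden_states: torch.Tensor,
                        attn_weights: torch.Tensor,
                        vis_start: int, vis_len: int,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Drop the least-attended visual tokens after an early filtering layer.

    hidden_states: (batch, seq, dim) activations entering layer K+1
    attn_weights:  (batch, heads, seq, seq) attention maps from layer K
    vis_start/vis_len: position and count of visual tokens in the sequence
    """
    # Average attention each visual token receives, over heads and queries.
    received = attn_weights.mean(dim=1)[:, :, vis_start:vis_start + vis_len].mean(dim=1)

    k = max(1, int(vis_len * keep_ratio))
    keep = received.topk(k, dim=-1).indices.sort(dim=-1).values  # preserve order
    keep = keep.unsqueeze(-1).expand(-1, -1, hidden_states.size(-1))

    vis = hidden_states[:, vis_start:vis_start + vis_len]
    kept_vis = torch.gather(vis, 1, keep)

    return torch.cat([hidden_states[:, :vis_start],
                      kept_vis,
                      hidden_states[:, vis_start + vis_len:]], dim=1)
```

Fed the attention maps of layer K (the title suggests K = 2 with a 50% ratio), every deeper layer then runs on roughly half as many visual tokens, which is where the quoted FLOP savings come from; note that the positions of kept tokens shift, so position ids and the KV cache must be handled accordingly in a real decoder.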
Related papers
- InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression [1.8893427856534721]
We propose InternVL-X, which outperforms the InternVL model in both performance and efficiency.
By utilizing 20% or fewer visual tokens, InternVL-X achieves state-of-the-art performance on 7 public MLLM benchmarks, and improves the average metric by 2.34% across 12 tasks.
arXiv Detail & Related papers (2025-03-27T09:31:35Z)
- Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping [13.846838416902575]
A key bottleneck stems from the proliferation of visual tokens required for fine-grained image understanding.
We propose Skip-Vision, a unified framework addressing both training and inference inefficiencies in vision-language models.
Experimental results demonstrate that Skip-Vision reduces training time by up to 35%, inference FLOPs by 75%, and latency by 45%.
arXiv Detail & Related papers (2025-03-26T04:16:48Z)
- FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression [16.53645461974695]
Current training-free visual token compression methods exhibit serious performance degradation in tasks involving high-resolution images.
We propose an efficient visual token compression framework for text-oriented Vision Large Language Models (VLLMs) in high-resolution scenarios.
Our approach significantly reduces computational overhead while outperforming the baselines across a range of text-oriented benchmarks.
arXiv Detail & Related papers (2025-02-22T16:05:33Z)
- VisionZip: Longer is Better but Not Necessary in Vision Language Models [53.199716363090154]
Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens.
Visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy.
We introduce VisionZip, a method that selects a set of informative tokens for input to the language model.
arXiv Detail & Related papers (2024-12-05T18:59:53Z)
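The VisionZip summary above only says that a small set of informative tokens is selected. A common instantiation of that idea, and a plausible reading of the abstract, is to rank the encoder's patch tokens by the attention the [CLS] token pays them and keep the top k; the sketch below is our own illustration under that assumption, not VisionZip's released code:

```python
import torch

def select_informative_tokens(patch_tokens: torch.Tensor,
                              cls_attn: torch.Tensor,
                              k: int = 64) -> torch.Tensor:
    """Keep the k patch tokens the [CLS] token attends to most.

    patch_tokens: (batch, patches, dim) from a CLIP/SigLIP encoder
    cls_attn:     (batch, heads, patches) last-layer attention of [CLS]
    """
    scores = cls_attn.mean(dim=1)                              # (B, P)
    idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values   # keep spatial order
    idx = idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1))
    return torch.gather(patch_tokens, 1, idx)                  # (B, k, dim)
```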
- Inference Optimal VLMs Need Only One Visual Token but Larger Models [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks.
VLMs are often constrained by high latency during inference due to the substantial compute required to process the large number of input tokens.
We take some initial steps towards building approaches tailored for high token compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
- ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression [29.163757099307553]
We present ZipVL, an efficient inference framework designed for large vision-language models (LVLMs).
ZipVL resolves both computation and memory bottlenecks through a dynamic ratio allocation strategy for important tokens.
Experiments demonstrate that ZipVL can accelerate the prefill phase by 2.6× and reduce GPU memory usage by 50.0%.
arXiv Detail & Related papers (2024-10-11T07:24:21Z)
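One way to read ZipVL's "dynamic ratio allocation" is that the keep ratio is not fixed per layer: each layer retains the smallest set of tokens whose cumulative attention mass crosses a threshold, so layers with concentrated attention prune aggressively while diffuse layers keep more. The sketch below encodes that reading; the function, threshold, and scoring are our assumptions rather than ZipVL's actual criterion:

```python
import torch

def dynamic_keep_mask(attn_probs: torch.Tensor, tau: float = 0.95) -> torch.Tensor:
    """Mark the smallest token set covering a tau fraction of attention mass.

    attn_probs: (batch, heads, q_len, k_len) softmax attention of one layer
    returns:    (batch, k_len) boolean mask of tokens deemed important
    """
    # Attention mass each key token receives, normalized per sequence.
    received = attn_probs.mean(dim=(1, 2))
    received = received / received.sum(dim=-1, keepdim=True)

    sorted_p, order = received.sort(dim=-1, descending=True)
    csum = sorted_p.cumsum(dim=-1)
    keep_sorted = (csum - sorted_p) < tau   # stop once mass first exceeds tau

    return torch.zeros_like(keep_sorted).scatter(1, order, keep_sorted)
```

Unimportant tokens can then be dropped from the computation and their KV cache entries stored more cheaply, matching the paper's twin goals of prefill speed and memory savings.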
- VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach that reduces vision compute by letting redundant vision tokens "skip layers" rather than by decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
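In mixture-of-depths style computation, a router decides which tokens actually enter a layer's attention/FFN, while the rest ride the residual stream unchanged, so tokens are skipped rather than deleted. A generic sketch of one such layer follows; it illustrates the mechanism named in the summary and is not the VideoLLM-MoD implementation:

```python
import torch
import torch.nn as nn

class MoDLayer(nn.Module):
    """Wrap a transformer block so only a fraction of tokens is computed."""

    def __init__(self, block: nn.Module, dim: int, capacity: float = 0.5):
        super().__init__()
        self.block = block               # any (B, S, D) -> (B, S, D) module
        self.router = nn.Linear(dim, 1)  # learned per-token importance score
        self.capacity = capacity         # fraction of tokens that compute

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.router(x).squeeze(-1)                  # (B, S)
        k = max(1, int(x.size(1) * self.capacity))
        idx = scores.topk(k, dim=1).indices
        idx = idx.unsqueeze(-1).expand(-1, -1, x.size(-1))
        computed = self.block(torch.gather(x, 1, idx))
        return x.scatter(1, idx, computed)  # skipped tokens pass through as-is
```

Applied to a video stream, the capacity can be set low for vision positions and left at 1.0 for text, reducing compute without shortening the sequence.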
- AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation [29.34754905469359]
AVESFormer is the first real-time Audio-Visual Efficient transformer that is simultaneously fast, efficient, and lightweight.
AVESFormer significantly enhances model performance, achieving 79.9% on S4, 57.9% on MS3 and 31.2% on AVSS.
arXiv Detail & Related papers (2024-08-03T08:25:26Z)
- Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm.
We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information.
We introduce a novel approach wherein visual prompts are memorized within the weights of the FFN for visual knowledge injection.
arXiv Detail & Related papers (2024-05-09T08:23:20Z)
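Reading the FFN as a key-value memory gives one concrete interpretation of visual prompts "memorized within the weights of the FFN": project visual features into extra rows of the FFN's two weight matrices, so visual knowledge is looked up by every token instead of occupying extra sequence positions. The sketch below is our interpretation with hypothetical names, not the paper's released code:

```python
import torch
import torch.nn.functional as F

def ffn_with_visual_memory(x: torch.Tensor,
                           w_in: torch.Tensor, w_out: torch.Tensor,
                           vis_keys: torch.Tensor,
                           vis_vals: torch.Tensor) -> torch.Tensor:
    """An FFN viewed as key-value memory, with visual prompts as extra slots.

    x:        (batch, seq, dim)  token activations
    w_in:     (dim, hidden)      up-projection, columns act as "keys"
    w_out:    (hidden, dim)      down-projection, rows act as "values"
    vis_keys: (num_vis, dim)     projected visual features as extra keys
    vis_vals: (num_vis, dim)     paired values carrying visual knowledge
    """
    h_text = x @ w_in                      # match tokens against learned keys
    h_vis = x @ vis_keys.transpose(0, 1)   # match tokens against visual keys
    h = F.gelu(torch.cat([h_text, h_vis], dim=-1))
    w_aug = torch.cat([w_out, vis_vals], dim=0)   # (hidden + num_vis, dim)
    return h @ w_aug
```

Because the visual information enters through weights rather than extra tokens, the sequence fed to the language model stays short, which is presumably where the efficiency gain comes from.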
- PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning [78.23573511641548]
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications.
Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources.
This paper investigates a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for video understanding.
arXiv Detail & Related papers (2024-04-25T19:29:55Z)
- MiniVLM: A Smaller and Faster Vision-Language Model [76.35880443015493]
MiniVLM consists of two modules, a vision feature extractor and a vision-language fusion module.
MiniVLM reduces the model size by 73% and the inference time cost by 94%.
arXiv Detail & Related papers (2020-12-13T03:02:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.