ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers
- URL: http://arxiv.org/abs/2504.00502v1
- Date: Tue, 01 Apr 2025 07:47:55 GMT
- Title: ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers
- Authors: Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun
- Abstract summary: Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. We introduce a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens. We propose ShortV, a training-free method that leverages LC to identify ineffective layers, and freezes visual token updates in these layers.
- Score: 42.85956522420325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers, and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV
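To make the LC definition concrete, here is a minimal, self-contained sketch of the measurement: each candidate layer is replaced by the identity on the visual tokens only, and the KL divergence between the resulting output distribution and a reference output serves as that layer's contribution score. The toy linear "layers", the stand-in LM head, and all shapes below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the Layer Contribution (LC) idea: for each layer, compare
# the model's output logits with and without that layer's transformation
# applied to a chosen subset of tokens (e.g., visual tokens).
import torch
import torch.nn.functional as F

def layer_contribution(layers, head, hidden, token_mask):
    """KL divergence between normal logits and logits obtained when one
    layer acts as the identity on the masked (visual) tokens."""
    # Reference forward pass through all layers.
    h = hidden
    for layer in layers:
        h = layer(h)
    ref_logits = head(h)

    scores = []
    for skip_idx in range(len(layers)):
        h = hidden
        for i, layer in enumerate(layers):
            out = layer(h)
            if i == skip_idx:
                # Freeze the masked tokens: keep their pre-layer states.
                out = torch.where(token_mask[None, :, None], h, out)
            h = out
        logits = head(h)
        kl = F.kl_div(
            F.log_softmax(logits, dim=-1),
            F.softmax(ref_logits, dim=-1),
            reduction="batchmean",
        )
        scores.append(kl.item())
    return scores  # low score => layer contributes little for those tokens

# Toy usage: 4 "layers", 10 tokens (first 6 visual), hidden size 16.
torch.manual_seed(0)
layers = [torch.nn.Linear(16, 16) for _ in range(4)]
head = torch.nn.Linear(16, 32)          # stand-in for the LM head
hidden = torch.randn(1, 10, 16)
visual_mask = torch.tensor([True] * 6 + [False] * 4)
print(layer_contribution(layers, head, hidden, visual_mask))
```

The same masking trick doubles as a freezing mechanism in the spirit of ShortV: in layers whose score falls below a threshold, visual tokens simply keep their incoming states.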
Related papers
- Window Token Concatenation for Efficient Visual Large Language Models [59.6094005814282]
We propose Window Token Concatenation (WiCo) to reduce visual tokens in Visual Large Language Models (VLLMs).
WiCo may group diverse tokens into one and thus obscure some fine details.
We perform extensive experiments on both coarse- and fine-grained visual understanding tasks based on LLaVA-1.5 and Shikra, showing better performance compared with existing token reduction projectors.
arXiv Detail & Related papers (2025-04-05T02:32:58Z)
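A minimal sketch of the window-concatenation idea from the WiCo entry above: adjacent visual tokens in each 2x2 spatial window are concatenated along the channel dimension and projected back to the model width, cutting the token count by 4x. The window size and the linear projector are assumptions, not the paper's exact design.

```python
# Hedged sketch of window token concatenation: tokens in each 2x2 window are
# fused into a single token via channel concatenation plus a projection.
import torch

def window_concat(tokens, grid, window=2, proj=None):
    """tokens: (B, H*W, C) visual tokens on a spatial grid of size grid=(H, W)."""
    b, n, c = tokens.shape
    h, w = grid
    x = tokens.view(b, h, w, c)
    # Gather each window x window patch into one token of dim window*window*c.
    x = x.view(b, h // window, window, w // window, window, c)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(
        b, (h // window) * (w // window), window * window * c)
    return proj(x) if proj is not None else x

tokens = torch.randn(1, 24 * 24, 64)               # 576 visual tokens
proj = torch.nn.Linear(2 * 2 * 64, 64)             # fuse each window back to C
out = window_concat(tokens, grid=(24, 24), proj=proj)
print(out.shape)                                   # torch.Size([1, 144, 64])
```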
- What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph [15.364317811275344]
We propose a graph-based method for training-free visual token pruning, termed G-Prune.
G-Prune regards visual tokens as nodes and constructs their connections based on their semantic similarities.
Experimental results show that G-Prune can greatly reduce computation overhead while retaining high performance on both coarse- and fine-grained tasks.
arXiv Detail & Related papers (2025-01-04T12:14:42Z)
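A hedged sketch of the graph-based pruning described in the G-Prune entry above: tokens become nodes of a cosine-similarity graph, importance is spread by a few PageRank-style propagation steps, and low-scoring tokens are dropped. The uniform seeding, damping factor, iteration count, and keep ratio are all assumptions for the demo.

```python
# Hedged sketch of graph-based visual token pruning: build a similarity graph
# over tokens, propagate importance scores, keep the top-scoring tokens.
import torch
import torch.nn.functional as F

def graph_prune(tokens, keep_ratio=0.4, iters=3):
    """tokens: (N, C) visual token features -> indices of kept tokens."""
    feats = F.normalize(tokens, dim=-1)
    adj = feats @ feats.T                      # cosine-similarity graph
    adj = adj.clamp(min=0).fill_diagonal_(0)
    adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # row-normalize
    score = torch.full((tokens.size(0),), 1.0 / tokens.size(0))
    for _ in range(iters):                     # PageRank-style propagation
        score = 0.85 * (adj.T @ score) + 0.15 / tokens.size(0)
    k = max(1, int(keep_ratio * tokens.size(0)))
    return score.topk(k).indices.sort().values # keep original token order

tokens = torch.randn(576, 64)
kept = graph_prune(tokens)
print(kept.shape)                              # torch.Size([230])
```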
- Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [69.35226485836641]
Excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation.
We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE).
DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
arXiv Detail & Related papers (2024-11-29T11:24:23Z)
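A rough sketch of the dynamic visual-token exit idea from the DyVTE entry above: a lightweight gate network pools the text-token states and predicts whether all visual tokens can be removed from the remaining layers. The MLP gate, mean pooling, and 0.5 threshold are assumptions rather than the paper's configuration.

```python
# Hedged sketch of dynamic visual-token exit: once a small gate network fires
# on the pooled text states, all visual tokens are dropped for good.
import torch

class ExitGate(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(dim, dim // 4), torch.nn.GELU(),
            torch.nn.Linear(dim // 4, 1),
        )

    def forward(self, text_hidden):            # (B, T_text, C)
        pooled = text_hidden.mean(dim=1)       # summarize text token status
        return torch.sigmoid(self.mlp(pooled)) # prob. that visuals may exit

def forward_with_exit(layers, gate, vis, txt, threshold=0.5):
    for layer in layers:
        if vis is not None and gate(txt).item() > threshold:
            vis = None                          # drop all visual tokens
        if vis is not None:
            h = layer(torch.cat([vis, txt], dim=1))
            vis, txt = h[:, : vis.size(1)], h[:, vis.size(1):]
        else:
            txt = layer(txt)                    # text-only from here on
    return txt

layers = [torch.nn.Linear(16, 16) for _ in range(4)]
gate = ExitGate(16)
out = forward_with_exit(layers, gate, torch.randn(1, 6, 16), torch.randn(1, 4, 16))
print(out.shape)                                # torch.Size([1, 4, 16])
```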
- $\gamma$-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models [87.43596173378913]
We propose an innovative strategy for existing MLLMs called $\gamma$-MoD.
In $\gamma$-MoD, a novel metric, the rank of attention maps (ARank), is proposed to guide the deployment of MoDs in the MLLM.
Based on ARank, we propose two novel designs to maximize the computational sparsity of the MLLM.
arXiv Detail & Related papers (2024-10-17T17:59:53Z)
- Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See [37.7015406019386]
Multimodal Large Language Models (MLLMs) treat visual tokens from visual encoders as text tokens.
As token counts grow, the quadratic scaling of computation in LLMs introduces an efficiency bottleneck.
In this study, we investigate the redundancy in visual computation at both the parameter and computational pattern levels within LLaVA.
arXiv Detail & Related papers (2024-10-08T16:13:24Z)
- Sparsity Meets Similarity: Leveraging Long-Tail Distribution for Dynamic Optimized Token Representation in Multimodal Large Language Models [6.467840081978855]
Multimodal large language models (MM-LLMs) have achieved significant success in various tasks.
Their main computational burden arises from processing text and visual tokens.
We propose a dynamic pruning algorithm that identifies the inflection point in the visual CLS token similarity curve.
arXiv Detail & Related papers (2024-09-02T10:49:10Z)
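A small sketch of the inflection-point pruning idea from the entry above: visual tokens are ranked by similarity to the CLS token, and the long-tailed similarity curve is cut at its sharpest bend. Approximating the inflection point with the discrete second difference is an assumption for illustration.

```python
# Hedged sketch of inflection-point pruning on the CLS-similarity curve.
import torch
import torch.nn.functional as F

def prune_at_inflection(cls_tok, tokens):
    """cls_tok: (C,), tokens: (N, C) -> indices of tokens kept."""
    sim = F.cosine_similarity(tokens, cls_tok[None, :], dim=-1)
    vals, order = sim.sort(descending=True)       # long-tailed similarity curve
    curv = vals[:-2] - 2 * vals[1:-1] + vals[2:]  # discrete 2nd difference
    cut = int(curv.argmax()) + 1                  # sharpest bend in the curve
    return order[: cut + 1].sort().values         # keep tokens before the bend

torch.manual_seed(0)
cls_tok, tokens = torch.randn(64), torch.randn(576, 64)
print(prune_at_inflection(cls_tok, tokens).shape)
```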
- VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach that reduces vision compute by letting redundant vision tokens skip layers rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
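A hedged sketch of mixture-of-depths skipping as described in the VideoLLM-MoD entry above: at an equipped layer, a learned router selects a fraction of vision tokens to actually process, while the rest bypass the layer unchanged. The linear router and the 25% capacity are illustrative assumptions.

```python
# Hedged sketch of mixture-of-depths vision computation: route only the
# top-scoring vision tokens through the layer; the rest skip it via identity.
import torch

def mod_layer(layer, router, vis, capacity=0.25):
    """vis: (B, N, C); process only the top-capacity fraction of tokens."""
    scores = router(vis).squeeze(-1)            # (B, N) routing scores
    k = max(1, int(capacity * vis.size(1)))
    idx = scores.topk(k, dim=1).indices         # tokens worth computing
    out = vis.clone()                           # skipped tokens pass through
    idx_exp = idx[..., None].expand(-1, -1, vis.size(2))
    picked = torch.gather(vis, 1, idx_exp)
    out.scatter_(1, idx_exp, layer(picked))     # write processed tokens back
    return out

layer = torch.nn.Linear(64, 64)
router = torch.nn.Linear(64, 1)
vis = torch.randn(2, 100, 64)
print(mod_layer(layer, router, vis).shape)      # torch.Size([2, 100, 64])
```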
- Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference [59.91176945361035]
We introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference.
VTW strategically withdraws vision tokens at a certain layer, enabling only text tokens to engage in subsequent layers.
Our approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
arXiv Detail & Related papers (2024-05-09T14:38:53Z)
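A minimal sketch of the visual-tokens-withdrawal idea from the VTW entry above: all vision tokens are dropped from the sequence after a chosen layer, so later layers operate on text tokens only. The fixed withdrawal index and toy layers are assumptions; the paper chooses the withdrawal layer more carefully.

```python
# Hedged sketch of visual tokens withdrawal: slice vision tokens out of the
# sequence at layer K and run the remaining layers on text tokens alone.
import torch

def forward_with_vtw(layers, hidden, n_visual, withdraw_at):
    for i, layer in enumerate(layers):
        if i == withdraw_at:
            hidden = hidden[:, n_visual:]   # withdraw all vision tokens
        hidden = layer(hidden)
    return hidden

layers = [torch.nn.Linear(16, 16) for _ in range(6)]
hidden = torch.randn(1, 580, 16)            # 576 visual + 4 text tokens
out = forward_with_vtw(layers, hidden, n_visual=576, withdraw_at=3)
print(out.shape)                            # torch.Size([1, 4, 16])
```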