Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
- URL: http://arxiv.org/abs/2509.01552v1
- Date: Mon, 01 Sep 2025 15:28:44 GMT
- Title: Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
- Authors: Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, Honggang Chen,
- Abstract summary: Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks.<n> Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency.<n>We propose Variation-aware Vision Token Dropping (textiti.e., textbfV$2$Drop), which progressively removes visual tokens with minimal variation during LVLM inference.
- Score: 24.952668143243542
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, leading to reduced inference efficiency. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency. Through extensive analysis, we identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators, which hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (\textit{i.e.}, \textbf{V$^2$Drop}), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks demonstrate that our V$^2$Drop is able to maintain \textbf{94.0\%} and \textbf{98.6\%} of the original model performance for image and video understanding tasks respectively, while reducing LLM generation latency by \textbf{31.5\%} and \textbf{74.2\%}. When combined with efficient operators, V$^2$Drop further reduces GPU peak memory usage.
Related papers
- ViTCoP: Accelerating Large Vision-Language Models via Visual and Textual Semantic Collaborative Pruning [8.933549837045932]
Large Vision-Language Models incur high computational costs due to significant redundancy in their visual tokens.<n>We propose a Visual and Textual Semantic Collaborative Pruning framework (ViTCoP) that combines redundancy filtering in the vision encoder with step-wise co-pruning within the Large Language Models.
arXiv Detail & Related papers (2026-01-25T12:47:30Z) - VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning [95.89543460132413]
Vision-language models (VLMs) have improved performance by increasing the number of visual tokens.<n>However, most real-world scenarios do not require such an extensive number of visual tokens.<n>We present a new paradigm for visual token compression, namely, VisionThink.
arXiv Detail & Related papers (2025-07-17T17:59:55Z) - ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models [59.47738955960352]
ToDRE is a two-stage and training-free token compression framework.<n>It achieves superior performance by pruning tokens based on token Diversity and token-task RElevance.
arXiv Detail & Related papers (2025-05-24T15:47:49Z) - DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs)<n>Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity.<n>Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence.
arXiv Detail & Related papers (2025-04-23T18:38:18Z) - InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression [1.8893427856534721]
We propose InternVL-X, which outperforms the InternVL model in both performance and efficiency.<n>By utilizing 20% or fewer visual tokens, InternVL-X achieves state-of-the-art performance on 7 public MLLM benchmarks, and improves the average metric by 2.34% across 12 tasks.
arXiv Detail & Related papers (2025-03-27T09:31:35Z) - Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping [13.846838416902575]
A key bottleneck stems from the proliferation of visual tokens required for fine-grained image understanding.<n>We propose Skip-Vision, a unified framework addressing both training and inference inefficiencies in vision-language models.<n> Experimental results demonstrate that Skip-Vision reduces training time by up to 35%, inference FLOPs by 75%, and latency by 45%.
arXiv Detail & Related papers (2025-03-26T04:16:48Z) - Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
High-resolution images and videos pose a barrier to their broader adoption.<n> compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs.<n>We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z) - Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks.<n>To reduce inference costs, one can either downsize the Large Language Models (LLMs) or reduce the number of input tokens needed to represent the image.<n>We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z) - Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving [9.900979396513687]
Multimodal large language models (MLLMs) have demonstrated remarkable potential for enhancing scene understanding in autonomous driving systems.
One major limitation arises from the large number of visual tokens required to capture fine-grained and long-context visual information.
We propose Video Token Sparsification (VTS) to significantly reduce the total number of visual tokens while preserving the most salient information.
arXiv Detail & Related papers (2024-09-16T05:31:01Z) - VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.