ST$^3$: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming
- URL: http://arxiv.org/abs/2412.20105v1
- Date: Sat, 28 Dec 2024 10:17:29 GMT
- Title: ST$^3$: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming
- Authors: Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, Haoji Hu
- Abstract summary: $\textbf{ST}^{3}$ is a framework designed to accelerate MLLM inference without retraining.
$\textbf{ST}^{3}$ can be seamlessly integrated into existing pre-trained MLLMs.
- Score: 14.937905258757635
- Abstract: Multimodal large language models (MLLMs) enhance their perceptual capabilities by integrating visual and textual information. However, processing the massive number of visual tokens incurs a significant computational cost. Existing analysis of the MLLM attention mechanisms remains shallow, leading to coarse-grained token pruning strategies that fail to effectively balance speed and accuracy. In this paper, we conduct a comprehensive investigation of MLLM attention mechanisms with LLaVA. We find that numerous visual tokens and partial attention computations are redundant during the decoding process. Based on this insight, we propose Spatial-Temporal Visual Token Trimming ($\textbf{ST}^{3}$), a framework designed to accelerate MLLM inference without retraining. $\textbf{ST}^{3}$ consists of two primary components: 1) Progressive Visual Token Pruning (\textbf{PVTP}), which eliminates inattentive visual tokens across layers, and 2) Visual Token Annealing (\textbf{VTA}), which dynamically reduces the number of visual tokens in each layer as the generated tokens grow. Together, these techniques deliver around $\mathbf{2\times}$ faster inference with only about $\mathbf{30\%}$ KV cache memory compared to the original LLaVA, while maintaining consistent performance across various datasets. Crucially, $\textbf{ST}^{3}$ can be seamlessly integrated into existing pre-trained MLLMs, providing a plug-and-play solution for efficient inference.
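As a rough illustration of the two components named in the abstract, the sketch below pairs a per-layer top-k selection over attention scores (standing in for PVTP) with a keep ratio that decays over decoding steps (standing in for VTA). The keep-ratio values, annealing schedule, and tensor shapes are assumptions, not the paper's implementation.

```python
import torch

def pvtp_keep_indices(attn_to_visual: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the visual tokens receiving the most attention at the current layer
    (illustrative stand-in for Progressive Visual Token Pruning)."""
    num_keep = max(1, int(attn_to_visual.numel() * keep_ratio))
    return attn_to_visual.topk(num_keep).indices.sort().values

def vta_keep_ratio(base_ratio: float, step: int, decay: float = 0.9, floor: float = 0.1) -> float:
    """Shrink the keep ratio as more tokens are generated
    (illustrative stand-in for Visual Token Annealing)."""
    return max(floor, base_ratio * (decay ** step))

# Toy usage: 576 visual tokens (LLaVA-style), random attention scores.
visual_attn = torch.rand(576)
for step in range(5):                     # decoding steps
    ratio = vta_keep_ratio(0.5, step)     # fewer visual tokens as generation proceeds
    kept = pvtp_keep_indices(visual_attn, ratio)
    print(f"step {step}: keeping {kept.numel()} / 576 visual tokens")
```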
Related papers
- [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs [66.5266435598799]
Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance across a wide range of vision tasks.
However, their efficient deployment remains a substantial challenge due to high computational costs and memory requirements.
We introduce a simple yet effective method for training-free visual token compression, called VTC-compression.
arXiv Detail & Related papers (2024-12-08T05:29:39Z)
- A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs [65.00970402080351]
A promising approach to accelerating large vision-language models (VLMs) is using partial information, such as attention maps from specific layers, to assess token importance and prune less essential tokens.
Our study reveals three key insights: (i) partial attention information is insufficient for accurately identifying critical visual tokens, resulting in suboptimal performance, especially at low token retention ratios; (ii) global attention information, such as the attention map aggregated across all layers, more effectively preserves essential tokens and maintains comparable performance under aggressive pruning; and (iii) the global attention map aggregated from a small VLM closely resembles that of a large VLM, so a small VLM can guide pruning for a larger one.
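A minimal sketch of insights (ii) and (iii), assuming the "global" map is a simple mean over layers and query positions of a smaller VLM's attention; the shapes, the aggregation rule, and the retention ratio are illustrative assumptions rather than the paper's procedure.

```python
import torch

def global_visual_importance(attn_maps: list[torch.Tensor]) -> torch.Tensor:
    """attn_maps: one [num_queries, num_visual_tokens] map per layer
    (e.g. taken from a small VLM). Returns one score per visual token."""
    stacked = torch.stack(attn_maps)          # [layers, queries, visual]
    return stacked.mean(dim=(0, 1))           # average over layers and queries

# Toy usage: 8 layers of a small VLM attending to 576 visual tokens.
maps = [torch.rand(32, 576) for _ in range(8)]
scores = global_visual_importance(maps)
keep = scores.topk(int(576 * 0.25)).indices   # aggressive 25% retention
print(keep.shape)
```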
arXiv Detail & Related papers (2024-12-04T13:56:44Z)
- [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster [26.025260449905577]
Existing methods assess the importance of visual tokens based on the text-visual cross-attention in large language models (LLMs).
We introduce FasterVLM, a training-free visual token pruning method that evaluates the importance of visual tokens more accurately.
FasterVLM can prune 95% of visual tokens while maintaining 90% of the performance of LLaVA-1.5-7B.
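The summary does not spell out the mechanism, but the title suggests ranking image patches by the vision encoder's [CLS]-to-patch attention rather than LLM cross-attention. The sketch below shows that general idea; the 95% pruning ratio is taken from the summary above, while the shapes and token counts are assumptions.

```python
import torch

def prune_by_cls_attention(cls_attn: torch.Tensor, patch_tokens: torch.Tensor,
                           prune_ratio: float = 0.95) -> torch.Tensor:
    """cls_attn: [num_patches] attention from the [CLS] token to each patch.
    patch_tokens: [num_patches, dim] visual features that would be fed to the LLM."""
    num_keep = max(1, int(patch_tokens.shape[0] * (1 - prune_ratio)))
    keep = cls_attn.topk(num_keep).indices.sort().values
    return patch_tokens[keep]

tokens = torch.randn(576, 1024)   # e.g. 24x24 patches from a CLIP-style encoder
cls_attn = torch.rand(576)
print(prune_by_cls_attention(cls_attn, tokens).shape)  # 28 of 576 patches survive
```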
arXiv Detail & Related papers (2024-12-02T18:57:40Z)
- Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction [62.8375542401319]
Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone.
The number of vision tokens grows quadratically with the image resolution, leading to huge computational costs.
We propose a greedy search algorithm (G-Search) to find the least number of vision tokens to keep at each layer from the shallow to the deep.
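A toy version of the greedy idea described above: walk the layers from shallow to deep and shrink each layer's vision-token budget while a proxy metric stays within a tolerance. The halving rule, the tolerance, and the dummy metric are all assumptions standing in for the paper's actual search criterion.

```python
from typing import Callable, List

def greedy_token_budgets(num_layers: int, full_budget: int,
                         evaluate: Callable[[List[int]], float],
                         tolerance: float = 0.01) -> List[int]:
    budgets = [full_budget] * num_layers
    baseline = evaluate(budgets)
    for layer in range(num_layers):                 # shallow -> deep
        while budgets[layer] > 1:
            trial = budgets.copy()
            trial[layer] = budgets[layer] // 2      # halve greedily
            if baseline - evaluate(trial) <= tolerance:
                budgets = trial
            else:
                break
    return budgets

# Dummy metric: accuracy degrades once deep layers keep very few tokens.
dummy = lambda b: 1.0 - sum(0.02 for x in b[len(b) // 2:] if x < 8)
print(greedy_token_budgets(num_layers=8, full_budget=576, evaluate=dummy))
```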
arXiv Detail & Related papers (2024-11-30T18:54:32Z)
- Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [69.35226485836641]
The excessive visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibit obvious redundancy and bring in prohibitively expensive computation.
We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE).
DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
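A hedged sketch of the decision described above: a lightweight network reads the text-token states at a layer and decides whether all visual tokens can be withdrawn from that point on. The MLP width, mean pooling, and 0.5 threshold are assumptions; DyVTE's actual hyper-network design is not given in the summary.

```python
import torch
import torch.nn as nn

class VisualExitGate(nn.Module):
    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, text_hidden: torch.Tensor) -> bool:
        """text_hidden: [num_text_tokens, hidden_dim] states at the current layer."""
        score = torch.sigmoid(self.mlp(text_hidden.mean(dim=0)))
        return bool(score > 0.5)            # True -> withdraw all visual tokens

gate = VisualExitGate()
text_states = torch.randn(32, 4096)
if gate(text_states):
    print("visual tokens removed for all remaining layers")
else:
    print("visual tokens kept at this layer")
```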
arXiv Detail & Related papers (2024-11-29T11:24:23Z)
- Inference Optimal VLMs Need Only One Visual Token but Larger Models [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks.
VLMs are often constrained by high latency during inference due to the substantial compute required to process the large number of input tokens.
We take some initial steps towards building approaches tailored for high token compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
- Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See [37.7015406019386]
Multimodal Large Language Models (MLLMs) treat visual tokens from visual encoders as text tokens.
As token counts grow, the quadratic scaling of computation in LLMs introduces an efficiency bottleneck.
In this study, we investigate the redundancy in visual computation at both the parameter and computational pattern levels within LLaVA.
arXiv Detail & Related papers (2024-10-08T16:13:24Z)
- SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference [45.11612407862277]
In vision-language models (VLMs), visual tokens usually incur significant computational overhead despite carrying sparser information than text tokens.
We propose a text-guided, training-free token optimization mechanism dubbed SparseVLM that eliminates the need for extra parameters or fine-tuning.
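The summary only says the mechanism is text-guided and training-free, so the sketch below shows one plausible reading: score each visual token by the cross-attention it receives from prompt text tokens and keep the above-average ones. The mean-threshold rule is an assumption, not SparseVLM's actual procedure.

```python
import torch

def text_guided_prune(cross_attn: torch.Tensor) -> torch.Tensor:
    """cross_attn: [num_text_tokens, num_visual_tokens] attention weights.
    Returns indices of visual tokens to keep."""
    relevance = cross_attn.mean(dim=0)                 # text-guided score per visual token
    return (relevance >= relevance.mean()).nonzero(as_tuple=True)[0]

attn = torch.rand(24, 576)          # 24 prompt tokens, 576 visual tokens
print(text_guided_prune(attn).numel(), "visual tokens kept")
```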
arXiv Detail & Related papers (2024-10-06T09:18:04Z)
- Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference [59.91176945361035]
We introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference.
VTW strategically withdraws vision tokens at a certain layer, enabling only text tokens to engage in subsequent layers.
Our approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
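A minimal sketch of the withdrawal idea, assuming a generic stack of decoder layers: after a chosen layer, only the text tokens are forwarded. The toy linear layers and the choice of withdrawal layer are placeholders; the real module plugs into a pre-trained MLLM.

```python
import torch
import torch.nn as nn

def forward_with_vtw(layers: nn.ModuleList, vision: torch.Tensor,
                     text: torch.Tensor, withdraw_layer: int) -> torch.Tensor:
    hidden = torch.cat([vision, text], dim=0)          # [num_vision + num_text, dim]
    for i, layer in enumerate(layers):
        if i == withdraw_layer:
            hidden = hidden[vision.shape[0]:]          # keep only text tokens from here on
        hidden = layer(hidden)
    return hidden

# Toy stack of linear layers for illustration.
layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(6)])
out = forward_with_vtw(layers, torch.randn(576, 64), torch.randn(32, 64), withdraw_layer=2)
print(out.shape)   # torch.Size([32, 64]) -> only text tokens remain after layer 2
```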
arXiv Detail & Related papers (2024-05-09T14:38:53Z)