Related papers: HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score

URL: http://arxiv.org/abs/2509.23663v1
Date: Sun, 28 Sep 2025 05:53:39 GMT
Title: HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score
Authors: Jingqi Xu, Jingxi Lu, Chenghao Li, Sreetama Sarkar, Peter A. Beerel
Abstract summary: HIVTP is a training-free method to improve the inference efficiency of Vision-Language Models (VLMs). We propose a hierarchical visual token pruning method to retain both globally and locally important visual tokens. Experimental results show that our proposed method, HIVTP, can reduce the time-to-first-token (TTFT) of LLaVA-v1.5-7B and LLaVA-Next-7B by up to 50.0% and 55.1%, respectively.
Score: 14.857585045577165
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language Models (VLMs) have shown strong capabilities on diverse multimodal tasks. However, the large number of visual tokens output by the vision encoder severely hinders inference efficiency, and prior studies have shown that many of these tokens are not important and can therefore be safely pruned. In this work, we propose HIVTP, a training-free method to improve VLMs efficiency via hierarchical visual token pruning using a novel middle-layer-based importance score. Specifically, we utilize attention maps extracted from the middle layers of the vision encoder, which better reflect fine-grained and object-level attention, to estimate visual token importance. Based on this, we propose a hierarchical visual token pruning method to retain both globally and locally important visual tokens. Specifically, we reshape the 1-D visual token sequence output by the vision encoder into a 2-D spatial layout. In the global retaining stage, we divide the image into regions and retain tokens with higher importance scores in each region; in the local retaining stage, we then divide the image into small windows and retain the most important token in each local window. Experimental results show that our proposed method, HIVTP, can reduce the time-to-first-token (TTFT) of LLaVA-v1.5-7B and LLaVA-Next-7B by up to 50.0% and 55.1%, respectively, and improve the token generation throughput by up to 60.9% and 47.3%, without sacrificing accuracy, and even achieving improvements on certain benchmarks. Compared with prior works, HIVTP achieves better accuracy while offering higher inference efficiency.
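The two retaining stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `hierarchical_prune`, its parameters, and the choice to combine the two stages by taking the union of their retained tokens are all assumptions made for illustration. The importance scores are taken as given; in the paper they come from middle-layer attention maps of the vision encoder, which this sketch does not compute.

```python
from typing import List, Set


def hierarchical_prune(scores: List[float], grid: int, region: int,
                       keep_per_region: int, window: int) -> List[int]:
    """Return sorted indices of retained tokens (hypothetical helper).

    scores: per-token importance for a 1-D sequence of grid*grid tokens,
            interpreted row-major as a grid x grid spatial layout.
    region: side length of each square region in the global stage.
    keep_per_region: number of top-scoring tokens kept per region.
    window: side length of each square window in the local stage.
    """
    assert len(scores) == grid * grid
    assert grid % region == 0 and grid % window == 0
    kept: Set[int] = set()

    def block_indices(r0: int, c0: int, size: int) -> List[int]:
        # Flat indices of the size x size block whose top-left cell is (r0, c0).
        return [r * grid + c
                for r in range(r0, r0 + size)
                for c in range(c0, c0 + size)]

    # Global retaining stage: keep the highest-scoring tokens in each region.
    for r0 in range(0, grid, region):
        for c0 in range(0, grid, region):
            idx = block_indices(r0, c0, region)
            idx.sort(key=lambda i: scores[i], reverse=True)
            kept.update(idx[:keep_per_region])

    # Local retaining stage: keep the single best token in each small window.
    for r0 in range(0, grid, window):
        for c0 in range(0, grid, window):
            idx = block_indices(r0, c0, window)
            kept.add(max(idx, key=lambda i: scores[i]))

    # Assumption: the final token set is the union of both stages.
    return sorted(kept)


if __name__ == "__main__":
    toy_scores = [float(i) for i in range(16)]  # 4 x 4 grid, score = index
    print(hierarchical_prune(toy_scores, grid=4, region=4,
                             keep_per_region=2, window=2))
```

On the toy 4 x 4 grid, the global stage keeps the two highest-scoring tokens overall and the local stage keeps one token per 2 x 2 window, so every part of the image stays represented even where scores are low. The region size, window size, and per-region budget used here are illustrative values, not settings reported in the paper.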

Related papers

Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning [82.39668822222386]
Vision token pruning has proven to be an effective acceleration technique for the efficient Vision Language Model (VLM). We propose Nüwa, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity. Experiments demonstrate that Nüwa achieves SOTA performance on multiple VQA benchmarks (from 94% to 95%) and yields substantial improvements on visual grounding tasks (from 7% to 47%).
arXiv Detail & Related papers (2026-02-03T00:51:03Z)
All You Need Are Random Visual Tokens? Demystifying Token Pruning in VLLMs [43.80391827200227]
In deep layers, existing training-free pruning methods perform no better than random pruning. Visual tokens progressively lose their salience with increasing network depth. We show that simple random pruning in deep layers efficiently balances performance and efficiency.
arXiv Detail & Related papers (2025-12-08T14:16:01Z)
Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention [50.97683288777336]
Multimodal Large Language Models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens. Recent studies have explored token pruning to alleviate this problem, which typically uses text-vision cross-attention. We propose HoloV, a plug-and-play visual token pruning framework for efficient inference.
arXiv Detail & Related papers (2025-10-03T11:33:40Z)
HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models [6.306822764683807]
HiPrune is a training-free and model-agnostic token Pruning framework for vision encoders. It exploits the Hierarchical attention structure within vision encoders. It preserves up to 99.3% task accuracy with only 33.3% of the tokens, and maintains 99.5% accuracy with just 11.1% of the tokens.
arXiv Detail & Related papers (2025-08-01T11:48:11Z)
VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models [57.2662376527586]
VScan is a two-stage visual token reduction framework. It addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. VScan achieves a 2.91$\times$ speedup in prefilling and a 10$\times$ reduction in FLOPs, while retaining 95.4% of the original performance.
arXiv Detail & Related papers (2025-05-28T17:59:08Z)
Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization [30.73986620551153]
Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. Previous approaches have attempted to reduce the number of image tokens through token pruning. We propose Balanced Token Pruning (BTP), a plug-and-play method for pruning vision tokens.
arXiv Detail & Related papers (2025-05-28T07:00:50Z)
ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models [59.47738955960352]
ToDRE is a two-stage and training-free token compression framework. It achieves superior performance by pruning tokens based on token Diversity and token-task RElevance.
arXiv Detail & Related papers (2025-05-24T15:47:49Z)
InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression [1.8893427856534721]
We propose InternVL-X, which outperforms the InternVL model in both performance and efficiency. By utilizing 20% or fewer visual tokens, InternVL-X achieves state-of-the-art performance on 7 public MLLM benchmarks, and improves the average metric by 2.34% across 12 tasks.
arXiv Detail & Related papers (2025-03-27T09:31:35Z)
PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models [48.31941033266855]
We propose Per-Layer Per-Head Vision Token Pruning (PLPHP), a two-level fine-grained pruning method. PLPHP applies pruning at the attention head level, enabling different heads within the same layer to independently retain critical context. Experiments on multiple benchmarks demonstrate that PLPHP delivers an 18% faster decoding speed and reduces the Key-Value Cache (KV Cache) size by over 50%.
arXiv Detail & Related papers (2025-02-20T12:31:31Z)
Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction [62.8375542401319]
Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone. The number of vision tokens increases quadratically with the image resolution, leading to huge computational costs. We propose a greedy search algorithm (G-Search) to find the smallest number of vision tokens to keep at each layer, from shallow to deep.
arXiv Detail & Related papers (2024-11-30T18:54:32Z)
FoPru: Focal Pruning for Efficient Large Vision-Language Models [11.36025001578531]
We propose Focal Pruning (FoPru), a training-free method that prunes visual tokens based on the attention-based token significance derived from the vision encoder. Our method can prune a large number of redundant tokens while maintaining high accuracy, leading to significant improvements in inference efficiency.
arXiv Detail & Related papers (2024-11-21T14:22:38Z)
AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance. We propose to apply adaptive resolution for different regions in the image according to their importance. We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.