GridPrune: From "Where to Look" to "What to Select" in Visual Token Pruning for MLLMs
- URL: http://arxiv.org/abs/2511.10081v1
- Date: Fri, 14 Nov 2025 01:30:53 GMT
- Title: GridPrune: From "Where to Look" to "What to Select" in Visual Token Pruning for MLLMs
- Authors: Yuxiang Duan, Ao Li, Yingqin Li, Luyu Li, Pengwei Wang
- Abstract summary: Multimodal large language models (MLLMs) have shown remarkable capabilities in a wide range of vision-language tasks. Visual token pruning has emerged as a key technique for enhancing the efficiency of MLLMs. We propose GridPrune, a method that replaces the global Top-K mechanism with a "guide-globally, select-locally" zonal selection system.
- Score: 2.9869094956508495
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (MLLMs) have shown remarkable capabilities in a wide range of vision-language tasks. However, the large number of visual tokens introduces significant computational overhead. To address this issue, visual token pruning has emerged as a key technique for enhancing the efficiency of MLLMs. In cognitive science, humans tend to first determine which regions of a scene to attend to ("where to look") before deciding which specific elements within those regions to process in detail ("what to select"). This two-stage strategy enables the visual system to efficiently allocate attention at a coarse spatial level before performing fine-grained selection. However, existing pruning methods primarily focus on directly optimizing "what to select", typically using attention scores or similarity metrics. They rarely consider "where to look", which has been shown to lead to inefficient spatial allocation, positional bias, and the retention of irrelevant or redundant tokens. In this paper, we propose GridPrune, a method that replaces the global Top-K mechanism with a "guide-globally, select-locally" zonal selection system. GridPrune splits the pruning process into two steps: first, it uses text-conditional guidance to dynamically allocate a token budget across spatial zones; and then, it performs local selection within each budgeted zone. Experimental results demonstrate that GridPrune achieves superior performance across various MLLM architectures. On LLaVA-NeXT-7B, GridPrune retains 96.98% of the full performance while using 11.1% of the tokens, outperforming the best-performing baseline by 2.34% at the same pruning rate.
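The two steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state how the budget is allocated or how tokens are scored, so the code below assumes that a non-negative, text-conditioned relevance score per visual token is already available, that each zone's budget is proportional to the sum of its scores, and that rounding uses the largest-remainder rule. The function name `zonal_prune` and all parameter names are hypothetical.

```python
import numpy as np

def zonal_prune(scores, grid_hw, zone_hw, budget):
    """Keep `budget` tokens from an H x W token grid, zone by zone.

    Assumptions (not from the paper): `scores` are non-negative relevance
    scores, and zone budgets are proportional to summed zone scores.
    """
    H, W = grid_hw
    zh, zw = zone_hw
    assert H % zh == 0 and W % zw == 0, "zones must tile the grid"
    assert 0 < budget <= H * W
    flat = np.asarray(scores, dtype=float).ravel()
    assert flat.size == H * W and flat.min() >= 0 and flat.sum() > 0

    # Map every token index to the zone that contains it.
    rows, cols = np.divmod(np.arange(H * W), W)
    zone_id = (rows // zh) * (W // zw) + (cols // zw)
    n_zones = (H // zh) * (W // zw)
    zone_size = zh * zw

    # Step 1, "where to look": split the budget across zones.
    zone_weight = np.bincount(zone_id, weights=flat, minlength=n_zones)
    ideal = budget * zone_weight / zone_weight.sum()
    alloc = np.minimum(np.floor(ideal).astype(int), zone_size)
    remainder = ideal - np.floor(ideal)
    while alloc.sum() < budget:
        # Hand leftover slots to the non-full zone with the largest remainder.
        open_zones = np.flatnonzero(alloc < zone_size)
        best = open_zones[np.argmax(remainder[open_zones])]
        alloc[best] += 1
        remainder[best] -= 1.0

    # Step 2, "what to select": local top-k inside each budgeted zone.
    keep = []
    for z in range(n_zones):
        if alloc[z] > 0:
            idx = np.flatnonzero(zone_id == z)
            order = np.argsort(flat[idx])[::-1]
            keep.append(idx[order[:alloc[z]]])
    return np.sort(np.concatenate(keep)), alloc

# Toy 4 x 4 token grid split into four 2 x 2 zones.
toy_scores = np.array([
    [9, 8, 1, 0],
    [7, 6, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 5],
])
kept, per_zone = zonal_prune(toy_scores, grid_hw=(4, 4), zone_hw=(2, 2), budget=4)
global_topk = np.sort(np.argsort(toy_scores.ravel())[::-1][:4])
```

In this toy case a global Top-K would spend all four slots on the top-left zone, whereas the zonal variant reserves one slot for the bottom-right zone, which is the kind of spatial spread the abstract argues for.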
Related papers
- GeoToken: Hierarchical Geolocalization of Images via Next Token Prediction [23.767061975974134]
We propose a hierarchical sequence prediction approach inspired by how humans narrow down locations from broad regions to specific addresses. Our method uses S2 cells, a nested, multiresolution global grid, and sequentially predicts finer-level cells conditioned on visual inputs and previous predictions. We evaluate our method on the Im2GPS3k and YFCC4k datasets against two distinct sets of baselines.
arXiv Detail & Related papers (2025-11-02T21:30:06Z)
- SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodel LLMs [59.415473779171315]
We propose a novel visual token pruning strategy called Saliency-Coverage Oriented token Pruning for Efficient MLLMs (SCOPE).
arXiv Detail & Related papers (2025-10-28T09:29:37Z)
- Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention [50.97683288777336]
Multimodal Large Language Models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens. Recent studies have explored token pruning to alleviate this problem, which typically uses text-vision cross-attention. We propose HoloV, a plug-and-play visual token pruning framework for efficient inference.
arXiv Detail & Related papers (2025-10-03T11:33:40Z)
- HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score [14.857585045577165]
HIVTP is a training-free method to improve Vision-Language Models (VLMs) inference efficiency. We propose a hierarchical visual token pruning method to retain both globally and locally important visual tokens. Experimental results show that our proposed method, HIVTP, can reduce the time-to-first-token (TTFT) of LLaVA-v1.5-7B and LLaVA-Next-7B by up to 50.0% and 55.1%, respectively.
arXiv Detail & Related papers (2025-09-28T05:53:39Z)
- Inter2Former: Dynamic Hybrid Attention for Efficient High-Precision Interactive [58.0729162588429]
Interactive segmentation improves annotation efficiency by segmenting target regions from user prompts. Current approaches face a critical trade-off: dense-token methods achieve superior accuracy but suffer from prohibitively slow processing on CPU devices. We propose Inter2Former to address this challenge by optimizing computation allocation in dense-token processing.
arXiv Detail & Related papers (2025-07-13T12:33:37Z)
- CROP: Contextual Region-Oriented Visual Token Pruning [9.099029419132775]
Contextual Region-Oriented Visual Token Pruning (CROP) is a novel framework to compress visual tokens. Two distinct strategies are introduced for pruning: (1) Pre-LLM Compression (PLC), which adaptively compresses different image regions with varying ratios, and (2) Inner-LLM Pruning (ILP), a training-free method that prunes tokens within early layers guided by the identified contextual region.
arXiv Detail & Related papers (2025-05-27T14:16:52Z)
- Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models [85.51753014478315]
We introduce AdaptPrune, a novel plug-and-play training-free pruning method. It builds on conventional attention-based pruning by integrating spatial distance and token similarity with an adaptive NMS approach. Our approach ensures a comprehensive evaluation of token importance and substantially refines the pruning decisions.
arXiv Detail & Related papers (2025-03-11T03:58:17Z)
- Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction [62.8375542401319]
Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone. The number of vision tokens increases quadratically with image resolution, leading to huge computational costs. We propose a greedy search algorithm (G-Search) to find the smallest number of vision tokens to keep at each layer, from shallow to deep.
arXiv Detail & Related papers (2024-11-30T18:54:32Z)
- TokenPacker: Efficient Visual Projector for Multimodal LLM [37.1071749188282]
The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM).
We propose a novel visual projector, which adopts a coarse-to-fine scheme to inject the enriched characteristics to generate the condensed visual tokens.
Our approach compresses the visual tokens by 75% to 89%, while achieving comparable or even better performance across diverse benchmarks.
arXiv Detail & Related papers (2024-07-02T16:10:55Z)
- Dynamic Focus-aware Positional Queries for Semantic Segmentation [94.6834904076914]
We propose a simple yet effective query design for semantic segmentation termed Dynamic Focus-aware Positional Queries.
Our framework achieves SOTA performance and outperforms Mask2former by clear margins of 1.1%, 1.9%, and 1.1% single-scale mIoU with ResNet-50, Swin-T, and Swin-B backbones.
arXiv Detail & Related papers (2022-04-04T05:16:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.