Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning
- URL: http://arxiv.org/abs/2602.02951v1
- Date: Tue, 03 Feb 2026 00:51:03 GMT
- Title: Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning
- Authors: Yihong Huang, Fei Ma, Yihua Shao, Jingcai Guo, Zitong Yu, Laizhong Cui, Qi Tian
- Abstract summary: Vision token pruning has proven to be an effective acceleration technique for efficient Vision Language Models (VLMs). We propose Nüwa, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity. Experiments demonstrate that Nüwa achieves SOTA performance on multiple VQA benchmarks (from 94% to 95%) and yields substantial improvements on visual grounding tasks (from 7% to 47%).
- Score: 82.39668822222386
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision token pruning has proven to be an effective acceleration technique for efficient Vision Language Models (VLMs). However, existing pruning methods that preserve performance well on visual question answering (VQA) suffer substantial degradation on visual grounding (VG) tasks. Our analysis of the VLM's processing pipeline reveals that strategies relying on global semantic similarity and attention scores lose the global spatial reference frame, which is derived from the interactions of tokens' positional information. Motivated by these findings, we propose $\text{Nüwa}$, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity. In the first stage, after the vision encoder, we apply three operations, namely separation, alignment, and aggregation, inspired by swarm intelligence algorithms, to retain information-rich global spatial anchors. In the second stage, within the LLM, we perform text-guided pruning to retain task-relevant visual tokens. Extensive experiments demonstrate that $\text{Nüwa}$ achieves SOTA performance on multiple VQA benchmarks (from 94% to 95%) and yields substantial improvements on visual grounding tasks (from 7% to 47%).
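The abstract only outlines the two stages, so the toy sketch below illustrates the general shape of such a pipeline in PyTorch: a first pass that keeps information-rich tokens spread across the spatial grid (a stand-in for the paper's separation/alignment/aggregation operations), followed by a text-guided pass that keeps the visual tokens most attended by the query. All function names, scoring heuristics, and tensor shapes here are illustrative assumptions, not Nüwa's actual algorithm.

```python
import torch
import torch.nn.functional as F

def stage1_spatial_anchors(vis_tokens, grid_hw, keep_ratio=0.5):
    """Stage-1 sketch (assumption, not the paper's method): keep tokens that are
    both information-rich and spatially spread out, so the retained set still
    spans the image and can act as a set of global spatial anchors.

    vis_tokens: (N, D) visual tokens from the vision encoder.
    grid_hw:    (H, W) spatial grid the N = H*W tokens came from.
    """
    H, W = grid_hw
    N, _ = vis_tokens.shape
    keep = max(1, int(N * keep_ratio))
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)

    score = vis_tokens.norm(dim=-1)            # crude information-richness proxy
    selected = [int(score.argmax())]           # start from the richest token
    dist = torch.cdist(pos, pos[selected])     # (N, 1) distance to the kept set
    for _ in range(keep - 1):
        # trade off richness against distance to already-kept anchors
        combined = score + dist.min(dim=-1).values
        combined[selected] = -float("inf")
        nxt = int(combined.argmax())
        selected.append(nxt)
        dist = torch.minimum(dist.min(dim=-1, keepdim=True).values,
                             torch.cdist(pos, pos[[nxt]]))
    idx = torch.tensor(sorted(selected))
    return vis_tokens[idx], idx

def stage2_text_guided(vis_tokens, text_tokens, keep_ratio=0.5):
    """Stage-2 sketch: keep the visual tokens most attended by the text query."""
    attn = F.softmax(text_tokens @ vis_tokens.T / vis_tokens.shape[-1] ** 0.5, dim=-1)
    relevance = attn.mean(dim=0)               # average relevance over text tokens
    keep = max(1, int(vis_tokens.shape[0] * keep_ratio))
    idx = relevance.topk(keep).indices.sort().values
    return vis_tokens[idx], idx

if __name__ == "__main__":
    vis = torch.randn(24 * 24, 1024)           # 576 patch tokens (toy sizes)
    txt = torch.randn(12, 1024)                # 12 text tokens
    anchors, _ = stage1_spatial_anchors(vis, (24, 24), keep_ratio=0.25)
    pruned, _ = stage2_text_guided(anchors, txt, keep_ratio=0.5)
    print(vis.shape, "->", anchors.shape, "->", pruned.shape)
```

The key design point the abstract emphasizes is that the first stage selects by spatial coverage, not just by attention or semantic similarity, so the pruned set still carries a usable spatial reference frame for grounding.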
Related papers
- IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning [27.75049214892312]
Large Vision-Language Models (LVLMs) achieve impressive performance across multiple tasks. A significant challenge, however, is their prohibitive inference cost when processing high-resolution visual inputs. We propose IVC-Prune, a training-free, prompt-aware pruning strategy that retains both IVC tokens and semantically relevant foreground tokens.
arXiv Detail & Related papers (2026-02-03T03:39:31Z)
- Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention [50.97683288777336]
Multimodal Large Language Models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens. Recent studies have explored token pruning, typically based on text-vision cross-attention, to alleviate this problem. We propose HoloV, a plug-and-play visual token pruning framework for efficient inference.
arXiv Detail & Related papers (2025-10-03T11:33:40Z)
- HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score [14.857585045577165]
HIVTP is a training-free method to improve the inference efficiency of Vision-Language Models (VLMs). We propose a hierarchical visual token pruning method to retain both globally and locally important visual tokens. Experimental results show that our proposed method, HIVTP, can reduce the time-to-first-token (TTFT) of LLaVA-v1.5-7B and LLaVA-Next-7B by up to 50.0% and 55.1%, respectively.
arXiv Detail & Related papers (2025-09-28T05:53:39Z)
- Eye Gaze Tells You Where to Compute: Gaze-Driven Efficient VLMs [1.985072438058346]
We propose GazeVLM, a training-free framework that uses human eye gaze as a natural supervisory signal to allocate computation where it matters. Our results show that aligning model computation with human gaze offers a simple, plug-and-play path toward efficient VLM inference on consumer devices.
arXiv Detail & Related papers (2025-09-20T00:16:48Z)
- HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models [60.028070589466445]
We propose HERO, a framework that integrates content-adaptive token budget allocation with function-aware token selection. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.
arXiv Detail & Related papers (2025-09-16T13:22:08Z)
- Event-Priori-Based Vision-Language Model for Efficient Visual Understanding [13.540340702321911]
The Event-Priori-Based Vision-Language Model (EP-VLM) uses motion priors derived from dynamic event vision to improve VLM inference efficiency.
arXiv Detail & Related papers (2025-06-09T10:45:35Z)
- ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [68.46716645478661]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for evaluating multi-viewpoint spatial localization.
arXiv Detail & Related papers (2025-05-27T17:59:26Z)
- Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation [109.5893580175657]
In recent times, the standard practice for developing MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. This approach often causes models to lean towards language comprehension and undermine the rich visual perception signals present in the data. We propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the LLM's hidden representations.
arXiv Detail & Related papers (2024-12-12T18:55:18Z)
- A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs [65.00970402080351]
A promising approach to accelerating large vision-language models (VLMs) is using partial information, such as attention maps from specific layers, to assess token importance and prune less essential tokens. Our study reveals three key insights: (i) partial attention information is insufficient for accurately identifying critical visual tokens, resulting in suboptimal performance, especially at low token retention ratios; (ii) global attention information, such as the attention map aggregated across all layers, more effectively preserves essential tokens and maintains comparable performance under aggressive pruning; and (iii) the global attention map aggregated from a small VLM closely resembles that of a large VLM. A minimal sketch of this layer-aggregated scoring idea appears after this list.
arXiv Detail & Related papers (2024-12-04T13:56:44Z)
- Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding [33.33424214458285]
Vision language models (VLMs) have demonstrated remarkable performance across various downstream tasks.
However, understanding fine-grained visual-linguistic concepts, such as attributes and inter-object relationships, remains a significant challenge.
We introduce a progressive pipeline to synthesize images that vary in a specific attribute while ensuring consistency in all other aspects.
arXiv Detail & Related papers (2023-11-30T03:20:37Z)
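As referenced in the "A Stitch in Time Saves Nine" entry above, that paper's key observation is that an attention map aggregated across all layers (potentially taken from a small guide VLM) scores visual-token importance better than attention from any single layer. The sketch below illustrates only that aggregation-and-top-k step; the tensor shapes, the small-VLM source of the maps, and the retention rule are illustrative assumptions, not the paper's released code.

```python
import torch

def aggregate_attention_scores(attn_maps, vis_slice, keep_ratio=0.25):
    """Hypothetical sketch of layer-aggregated attention scoring for token pruning.

    attn_maps: list of per-layer attention tensors, each (heads, Q, K),
               e.g. as exposed by a small guide model run with attention outputs enabled.
    vis_slice: slice of the key axis corresponding to the visual tokens.
    Returns the indices of the visual tokens to keep.
    """
    # average attention each visual token receives, across heads and query positions
    per_layer = [a[:, :, vis_slice].mean(dim=(0, 1)) for a in attn_maps]  # each (num_vis,)
    # then aggregate across all layers into one global importance score
    global_score = torch.stack(per_layer).mean(dim=0)
    keep = max(1, int(global_score.numel() * keep_ratio))
    return global_score.topk(keep).indices.sort().values

if __name__ == "__main__":
    # toy example: 4 layers, 8 heads, 600 queries, keys = 24 text + 576 visual tokens
    attns = [torch.rand(8, 600, 600).softmax(dim=-1) for _ in range(4)]
    keep_idx = aggregate_attention_scores(attns, slice(24, 600), keep_ratio=0.25)
    print(keep_idx.shape)  # 144 visual token indices retained
```

If the small-VLM and large-VLM global attention maps are indeed similar, the indices computed this way from the small model could be reused to prune the large model's visual tokens before its expensive forward pass.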