Related papers: VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration

URL: http://arxiv.org/abs/2601.22674v2
Date: Mon, 02 Feb 2026 09:21:10 GMT
Title: VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration
Authors: Hanxun Yu, Wentong Li, Xuan Qu, Song Wang, Junbo Chen, Jianke Zhu,
Abstract summary: Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens. We propose VisionTrim, a unified framework for training-free MLLM acceleration.
Score: 31.27071437510817
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration, integrating two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the performance superiority of our VisionTrim, advancing practical MLLM deployment in real-world applications. The code is available at: https://github.com/hanxunyu/VisionTrim.
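The select-then-merge idea described in the abstract can be illustrated with a small sketch. This is not the VisionTrim implementation (see the repository linked above for that); it is a generic toy version under stated assumptions: each visual token already has an importance score, dropped tokens are folded into their most similar kept token, and the merge weight is the dropped token's cosine similarity to a single text vector. The names `select_and_merge`, `scores`, and `text_vec` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_and_merge(tokens, scores, text_vec, keep):
    """Toy token compression (an assumption-laden sketch, not the paper's method).

    1. Keep the `keep` tokens with the highest importance scores.
    2. Fold every dropped token into its most similar kept token,
       weighted by the dropped token's similarity to the text vector.
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])  # preserve the original token order
    dropped_idx = order[keep:]

    # Weighted-sum accumulators; each kept token starts with weight 1.0.
    sums = {i: list(tokens[i]) for i in kept_idx}
    weights = {i: 1.0 for i in kept_idx}

    for j in dropped_idx:
        w = max(cosine(tokens[j], text_vec), 0.0)  # ignore text-irrelevant tokens
        if w == 0.0:
            continue
        target = max(kept_idx, key=lambda i: cosine(tokens[j], tokens[i]))
        sums[target] = [s + w * x for s, x in zip(sums[target], tokens[j])]
        weights[target] += w

    # Normalise each accumulator into a weighted average.
    return [[s / weights[i] for s in sums[i]] for i in kept_idx]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    scores = [0.9, 0.2, 0.8, 0.1]
    compressed = select_and_merge(tokens, scores, text_vec=[1.0, 0.0], keep=2)
    print(len(compressed), "tokens kept out of", len(tokens))
```

In this toy setup the token count drops from four to two, and the dropped tokens still contribute to the kept ones in proportion to their relevance to the text.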

Related papers

CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models [75.88232735646018]
Large Vision-Language Models (LVLMs) process multimodal inputs consisting of text tokens and vision tokens extracted from images or videos. Existing methods attempt to prune redundant vision tokens, revealing substantial redundancy in visual representations. We propose CoViPAL, a layer-wise contextualized visual token pruning method that employs a Plug-and-Play Pruning Module (PPM) to predict and remove redundant vision tokens before they are processed by the LVLM.
arXiv Detail & Related papers (2025-08-24T07:47:00Z)
DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs [28.998923104606614]
DisCo is a visual encapsulation method designed to yield semantically distinct and temporally coherent visual tokens for video MLLMs. DisCo remarkably outperforms previous state-of-the-art methods across a variety of video understanding benchmarks.
arXiv Detail & Related papers (2025-07-14T14:05:19Z)
Revisit What You See: Disclose Language Prior in Vision Tokens for LVLM Decoding [6.612630497074871]
Large Vision-Language Models (LVLMs) achieve strong performance across multimodal tasks by integrating visual perception with language understanding. We propose ReVisiT, a training-free decoding method that references vision tokens to guide text generation.
arXiv Detail & Related papers (2025-06-11T08:46:55Z)
LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models [9.660892239615364]
This work explores fusion strategies of visual tokens for hybrid MLLMs, leading to the design of LEO. LEO is a novel MLLM with a dual-branch vision encoder framework that incorporates a post-adaptation fusion strategy and adaptive tiling. We show that LEO can be adapted to the specialized domain of autonomous driving without altering the model architecture or training recipe.
arXiv Detail & Related papers (2025-01-13T00:29:55Z)
Enhancing Perception Capabilities of Multimodal LLMs with Training-Free Fusion [40.56646959926701]
Multimodal LLMs (MLLMs) equip language models with visual capabilities by aligning vision encoders with language models. Existing methods to enhance the visual perception of MLLMs often involve designing more powerful vision encoders. We introduce VisionFuse, a novel integration framework that efficiently utilizes multiple vision encoders from off-the-shelf MLLMs.
arXiv Detail & Related papers (2024-12-02T09:02:28Z)
Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [66.04061083611863]
Excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation. We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
arXiv Detail & Related papers (2024-11-29T11:24:23Z)
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [89.24440488456405]
VisionLLM v2 is an end-to-end generalist multimodal large model (MLLM). It unifies visual perception, understanding, and generation within a single framework.
arXiv Detail & Related papers (2024-06-12T16:44:50Z)
Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language. This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok). SeTok groups visual features into semantic units via a dynamic clustering algorithm. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z)
Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs. Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens. Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.