METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models
- URL: http://arxiv.org/abs/2507.20842v1
- Date: Mon, 28 Jul 2025 13:50:53 GMT
- Title: METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models
- Authors: Yuchen Liu, Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian,
- Abstract summary: We propose a progressive pruning framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR)<n>For multi-vision encoding, we discard redundant tokens within each encoder via a rank guided collaborative token assignment strategy.<n>For multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning.
- Score: 92.37117312251755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision encoders serve as the cornerstone of multimodal understanding. Single-encoder architectures like CLIP exhibit inherent constraints in generalizing across diverse multimodal tasks, while recent multi-encoder fusion methods introduce prohibitive computational overhead to achieve superior performance using complementary visual representations from multiple vision encoders. To address this, we propose a progressive pruning framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR), that eliminates redundant visual tokens across the encoding, fusion, and decoding stages for multi-encoder MLLMs. For multi-vision encoding, we discard redundant tokens within each encoder via a rank guided collaborative token assignment strategy. Subsequently, for multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning. Finally, we propose an adaptive token pruning method in the LLM decoding stage to further discard irrelevant tokens based on the text prompts with dynamically adjusting pruning ratios for specific task demands. To our best knowledge, this is the first successful attempt that achieves an efficient multi-encoder based vision language model with multi-stage pruning strategies. Extensive experiments on 11 benchmarks demonstrate the effectiveness of our proposed approach. Compared with EAGLE, a typical multi-encoder MLLMs, METEOR reduces 76% visual tokens with only 0.3% performance drop in average. The code is available at https://github.com/YuchenLiu98/METEOR.
Related papers
- LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models [62.240460476785934]
We propose LaCo (Layer-wise Visual Token Compression), a novel framework that enables effective token compression within the intermediate layers of the vision encoder.<n>LaCo introduces two core components: 1) a layer-wise pixel-shuffle mechanism that systematically merges adjacent tokens through space-to-channel transformations, and 2) a residual learning architecture with non-parametric shortcuts.
arXiv Detail & Related papers (2025-07-03T03:42:54Z) - FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens [56.752362642658504]
We present FuseLIP, an alternative architecture for multimodal embedding.<n>We propose a single transformer model which operates on an extended vocabulary of text and image tokens.<n>We show that FuseLIP outperforms other approaches in multimodal embedding tasks such as VQA and text-guided image transformation retrieval.
arXiv Detail & Related papers (2025-06-03T17:27:12Z) - ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models [59.47738955960352]
ToDRE is a two-stage and training-free token compression framework.<n>It achieves superior performance by pruning tokens based on token Diversity and token-task RElevance.
arXiv Detail & Related papers (2025-05-24T15:47:49Z) - CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms [16.41418610688371]
We introduce CrossLMM, which substantially reduces visual token quantity with minimal performance degradation.<n>We also introduce a text-to-visual cross-attention mechanism, for which the text tokens are enhanced through interaction with the original visual tokens.<n>Our approach achieves comparable or superior performance across diverse video-based Large Language Models benchmarks.
arXiv Detail & Related papers (2025-05-22T17:59:53Z) - ToFu: Visual Tokens Reduction via Fusion for Multi-modal, Multi-patch, Multi-image Task [34.269081635534526]
We propose ToFu, a visual encoder-agnostic, training-free Token Fusion strategy for high-resolution, multi-image, tasks.<n>We validate our approach on the well-established LLaVA-Interleave Bench, which covers challenging multi-image tasks.
arXiv Detail & Related papers (2025-03-06T14:00:59Z) - MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing [2.0249250133493195]
Multimodal language models (MLMs) integrate visual and textual information by coupling a vision encoder with a large language model through the specific adapter.<n>We propose MOVE (Mixture of Visions) to leverage multiple pre-trained encoders for specialized tasks.
arXiv Detail & Related papers (2025-02-21T11:05:30Z) - Multimodal Autoregressive Pre-training of Large Vision Encoders [85.39154488397931]
We present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process.
Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification.
arXiv Detail & Related papers (2024-11-21T18:31:25Z) - A Simple Baseline with Single-encoder for Referring Image Segmentation [14.461024566536478]
We present a novel RIS method with a single-encoder, i.e., BEiT-3, maximizing the potential of shared self-attention.<n>Our simple baseline with a single encoder achieves outstanding performances on the RIS benchmark datasets.
arXiv Detail & Related papers (2024-08-28T04:14:01Z) - Rethinking and Improving Natural Language Generation with Layer-Wise
Multi-View Decoding [59.48857453699463]
In sequence-to-sequence learning, the decoder relies on the attention mechanism to efficiently extract information from the encoder.
Recent work has proposed to use representations from different encoder layers for diversified levels of information.
We propose layer-wise multi-view decoding, where for each decoder layer, together with the representations from the last encoder layer, which serve as a global view, those from other encoder layers are supplemented for a stereoscopic view of the source sequences.
arXiv Detail & Related papers (2020-05-16T20:00:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.