Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders
- URL: http://arxiv.org/abs/2507.03262v1
- Date: Fri, 04 Jul 2025 02:38:59 GMT
- Title: Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders
- Authors: Song Mao, Yang Chen, Pinglong Cai, Ding Wang, Guohang Yan, Zhi Yu, Botian Shi,
- Abstract summary: Multimodal Large Language Models (MLLMs) increasingly adopt multiple vision encoders to capture diverse visual information. We observe that the performance gains from adding encoders often diminish and can even lead to performance degradation. To quantify each encoder's unique contribution, we propose a metric: the Conditional Utilization Rate (CUR).
- Score: 17.14555102933619
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) increasingly adopt multiple vision encoders to capture diverse visual information, ranging from coarse semantics to fine-grained details. While this approach is intended to enhance visual understanding capability, we observe that the performance gains from adding encoders often diminish and can even lead to performance degradation, a phenomenon we term encoder redundancy. This paper presents a systematic investigation into this issue. Through comprehensive ablation studies on state-of-the-art multi-encoder MLLMs, we empirically demonstrate that significant redundancy exists. To quantify each encoder's unique contribution, we propose a principled metric: the Conditional Utilization Rate (CUR). Building on CUR, we introduce the Information Gap (IG) to capture the overall disparity in encoder utility within a model. Our experiments reveal that certain vision encoders contribute little, or even negatively, to the model's performance, confirming the prevalence of redundancy. These findings highlight critical inefficiencies in current multi-encoder designs and establish that our proposed metrics can serve as valuable diagnostic tools for developing more efficient and effective multimodal architectures.
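To make the metrics concrete, here is a minimal sketch of how a CUR-style ablation analysis could be computed. The abstract does not give the exact formulas, so the definitions used below (CUR as the relative performance drop when one encoder is removed, IG as the spread between the highest and lowest CUR) and all encoder names and scores are illustrative assumptions only.

```python
from typing import Dict


def conditional_utilization_rate(full_score: float, ablated_score: float) -> float:
    """Assumed formulation: relative performance drop when one encoder is disabled.

    CUR_i = (S_full - S_without_i) / S_full
    Values near zero suggest the encoder adds little unique information;
    negative values mean removing it actually improves the benchmark score.
    """
    return (full_score - ablated_score) / full_score


def information_gap(curs: Dict[str, float]) -> float:
    """Assumed definition: spread between the most- and least-utilized encoders."""
    return max(curs.values()) - min(curs.values())


# Toy benchmark accuracies (illustrative numbers, not results from the paper).
full = 62.0                  # all vision encoders active
ablated = {                  # one encoder removed at a time
    "clip_vit": 54.5,
    "dino_v2": 61.2,
    "sam_vit": 62.6,         # removing this encoder slightly helps
}

curs = {name: conditional_utilization_rate(full, s) for name, s in ablated.items()}
print({name: round(c, 3) for name, c in curs.items()})
print("Information Gap:", round(information_gap(curs), 3))
```

In this toy run, clip_vit carries most of the unique signal while sam_vit's negative CUR flags it as redundant, which is the kind of diagnosis the paper's metrics are intended to support.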
Related papers
- METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models [92.37117312251755]
We propose a progressive pruning framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR). For multi-vision encoding, we discard redundant tokens within each encoder via a rank-guided collaborative token assignment strategy. For multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning.
arXiv Detail & Related papers (2025-07-28T13:50:53Z)
- Mixpert: Mitigating Multimodal Learning Conflicts with Efficient Mixture-of-Vision-Experts [104.73983712940816]
Multimodal large language models (MLLMs) require nuanced interpretation of complex image information. Relying solely on a single vision encoder to handle diverse task domains proves difficult and inevitably leads to conflicts. We introduce Mixpert, an efficient mixture-of-vision-experts architecture that inherits the joint learning advantages from a single vision encoder.
arXiv Detail & Related papers (2025-05-30T12:48:07Z)
- A Shared Encoder Approach to Multimodal Representation Learning [17.863705872504]
We present a shared encoder framework for multimodal representation learning tailored to the medical domain. Our approach employs a single set of encoder parameters shared across modalities, augmented with learnable modality features.
arXiv Detail & Related papers (2025-03-03T15:29:26Z)
- EVEv2: Improved Baselines for Encoder-Free Vision-Language Models [72.07868838411474]
Existing encoder-free vision-language models (VLMs) are narrowing the performance gap with their encoder-based counterparts. We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones. We show that properly and hierarchically associating vision and language within a unified model reduces interference between modalities.
arXiv Detail & Related papers (2025-02-10T18:59:58Z)
- MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders [28.22099619211775]
Visual encoders are fundamental components in vision-language models (VLMs). Recent studies incorporate multiple encoders within a single VLM, leading to a considerable increase in computational cost. We present a novel framework that distills the unique proficiencies of multiple vision encoders into a single, efficient encoder model.
arXiv Detail & Related papers (2025-01-03T09:10:34Z)
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders [89.41055673919895]
This study explores the design space for MLLMs using a mixture of vision encoders and resolutions. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies (a minimal sketch of this fusion follows the entry). The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
arXiv Detail & Related papers (2024-08-28T17:59:31Z)
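The concatenation fusion highlighted in the Eagle entry can be sketched in a few lines; the encoder names, token counts, feature dimensions, and the linear projection below are illustrative assumptions rather than details taken from the abstract.

```python
import torch
import torch.nn as nn


class ConcatFusion(nn.Module):
    """Fuse per-patch features from several vision encoders by channel
    concatenation, then project to the LLM embedding width (illustrative)."""

    def __init__(self, encoder_dims, llm_dim):
        super().__init__()
        self.proj = nn.Linear(sum(encoder_dims), llm_dim)

    def forward(self, features):
        # features: list of [batch, num_tokens, dim_i] tensors, one per encoder,
        # assumed to be spatially aligned (same num_tokens after resizing).
        fused = torch.cat(features, dim=-1)   # [batch, num_tokens, sum(dims)]
        return self.proj(fused)               # [batch, num_tokens, llm_dim]


# Toy usage with two hypothetical encoders producing aligned token grids.
clip_tokens = torch.randn(2, 576, 1024)   # e.g. CLIP-ViT-style features
dino_tokens = torch.randn(2, 576, 768)    # e.g. DINOv2-style features
fusion = ConcatFusion([1024, 768], llm_dim=4096)
print(fusion([clip_tokens, dino_tokens]).shape)   # torch.Size([2, 576, 4096])
```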
- MoVA: Adapting Mixture of Vision Experts to Multimodal Context [38.8308841469793]
We propose MoVA, a powerful and novel MLLM that adaptively routes and fuses task-specific vision experts with a coarse-to-fine mechanism.
In the coarse-grained stage, we design a context-aware expert routing strategy to dynamically select the most suitable vision experts.
In the fine-grained stage, we apply the mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse task-specific knowledge.
arXiv Detail & Related papers (2024-04-19T17:59:48Z)
- MouSi: Poly-Visual-Expert Vision-Language Models [132.58949014605477]
This paper proposes an ensemble-of-experts technique to synergize the capabilities of individual visual encoders.
This technique introduces a fusion network to unify the processing of outputs from different visual experts.
In our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient and manageable 64 or even down to 1.
arXiv Detail & Related papers (2024-01-30T18:09:11Z)
- Rethinking and Improving Natural Language Generation with Layer-Wise Multi-View Decoding [59.48857453699463]
In sequence-to-sequence learning, the decoder relies on the attention mechanism to efficiently extract information from the encoder.
Recent work has proposed to use representations from different encoder layers for diversified levels of information.
We propose layer-wise multi-view decoding, where each decoder layer attends both to the representations from the last encoder layer, which serve as a global view, and to those from other encoder layers, yielding a stereoscopic view of the source sequences (a minimal sketch follows this entry).
arXiv Detail & Related papers (2020-05-16T20:00:39Z)
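As a rough illustration of the layer-wise multi-view idea in the entry above, the sketch below gives each decoder layer a second cross-attention over an auxiliary (earlier) encoder layer in addition to the usual one over the last encoder layer; the specific layer pairing, the additive combination, and the omission of attention masks are assumptions made for brevity, not the paper's exact design.

```python
import torch
import torch.nn as nn


class MultiViewDecoderLayer(nn.Module):
    """Decoder layer that cross-attends to the last encoder layer (global view)
    plus one auxiliary encoder layer (illustrative; masks omitted for brevity)."""

    def __init__(self, d_model: int = 512, nhead: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.cross_global = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.cross_aux = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

    def forward(self, tgt, enc_last, enc_aux):
        # Standard masked self-attention would go here; the mask is omitted in this sketch.
        x = self.norms[0](tgt + self.self_attn(tgt, tgt, tgt)[0])
        # Combine the global view (last encoder layer) with a "stereoscopic"
        # auxiliary view (an earlier encoder layer) by simple addition.
        cross = (self.cross_global(x, enc_last, enc_last)[0]
                 + self.cross_aux(x, enc_aux, enc_aux)[0])
        x = self.norms[1](x + cross)
        return self.norms[2](x + self.ffn(x))


# Toy usage: batch of 2, source length 10, target length 7.
layer = MultiViewDecoderLayer()
tgt = torch.randn(2, 7, 512)
enc_last, enc_aux = torch.randn(2, 10, 512), torch.randn(2, 10, 512)
print(layer(tgt, enc_last, enc_aux).shape)   # torch.Size([2, 7, 512])
```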