Related papers: MoVA: Adapting Mixture of Vision Experts to Multimodal Context

URL: http://arxiv.org/abs/2404.13046v2
Date: Thu, 31 Oct 2024 17:39:34 GMT
Title: MoVA: Adapting Mixture of Vision Experts to Multimodal Context
Authors: Zhuofan Zong, Bingqi Ma, Dazhong Shen, Guanglu Song, Hao Shao, Dongzhi Jiang, Hongsheng Li, Yu Liu
Abstract summary: We propose the MoVA, a powerful and novel MLLM, adaptively routing and fusing task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, we design a context-aware expert routing strategy to dynamically select the most suitable vision experts. In the fine-grained stage, we elaborately conduct the mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse task-specific knowledge.
Score: 38.8308841469793
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As the key component in multimodal large language models (MLLMs), the ability of the visual encoder greatly affects MLLM's understanding on diverse image content. Although some large-scale pretrained vision encoders such as vision encoders in CLIP and DINOv2 have brought promising performance, we found that there is still no single vision encoder that can dominate various image content understanding, e.g., the CLIP vision encoder leads to outstanding results on general image understanding but poor performance on document or chart content. To alleviate the bias of CLIP vision encoder, we first delve into the inherent behavior of different pre-trained vision encoders and then propose the MoVA, a powerful and novel MLLM, adaptively routing and fusing task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, we design a context-aware expert routing strategy to dynamically select the most suitable vision experts according to the user instruction, input image, and expertise of vision experts. This benefits from the powerful model function understanding ability of the large language model (LLM). In the fine-grained stage, we elaborately conduct the mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse task-specific knowledge from various experts. This coarse-to-fine paradigm effectively leverages representations from experts based on multimodal context and model expertise, further enhancing the generalization ability. We conduct extensive experiments to evaluate the effectiveness of the proposed approach. Without any bells and whistles, MoVA can achieve significant performance gains over current state-of-the-art methods in a wide range of challenging multimodal benchmarks.
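The coarse-to-fine idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract says routing is performed by the LLM and fusion by the MoV-Adapter, whose internals are not described here. The sketch below assumes instead that routing is a top-k selection over precomputed relevance scores and that fusion is a softmax-gated weighted sum. All function names, expert names, scores, and feature values are hypothetical.

```python
import math


def route_experts(relevance, top_k=2):
    """Coarse stage (toy stand-in): keep the top_k experts by relevance score.

    In the paper the selection is made by the LLM from the instruction, the
    image, and expert descriptions; the numeric scores here are an assumption.
    """
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    return ranked[:top_k]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(features, gate_logits, selected):
    """Fine stage (toy stand-in): softmax-gated weighted sum of expert features.

    Returns the fused vector and the gate weight assigned to each expert.
    """
    weights = softmax([gate_logits[name] for name in selected])
    fused = [0.0] * len(features[selected[0]])
    for weight, name in zip(weights, selected):
        for i, value in enumerate(features[name]):
            fused[i] += weight * value
    return fused, dict(zip(selected, weights))


# Hypothetical relevance scores for a document-style query.
relevance = {"clip": 0.2, "dinov2": 0.1, "doc_expert": 0.9, "chart_expert": 0.7}
# Hypothetical 3-dimensional features produced by each expert.
features = {
    "clip": [0.1, 0.2, 0.3],
    "dinov2": [0.3, 0.1, 0.0],
    "doc_expert": [1.0, 0.0, 0.5],
    "chart_expert": [0.0, 1.0, 0.5],
}
# Hypothetical gate logits; a real adapter would predict these from context.
gate_logits = {"clip": 0.0, "dinov2": 0.0, "doc_expert": 1.0, "chart_expert": 0.5}

selected = route_experts(relevance, top_k=2)
fused, weights = fuse_features(features, gate_logits, selected)
print(selected, weights, fused)
```

The two-stage split means unselected experts never contribute to the fused feature, while the gate weights decide how much each selected expert contributes.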

Related papers

VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization [87.26383908243878]
We show that vision encoders within Multimodal Large Language Models exhibit deficiencies in their dense feature representations. We propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training.
arXiv Detail & Related papers (2026-02-10T16:08:19Z)
Fusion to Enhance: Fusion Visual Encoder to Enhance Multimodal Language Model [1.3663057923522652]
We introduce Fusion to Enhance (FtZ), a novel vision tower framework. FtZ moves beyond the single-encoder design by innovatively composing a semantically powerful anchor encoder with a perception-rich augmenting encoder. This work proves that composing heterogeneous expert encoders is an efficient and effective path to overcoming the visual perception bottleneck in current MLLMs.
arXiv Detail & Related papers (2025-08-31T02:22:57Z)
Mixpert: Mitigating Multimodal Learning Conflicts with Efficient Mixture-of-Vision-Experts [104.73983712940816]
Multimodal large language models (MLLMs) require nuanced interpretation of complex image information. Relying solely on a single vision encoder to handle diverse task domains proves difficult and inevitably leads to conflicts. We introduce Mixpert, an efficient mixture-of-vision-experts architecture that inherits the joint learning advantages from a single vision encoder.
arXiv Detail & Related papers (2025-05-30T12:48:07Z)
ToVE: Efficient Vision-Language Learning via Knowledge Transfer from Vision Experts [29.446235941754345]
Vision-language (VL) learning requires extensive visual perception capabilities. Recent works typically rely on training huge models on massive datasets to develop these capabilities. This paper proposes a new framework that transfers the knowledge from a hub of Vision Experts.
arXiv Detail & Related papers (2025-04-01T12:02:40Z)
Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models [53.13731845500678]
We introduce a novel metric, $Rank_e$, to quantify the effect of vision encoder's prior knowledge on MLLM performance. We propose VisPRE, a two-stage training framework that explicitly incorporates prior knowledge at the vision encoder level. Experimental results demonstrate that augmenting vision encoder's prior knowledge substantially boosts the visual understanding capabilities of MLLMs.
arXiv Detail & Related papers (2025-03-23T11:33:09Z)
MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing [2.0249250133493195]
Multimodal language models (MLMs) integrate visual and textual information by coupling a vision encoder with a large language model through the specific adapter. We propose MOVE (Mixture of Visions) to leverage multiple pre-trained encoders for specialized tasks.
arXiv Detail & Related papers (2025-02-21T11:05:30Z)
VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use [74.39058448757645]
We present VipAct, an agent framework that enhances vision-language models (VLMs). VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks. We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements.
arXiv Detail & Related papers (2024-10-21T18:10:26Z)
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders [89.41055673919895]
This study explores the design space for MLLMs using a mixture of vision encoders and resolutions. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
arXiv Detail & Related papers (2024-08-28T17:59:31Z)
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models [87.47400128150032]
We propose a novel LMM architecture named Lumen, a Large multimodal model with versatile vision-centric capability enhancement. Lumen first promotes fine-grained vision-language concept alignment. Then the task-specific decoding is carried out by flexibly routing the shared representation to lightweight task decoders.
arXiv Detail & Related papers (2024-03-12T04:13:45Z)
Question Aware Vision Transformer for Multimodal Reasoning [14.188369270753347]
We introduce QA-ViT, a Question Aware Vision Transformer approach for multimodal reasoning. It embeds question awareness directly within the vision encoder. This integration results in dynamic visual features focusing on relevant image aspects to the posed question.
arXiv Detail & Related papers (2024-02-08T08:03:39Z)
MouSi: Poly-Visual-Expert Vision-Language Models [132.58949014605477]
This paper proposes the use of ensemble experts technique to synergize the capabilities of individual visual encoders. This technique introduces a fusion network to unify the processing of outputs from different visual experts. In our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient and manageable 64 or even down to 1.
arXiv Detail & Related papers (2024-01-30T18:09:11Z)
M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition [39.92547393649842]
We introduce a novel Multimodal, Multi-task CLIP adapting framework named M2-CLIP to address these challenges. We demonstrate exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.
arXiv Detail & Related papers (2024-01-22T02:03:31Z)
Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism. We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.