Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-Language Model
- URL: http://arxiv.org/abs/2507.01351v1
- Date: Wed, 02 Jul 2025 04:38:12 GMT
- Title: Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-Language Model
- Authors: Chaoxiang Cai, Longrong Yang, Kaibing Chen, Fan Yang, Xi Li
- Abstract summary: We propose a distribution-aware router for modality-specific routing in vision-language models. We introduce an oversampling-like strategy by increasing the number of activated experts for vision tail tokens. Experiments on extensive benchmarks validate the effectiveness of our approach.
- Score: 9.553346865898366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The mixture-of-experts (MoE), which replaces dense models with sparse architectures, has gained attention in large vision-language models (LVLMs) for achieving comparable performance with fewer activated parameters. Existing MoE frameworks for LVLMs focus on token-to-expert routing (TER), encouraging different experts to specialize in processing distinct tokens. However, these frameworks often rely on the load balancing mechanism, overlooking the inherent distributional differences between vision and language. To this end, we propose a Long-Tailed Distribution-aware Router (LTDR) for vision-language TER, tackling two challenges: (1) Distribution-aware router for modality-specific routing. We observe that language TER follows a uniform distribution, whereas vision TER exhibits a long-tailed distribution. This discrepancy necessitates distinct routing strategies tailored to each modality. (2) Enhancing expert activation for vision tail tokens. Recognizing the importance of vision tail tokens, we introduce an oversampling-like strategy by increasing the number of activated experts for these tokens. Experiments on extensive benchmarks validate the effectiveness of our approach.
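A minimal sketch of the routing rule the abstract describes: tokens are routed with modality-specific strategies, and vision tokens whose top-1 expert falls in the tail of the expert-usage distribution receive extra activated experts (the oversampling-like strategy). The tail test and the `tail_frac`/`extra_k` hyperparameters are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ltdr_route(hidden, router_w, is_vision, k=2, extra_k=1, tail_frac=0.5):
    """Sketch of distribution-aware routing: vision tokens whose top-1
    expert lies in the tail of the expert-usage distribution get extra
    activated experts (an oversampling-like rule). The tail criterion and
    hyperparameters are illustrative, not the paper's exact design.

    hidden:    (T, d) tokens;  router_w: (d, E);  is_vision: (T,) bool
    """
    probs = F.softmax(hidden @ router_w, dim=-1)          # (T, E)
    top1 = probs.argmax(dim=-1)                           # (T,)

    # Expert usage over vision tokens; rarely-hit experts form the tail.
    E = probs.size(1)
    usage = torch.bincount(top1[is_vision], minlength=E).float()
    num_tail = max(1, int(tail_frac * E))
    tail_experts = usage.argsort()[:num_tail]             # least-used experts

    routes = []
    for t in range(hidden.size(0)):
        k_t = k
        if is_vision[t] and (top1[t] == tail_experts).any():
            k_t += extra_k                # oversample vision tail tokens
        sel = probs[t].topk(k_t)
        routes.append((sel.indices, sel.values))
    return routes
```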
Related papers
- VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization [87.26383908243878]
We show that vision encoders within Multimodal Large Language Models exhibit deficiencies in their dense feature representations. We propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training.
arXiv Detail & Related papers (2026-02-10T16:08:19Z)
- Latent Implicit Visual Reasoning [59.39913238320798]
We propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks.
arXiv Detail & Related papers (2025-12-24T14:59:49Z)
- AnyExperts: On-Demand Expert Allocation for Multimodal Language Models with Mixture of Expert [26.761443359046286]
We propose AnyExperts, a novel on-demand, budget-aware dynamic routing framework. It allocates a variable total number of expert slots per token based on its semantic importance. It is evaluated across diverse tasks in visual understanding, audio understanding, and NLP understanding.
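A rough sketch of how a budget-aware, per-token slot allocation might look; the greedy hand-out rule and the `min_k`/`max_k` bounds below are assumptions, not AnyExperts' actual mechanism.

```python
import torch

def allocate_expert_slots(importance, total_budget, min_k=1, max_k=4):
    """Distribute a fixed total number of expert slots across tokens
    in proportion to per-token importance (illustrative scheme).

    importance:   (T,) nonnegative scores, e.g. from a learned scorer
    total_budget: total expert activations allowed for the batch
    """
    T = importance.numel()
    slots = torch.full((T,), min_k, dtype=torch.long)
    remaining = total_budget - slots.sum().item()
    if remaining <= 0:
        return slots
    # Hand out the remaining slots one at a time to the most important
    # tokens that are still below max_k.
    order = importance.argsort(descending=True)
    i = 0
    while remaining > 0 and (slots < max_k).any():
        t = order[i % T]
        if slots[t] < max_k:
            slots[t] += 1
            remaining -= 1
        i += 1
    return slots
```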
arXiv Detail & Related papers (2025-11-23T06:53:43Z)
- Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance [79.21541758879012]
Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model capacity while preserving computational efficiency. We present ProMoE, an MoE framework featuring a two-step router with explicit routing guidance that promotes expert specialization.
arXiv Detail & Related papers (2025-10-28T17:59:02Z)
- Spotlight on Token Perception for Multimodal Reinforcement Learning [65.97597482517425]
Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs). In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception. We propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal.
arXiv Detail & Related papers (2025-10-10T11:25:33Z)
- HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models [60.028070589466445]
We propose HERO, a framework that integrates content-adaptive token budget allocation with function-aware token selection. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.
arXiv Detail & Related papers (2025-09-16T13:22:08Z)
- RouteMark: A Fingerprint for Intellectual Property Attribution in Routing-based Model Merging [69.2230254959204]
We propose RouteMark, a framework for IP protection in merged MoE models. Our key insight is that task-specific experts exhibit stable and distinctive routing behaviors under probing inputs. For attribution and tampering detection, we introduce a similarity-based matching algorithm.
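A small sketch of the attribution step as the summary describes it: an expert's fingerprint is its routing behavior under probing inputs, matched against references by similarity. The mean-distribution fingerprint and cosine threshold are illustrative choices, not RouteMark's exact algorithm.

```python
import torch
import torch.nn.functional as F

def routing_fingerprint(router_probs):
    """Summarize routing behavior as the mean expert-selection distribution
    over a set of probing inputs. router_probs: (num_probes, num_experts)."""
    return router_probs.mean(dim=0)

def attribute_expert(candidate_fp, reference_fps, threshold=0.9):
    """Match a candidate fingerprint against known task-specific experts.
    Returns (best_match_index, similarity), or (None, similarity) when no
    reference is similar enough, flagging possible tampering."""
    sims = F.cosine_similarity(candidate_fp.unsqueeze(0), reference_fps, dim=-1)
    best = sims.argmax().item()
    if sims[best] < threshold:
        return None, sims[best].item()
    return best, sims[best].item()
```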
arXiv Detail & Related papers (2025-08-03T14:51:58Z)
- EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models [25.12002287083368]
Multi-modal large language models (MLLMs) have increasingly adopted MoE techniques. Expert uniformity occurs because MoE experts are often initialized by simply replicating the FFN parameters from LLMs, while router rigidity stems from the prevalent use of static linear routers for expert selection.
arXiv Detail & Related papers (2025-05-28T08:38:39Z)
- ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models [59.47738955960352]
ToDRE is a two-stage, training-free token compression framework. It achieves superior performance by pruning tokens based on token Diversity and token-task RElevance.
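A sketch of a two-stage, training-free pruning pass in the spirit of this summary; the farthest-point diversity selection and cosine relevance score are stand-ins, not ToDRE's exact criteria.

```python
import torch
import torch.nn.functional as F

def prune_tokens(vis_tokens, task_embed, keep_diverse=64, rel_keep=0.5):
    """Two-stage pruning sketch: keep a diverse subset first (greedy
    farthest-point selection), then keep the fraction most relevant to
    the task embedding. vis_tokens: (T, d), task_embed: (d,)."""
    x = F.normalize(vis_tokens, dim=-1)
    # Stage 1: greedily pick the token least similar to the chosen set.
    chosen = [0]
    max_sim = x @ x[0]                 # max similarity to any chosen token
    for _ in range(min(keep_diverse, x.size(0)) - 1):
        nxt = max_sim.argmin().item()
        chosen.append(nxt)
        max_sim = torch.maximum(max_sim, x @ x[nxt])
    subset = vis_tokens[chosen]
    # Stage 2: keep the tokens most relevant to the task embedding.
    rel = F.normalize(subset, dim=-1) @ F.normalize(task_embed, dim=0)
    k = max(1, int(rel_keep * subset.size(0)))
    return subset[rel.topk(k).indices]
```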
arXiv Detail & Related papers (2025-05-24T15:47:49Z)
- Improving Routing in Sparse Mixture of Experts with Graph of Tokens [32.46693871593765]
We unveil the limitation of Sparse Mixture of Experts (SMoE) through the perspective of the probabilistic graphical model (PGM). We propose the novel Similarity-Aware (S)MoE, which considers interactions between tokens during expert selection. We empirically validate our models on various tasks and domains, showing significant improvements in reducing routing fluctuations.
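One way to make a router similarity-aware, sketched under the assumption that token interactions enter as a similarity-weighted smoothing of routing logits; the softmax-kernel graph and mixing weight `alpha` are illustrative choices, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def similarity_aware_logits(hidden, router_w, alpha=0.5, temp=0.1):
    """Routing logits smoothed over a token-similarity graph (sketch).
    Each token's logits are mixed with a similarity-weighted average of
    other tokens' logits, which damps routing fluctuations.

    hidden: (T, d) tokens;  router_w: (d, E) router projection
    """
    logits = hidden @ router_w                          # (T, E)
    xn = F.normalize(hidden, dim=-1)
    adj = F.softmax((xn @ xn.T) / temp, dim=-1)         # (T, T) row-stochastic
    return (1 - alpha) * logits + alpha * (adj @ logits)
```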
arXiv Detail & Related papers (2025-05-01T18:44:20Z)
- Predicting Multi-Agent Specialization via Task Parallelizability [8.465921582175426]
We present a closed-form bound that predicts when specialization improves performance depending on task regimes and team size. We validate our model on two standard MARL benchmarks that represent opposite task regimes. Three follow-up experiments in Overcooked-AI demonstrate that the model works in environments with more complex spatial and resource bottlenecks.
arXiv Detail & Related papers (2025-03-19T21:33:48Z)
- CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models [60.08485416687596]
The Chain of Multi-modal Thought (CoMT) benchmark aims to mimic human-like reasoning that inherently integrates visual operations. We evaluate various LVLMs and strategies on CoMT, revealing key insights into the capabilities and limitations of current approaches.
arXiv Detail & Related papers (2024-12-17T14:10:16Z)
- FoPru: Focal Pruning for Efficient Large Vision-Language Models [11.36025001578531]
We propose Focal Pruning (FoPru), a training-free method that prunes visual tokens based on the attention-based token significance derived from the vision encoder.
Our method can prune a large number of redundant tokens while maintaining high accuracy, leading to significant improvements in inference efficiency.
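A minimal illustration of attention-based token significance; FoPru's exact score may differ, and `cls_attn` here is assumed to come from the vision encoder's last attention layer.

```python
import torch

def focal_prune(patch_tokens, cls_attn, keep_ratio=0.25):
    """Training-free pruning sketch: score each visual token by the
    attention it receives from the [CLS] token in the vision encoder,
    then keep the top fraction (an attention-based stand-in score).

    patch_tokens: (T, d)   cls_attn: (T,) attention weights from [CLS]
    """
    k = max(1, int(keep_ratio * patch_tokens.size(0)))
    keep = cls_attn.topk(k).indices.sort().values   # preserve spatial order
    return patch_tokens[keep]
```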
arXiv Detail & Related papers (2024-11-21T14:22:38Z)
- RS-MoE: A Vision-Language Model with Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering [23.699493284403967]
This paper proposes RS-MoE, the first Mixture-of-Experts-based VLM specifically customized for the remote sensing domain. Unlike traditional MoE models, the core of RS-MoE is the MoE Block, which incorporates a novel Instruction Router and multiple lightweight Large Language Models (LLMs) as expert models. We show that our model achieves state-of-the-art performance in generating precise and contextually relevant captions.
arXiv Detail & Related papers (2024-11-03T15:05:49Z)
- Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model [22.103850646343915]
We use token-level gradient analysis to identify conflicting tokens in experts. We then add a regularization loss tailored to encourage conflicting tokens to route from their current experts to other experts. Our method can serve as a plug-in for diverse Large Vision-Language Models.
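A hedged sketch of this idea: detect tokens whose gradients conflict with their expert's mean gradient and penalize their routing probability to that expert. The detection rule and loss form are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def conflict_regularizer(token_grads, expert_ids, router_probs):
    """Token-gradient-conflict penalty (sketch). A token whose gradient
    points against the mean gradient of its expert's tokens is treated as
    conflicting; its routing probability to that expert is penalized.

    token_grads:  (T, d) per-token gradients w.r.t. shared parameters
    expert_ids:   (T,) expert assigned to each token
    router_probs: (T, E) routing probabilities
    """
    loss = token_grads.new_zeros(())
    for e in expert_ids.unique():
        mask = expert_ids == e
        grads = token_grads[mask]
        mean_g = grads.mean(dim=0, keepdim=True)
        cos = F.cosine_similarity(grads, mean_g, dim=-1)
        conflicting = cos < 0
        if conflicting.any():
            # Push conflicting tokens' probability mass away from expert e.
            loss = loss + router_probs[mask][conflicting, e].mean()
    return loss
```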
arXiv Detail & Related papers (2024-06-28T13:20:17Z)
- MouSi: Poly-Visual-Expert Vision-Language Models [132.58949014605477]
This paper proposes an ensemble-of-experts technique to synergize the capabilities of individual visual encoders.
This technique introduces a fusion network to unify the processing of outputs from different visual experts.
In our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient and manageable 64 or even down to 1.
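One plausible realization of such a fusion network, sketched as a Perceiver-style resampler that compresses an expert's 4096 tokens into 64 learned query slots; MouSi's actual fusion design may differ.

```python
import torch
import torch.nn as nn

class ExpertFusionResampler(nn.Module):
    """Sketch of a fusion network that compresses a visual expert's long
    token sequence (e.g. SAM's 4096) into a few learned query slots
    (e.g. 64) via cross-attention. Illustrative, not MouSi's exact design.
    """
    def __init__(self, dim=1024, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, expert_tokens):
        # expert_tokens: (B, 4096, dim) -> fused slots: (B, 64, dim)
        q = self.queries.unsqueeze(0).expand(expert_tokens.size(0), -1, -1)
        fused, _ = self.attn(q, expert_tokens, expert_tokens)
        return self.proj(fused)
```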
arXiv Detail & Related papers (2024-01-30T18:09:11Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- A Simple Long-Tailed Recognition Baseline via Vision-Language Model [92.2866546058082]
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems.
Recent advances in contrastive visual-language pretraining shed light on a new pathway for visual recognition.
We propose BALLAD to leverage contrastive vision-language models for long-tailed recognition.
arXiv Detail & Related papers (2021-11-29T17:49:24Z)