SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning
- URL: http://arxiv.org/abs/2602.01990v1
- Date: Mon, 02 Feb 2026 11:47:06 GMT
- Title: SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning
- Authors: Zhen-Hao Xie, Jun-Tao Tang, Yu-Cheng Shi, Han-Jia Ye, De-Chuan Zhan, Da-Wei Zhou
- Abstract summary: We propose StAbilized Mixture-of-Experts (SAME) for Multimodal Continual Instruction Tuning (MCIT). SAME stabilizes expert selection by decomposing routing dynamics into orthogonal subspaces and updating only task-relevant directions. It also introduces adaptive expert activation to freeze selected experts during training, reducing redundant computation and cross-task interference.
- Score: 83.66308307152808
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually expand their capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. Recent methods leverage sparse expert routing to promote task specialization, but we find that the expert routing process suffers from drift as the data distribution evolves. For example, a grounding query that previously activated localization experts may instead be routed to irrelevant experts after learning OCR tasks. Meanwhile, the grounding-related experts can be overwritten by new tasks and lose their original functionality. Such failure reflects two problems: router drift, where expert selection becomes inconsistent over time, and expert drift, where shared experts are overwritten across tasks. Therefore, we propose StAbilized Mixture-of-Experts (SAME) for MCIT. To address router drift, SAME stabilizes expert selection by decomposing routing dynamics into orthogonal subspaces and updating only task-relevant directions. To mitigate expert drift, we regulate expert updates via curvature-aware scaling using historical input covariance in a rehearsal-free manner. SAME also introduces adaptive expert activation to freeze selected experts during training, reducing redundant computation and cross-task interference. Extensive experiments demonstrate its SOTA performance.
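The router-drift remedy described above ("updating only task-relevant directions" in "orthogonal subspaces") can be pictured with a standard gradient-projection trick. The orthonormal past-task basis `U` and the projection rule below are illustrative assumptions, not SAME's actual decomposition:

```python
import numpy as np

def project_out(router_grad, past_basis):
    """Remove the gradient components lying in the subspace spanned by
    past-task directions (columns of past_basis, assumed orthonormal),
    so the router update cannot disturb routing learned for old tasks."""
    return router_grad - past_basis @ (past_basis.T @ router_grad)

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(8, 3)))  # orthonormal basis of past-task directions
g = rng.normal(size=8)                        # raw router gradient for the new task
g_safe = project_out(g, U)

# the projected update has no component along any past direction
assert np.allclose(U.T @ g_safe, 0.0, atol=1e-8)
```

Under this reading, the router still learns freely in the directions the old tasks never used, which is what keeps old expert selections stable.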
Related papers
- Phase-Aware Mixture of Experts for Agentic Reinforcement Learning [23.18318273534301]
A plausible remedy could be employing the Mixture-of-Experts (MoE) architecture in the policy network. MoE allows different parameters (experts) to specialize in different tasks, preventing simple tasks from dominating all parameters. We propose Phase-Aware Mixture of Experts (PA-MoE). It first features a lightweight phase router that learns latent phase boundaries directly from the RL objective without pre-defining phase categories. Then, the phase router allocates temporally consistent assignments to the same expert, allowing experts to preserve phase-specific expertise.
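The sparse expert routing that PA-MoE and the papers below build on reduces, in its plainest form, to top-k gating over expert logits. This is a generic sketch of that mechanism, not PA-MoE's phase router, whose details the summary does not give:

```python
import numpy as np

def topk_route(x, W_gate, k=2):
    """Standard top-k sparse MoE gating: score all experts with a
    linear gate, keep the k highest-scoring ones, and renormalise
    their softmax weights so the kept weights sum to one."""
    logits = x @ W_gate                       # one logit per expert
    top = np.argsort(logits)[-k:]             # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    return top, w / w.sum()

rng = np.random.default_rng(1)
x = rng.normal(size=4)                        # a token representation
W = rng.normal(size=(4, 8))                   # gate for 8 experts
experts, weights = topk_route(x, W, k=2)
assert len(experts) == 2 and abs(weights.sum() - 1.0) < 1e-9
```

Router drift, in these terms, is the gate `W_gate` shifting so that the same `x` lands on a different `top` set after later tasks.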
arXiv Detail & Related papers (2026-02-19T03:18:30Z)
- SMES: Towards Scalable Multi-Task Recommendation via Expert Sparsity [47.79376327982703]
Industrial recommender systems rely on multi-task learning to estimate diverse user feedback signals and aggregate them for ranking. Recent advances in model scaling have shown promising gains in recommendation. This mismatch between uniform parameter scaling and heterogeneous task capacity demands poses a fundamental challenge for scalable multi-task recommendation.
arXiv Detail & Related papers (2026-02-10T03:56:12Z)
- ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization [13.182475975397251]
ERMoE is a sparse MoE transformer that replaces learned gating logits with an "Eigenbasis Score". We show that ERMoE achieves state-of-the-art accuracy on ImageNet classification and cross-modal image-text retrieval benchmarks. A 3D MRI variant (ERMoE-ba) improves brain age prediction accuracy by more than 7% and yields interpretable expert specializations.
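One plausible reading of an "Eigenbasis Score" is to rate each expert by how much of the input's energy falls inside that expert's orthonormal basis, instead of using a learned gate. The bases and scoring rule below are illustrative assumptions, not ERMoE's exact construction:

```python
import numpy as np

def eigenbasis_scores(x, expert_bases):
    """Score each expert by the norm of the input's projection onto
    that expert's orthonormal basis: inputs aligned with an expert's
    subspace score highest, with no learned gating logits involved."""
    return np.array([np.linalg.norm(U.T @ x) for U in expert_bases])

rng = np.random.default_rng(2)
bases = [np.linalg.qr(rng.normal(size=(6, 2)))[0] for _ in range(4)]
x = bases[1] @ np.array([3.0, 0.0])   # input lying entirely in expert 1's subspace
scores = eigenbasis_scores(x, bases)
assert scores.argmax() == 1           # expert 1 wins for its own subspace
```

Because no score can exceed the input norm, an input fully inside one expert's subspace always routes there, which is one way routing can stay stable without a trainable gate.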
arXiv Detail & Related papers (2025-11-14T05:31:37Z)
- Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder [59.89996751196727]
Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models. SAEs' hidden layers have high dimensionality to satisfy sparsity constraints, resulting in prohibitive training and inference costs. Recent Mixture of Experts (MoE) approaches attempt to address this by decomposing SAEs into narrower expert networks with gated activation. We propose two key innovations: (1) Multiple Expert Activation, which simultaneously engages semantically weighted expert subsets to encourage specialization, and (2) Feature Scaling, which enhances diversity through adaptive high-frequency scaling.
arXiv Detail & Related papers (2025-11-07T22:19:34Z) - Adaptive Shared Experts with LoRA-Based Mixture of Experts for Multi-Task Learning [49.90176890917986]
Mixture-of-Experts (MoE) has emerged as a powerful framework for multi-task learning (MTL). Existing MoE-MTL methods often rely on single-task pretrained backbones and suffer from redundant adaptation and inefficient knowledge sharing. We propose adaptive shared experts (ASE) within a low-rank adaptation (LoRA) based MoE, where shared experts are assigned router-computed gating weights jointly normalized with sparse experts.
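The "jointly normalized" gating can be sketched as a single softmax taken over the selected sparse experts plus the shared expert, so the shared path competes for weight rather than receiving a fixed share. The top-k selection and the single shared logit below are assumptions for illustration, not ASE's exact formulation:

```python
import numpy as np

def joint_gate(sparse_logits, shared_logit, k=2):
    """Select the top-k sparse experts, then softmax their logits
    together with the shared expert's logit, so all gating weights
    (sparse and shared) are normalised jointly and sum to one."""
    top = np.argsort(sparse_logits)[-k:]
    logits = np.append(sparse_logits[top], shared_logit)
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    return top, w[:-1], w[-1]   # expert ids, sparse weights, shared weight

rng = np.random.default_rng(3)
top, sparse_w, shared_w = joint_gate(rng.normal(size=6), 0.5, k=2)
assert abs(sparse_w.sum() + shared_w - 1.0) < 1e-9
```

A large shared logit then automatically shrinks the sparse experts' weights, which is the behaviour a separately normalized shared branch cannot express.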
arXiv Detail & Related papers (2025-10-01T06:49:19Z) - Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs [54.95810313530111]
DERN is a task-agnostic and retraining-free framework for expert pruning and reconstruction. It improves performance by more than 5% on commonsense reasoning and MMLU benchmarks under 50% expert sparsity.
arXiv Detail & Related papers (2025-09-12T16:09:39Z) - SEE: Continual Fine-tuning with Sequential Ensemble of Experts [25.96255683276355]
Continual fine-tuning of large language models (LLMs) suffers from catastrophic forgetting. We introduce the Sequential Ensemble of Experts (SEE) framework. SEE removes the need for an additional router, allowing each expert to independently decide whether a query should be handled.
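Router-free dispatch can be sketched as each expert applying its own acceptance test in sequence, with a fallback when no expert claims the query. The keyword checks and fallback rule below stand in for whatever per-expert decision rule SEE actually learns:

```python
def see_dispatch(query, experts):
    """Ask each expert in order whether it claims the query via its
    own acceptance test; if none claims it, fall back to the last
    (newest) expert. No shared router is consulted at any point."""
    for expert in experts:
        if expert["accepts"](query):
            return expert["name"]
    return experts[-1]["name"]

experts = [
    {"name": "math", "accepts": lambda q: "sum" in q},
    {"name": "code", "accepts": lambda q: "python" in q},
]
assert see_dispatch("compute the sum of 1..10", experts) == "math"
assert see_dispatch("translate this text", experts) == "code"  # fallback to newest
```

Because the decision lives inside each expert, adding an expert for a new task never perturbs how earlier experts judge their own queries.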
arXiv Detail & Related papers (2025-04-09T07:56:56Z) - LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models [21.888139819188105]
LLaVA-CMoE is a continual learning framework for large vision-language models. A Probe-Guided Knowledge Extension mechanism determines when and where new experts should be added, and a Probabilistic Task Locator assigns each task a dedicated, lightweight router.
arXiv Detail & Related papers (2025-03-27T07:36:11Z) - Complexity Experts are Task-Discriminative Learners for Any Image Restoration [80.46313715427928]
We introduce "complexity experts" -- flexible expert blocks with varying computational complexity and receptive fields. This preference effectively drives task-specific allocation, assigning tasks to experts with the appropriate complexity. The proposed MoCE-IR model outperforms state-of-the-art methods, affirming its efficiency and practical applicability.
arXiv Detail & Related papers (2024-11-27T15:58:07Z) - Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable
Transformers [107.3726071306935]
We propose a new plug-and-play training framework, SMoE-Dropout, to enable scaling transformers to better accuracy in their full capacity without collapse.
SMoE-Dropout consists of a randomly initialized and fixed router network that activates experts, gradually increasing the number of activated experts as training progresses.
Our experiments demonstrate the superior performance and substantial computation savings of SMoE-Dropout, compared to dense training baselines with equivalent parameter counts.
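The "gradually increases the activated expert number" schedule can be sketched as a simple ramp over training steps. The linear shape and the `k_start` parameter are assumptions for illustration, not SMoE-Dropout's exact schedule:

```python
def active_expert_count(step, total_steps, num_experts, k_start=2):
    """Grow the number of activated experts linearly from k_start at
    the start of training to num_experts at the end, so the model is
    trained under every sparsity level it may be slimmed to later."""
    frac = min(step / max(total_steps, 1), 1.0)
    return min(num_experts, k_start + int(frac * (num_experts - k_start)))

assert active_expert_count(0, 100, 8) == 2     # sparse early training
assert active_expert_count(50, 100, 8) == 5    # half-way through the ramp
assert active_expert_count(100, 100, 8) == 8   # full capacity at the end
```

Pairing such a schedule with a fixed random router means the growing expert set is the only moving part, which is what lets the trained model be "self-slimmable" to any intermediate expert count.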
arXiv Detail & Related papers (2023-03-02T22:12:51Z) - Learning from Guided Play: Improving Exploration for Adversarial
Imitation Learning with Simple Auxiliary Tasks [8.320969283401233]
We show that the standard, naive approach to exploration can manifest as a suboptimal local maximum.
We present Learning from Guided Play (LfGP), a framework in which we leverage expert demonstrations of multiple exploratory, auxiliary tasks.
arXiv Detail & Related papers (2022-12-30T20:38:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.