Fugu-MT Paper Translation (Abstract): MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping

Paper overview: MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping

arxiv url: http://arxiv.org/abs/2511.15690v1
Date: Wed, 19 Nov 2025 18:48:27 GMT
Status: Translation complete
Last updated in system: 2025-11-20 15:51:28.941555
Title: MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
Authors: Yushi Huang, Zining Wang, Zhihang Yuan, Yifu Ding, Ruihao Gong, Jinyang Guo, Xianglong Liu, Jun Zhang
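Abstract summary: We propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. MoDES significantly enhances inference speed, improving the prefilling time by 2.16$\times$ and the decoding time by 1.26$\times$.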
Reference score (independently computed attention metric): 52.02659589971978
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mixture-of-Experts (MoE) multimodal large language models (MLLMs) excel at vision-language tasks, but they suffer from computational inefficiency. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods, originally designed for unimodal large language models (LLMs), to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method is then applied, which processes tokens from each modality separately, to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments for 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% of experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, improving the prefilling time by 2.16$\times$ and the decoding time by 1.26$\times$.
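Illustrative sketch (not from the paper): the abstract does not give the exact GMLG or DMT formulas, so the Python below only shows one plausible reading. It assumes the per-token expert importance is the product of a per-layer global weight and the local routing probability, and that an expert is skipped when this score falls below a threshold chosen by the token's modality. The function name `skip_mask`, the parameters `layer_importance`, `tau_text`, and `tau_vision`, and all numeric values are hypothetical.

```python
from typing import List, Sequence


def skip_mask(
    routing_probs: Sequence[Sequence[float]],
    is_vision: Sequence[bool],
    layer_importance: float,
    tau_text: float,
    tau_vision: float,
) -> List[List[bool]]:
    """Return, per token, a list of flags that are True where an expert is skipped.

    routing_probs[i][j] is the router probability of the j-th selected expert
    for token i in one MoE layer. The multiplicative score below is an
    assumption made for illustration, not the paper's confirmed formula.
    """
    mask: List[List[bool]] = []
    for probs, vision in zip(routing_probs, is_vision):
        # Dual-modality thresholding: each modality gets its own threshold.
        tau = tau_vision if vision else tau_text
        # Globally modulated local score: layer weight times routing probability.
        mask.append([layer_importance * p < tau for p in probs])
    return mask


# Two tokens with identical routing probabilities but different modalities.
example = skip_mask(
    routing_probs=[[0.7, 0.3], [0.7, 0.3]],
    is_vision=[False, True],
    layer_importance=0.5,  # hypothetical global weight for this layer
    tau_text=0.1,          # hypothetical threshold for text tokens
    tau_vision=0.2,        # hypothetical threshold for vision tokens
)
print(example)
```

With these made-up values the text token keeps both experts, while the vision token drops its lower-scoring expert, which is the kind of modality-dependent behavior the abstract attributes to DMT. Which modality should receive the stricter threshold is not stated in the abstract; the choice here is arbitrary.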


Related papers