Fugu-MT Paper Translation (Summary): MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts

Paper summary: MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts

arxiv url: http://arxiv.org/abs/2510.12357v1
Date: Tue, 14 Oct 2025 10:22:44 GMT
Status: Translation complete
Last updated in system: 2025-10-15 19:02:32.276676
Title: MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts
Title (reference translation): MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPUs with a Mixture of Big-Little Experts
Authors: Yushu Zhao, Yubin Qin, Yang Wang, Xiaolong Yang, Huiming Han, Shaojun Wei, Yang Hu, Shouyi Yin
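Abstract summary: MoBiLE is a plug-and-play, offloading-based MoE inference framework with a mixture of big-little experts. MoBiLE achieves a speedup of 1.60x to 1.72x over the baseline on a consumer GPU system, with negligible degradation in accuracy.
Reference score (independently computed attention score): 17.518573710849513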
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Mixture-of-Experts (MoE) models have recently demonstrated exceptional performance across a diverse range of applications. The principle of sparse activation in MoE models facilitates an offloading strategy, wherein active experts are maintained in GPU HBM, while inactive experts are stored in CPU DRAM. The efficacy of this approach, however, is fundamentally constrained by the limited bandwidth of the CPU-GPU interconnect. To mitigate this bottleneck, existing approaches have employed prefetching to accelerate MoE inference. These methods attempt to predict and prefetch the required experts using specially trained modules. Nevertheless, such techniques are often encumbered by significant training overhead and have shown diminished effectiveness on recent MoE models with fine-grained expert segmentation. In this paper, we propose MoBiLE, a plug-and-play offloading-based MoE inference framework with a mixture of big-little experts. It reduces the number of experts for unimportant tokens to half for acceleration while maintaining full experts for important tokens to guarantee model quality. Further, a dedicated fallback and prefetching mechanism is designed for switching between little and big experts to improve memory efficiency. We evaluate MoBiLE on four typical modern MoE architectures and challenging generative tasks. Our results show that MoBiLE achieves a speedup of 1.60x to 1.72x compared to the baseline on a consumer GPU system, with negligible degradation in accuracy.
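Abstract (reference translation): Mixture-of-Experts (MoE) models have recently demonstrated exceptional performance across a wide range of applications. The sparse-activation principle of MoE models enables an offloading strategy in which active experts are kept in GPU HBM and inactive experts are stored in CPU DRAM. However, the effectiveness of this approach is fundamentally limited by the restricted bandwidth of the CPU-GPU interconnect. To ease this bottleneck, existing approaches use prefetching to accelerate MoE inference. These methods try to predict and prefetch the required experts using specially trained modules. Nevertheless, such techniques incur considerable training overhead and are less effective on recent MoE models with fine-grained expert segmentation. This paper proposes MoBiLE, a plug-and-play, offloading-based MoE inference framework with a mixture of big-little experts. It halves the number of experts for unimportant tokens to speed up inference, while keeping the full set of experts for important tokens to guarantee model quality. In addition, a dedicated fallback and prefetching mechanism is designed for switching between little and big experts to improve memory efficiency. MoBiLE is evaluated on four typical modern MoE architectures and challenging generative tasks. The results show that MoBiLE achieves a 1.60x to 1.72x speedup over the baseline on a consumer GPU system, with negligible degradation in accuracy.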
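
The big-little idea in the abstract (full experts for important tokens, half as many for unimportant ones) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the renormalization of the routing weights, and the choice to keep the highest-scoring half of the experts are all assumptions made for illustration. The abstract does not say how a token is classified as important, so the sketch takes that decision as an input flag.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k, is_important):
    """Pick experts for one token (hypothetical helper, not from the paper).

    Important tokens keep the full top_k experts ("big").
    Other tokens use top_k // 2 experts ("little").
    Assumption: the little set is the highest-scoring half, and the
    routing weights are renormalized over the chosen experts.
    """
    probs = softmax(router_logits)
    k = top_k if is_important else max(1, top_k // 2)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    norm = sum(probs[i] for i in chosen)
    weights = [probs[i] / norm for i in chosen]
    return chosen, weights


# Example: 8 experts, top-4 routing.
logits = [2.0, 1.0, 0.5, 0.1, -1.0, 3.0, 0.0, -0.5]
big_ids, big_w = route_token(logits, top_k=4, is_important=True)
little_ids, little_w = route_token(logits, top_k=4, is_important=False)
print("big:", big_ids)
print("little:", little_ids)
```

In this sketch the little set is a prefix of the big set, so switching a token from little to big would only require the experts that are not yet loaded. Whether MoBiLE's fallback mechanism works this way is not stated in the abstract.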


Related papers