Fugu-MT paper translation (abstract): Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism

Paper abstract: Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism

arxiv url: http://arxiv.org/abs/2509.08342v1
Date: Wed, 10 Sep 2025 07:28:24 GMT
Status: translation complete
Last updated in system: 2025-09-11 15:16:52.339831
Title: Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism
Authors: Jiaming Yan, Jianchun Liu, Hongli Xu, Liusheng Huang
Abstract summary: MoEpic is an efficient MoE inference system with a novel expert split mechanism. Experiments on popular MoE LLMs show that MoEpic can save about half of the GPU cost.
Reference score (independently computed attention score): 29.862588578556366
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Mixture-of-Experts (MoE) has emerged as a promising architecture for modern large language models (LLMs). However, massive parameters impose heavy GPU memory (i.e., VRAM) demands, hindering the widespread adoption of MoE LLMs. Offloading the expert parameters to CPU RAM offers an effective way to alleviate the VRAM requirements for MoE inference. Existing approaches typically cache a small subset of experts in VRAM and dynamically prefetch experts from RAM during inference, leading to significant degradation in inference speed due to the poor cache hit rate and substantial expert loading latency. In this work, we propose MoEpic, an efficient MoE inference system with a novel expert split mechanism. Specifically, each expert is vertically divided into two segments: top and bottom. MoEpic caches the top segment of hot experts, so that more experts will be stored under the limited VRAM budget, thereby improving the cache hit rate. During each layer's inference, MoEpic predicts and prefetches the activated experts for the next layer. Since the top segments of cached experts are exempt from fetching, the loading time is reduced, which allows efficient transfer-computation overlap. Nevertheless, the performance of MoEpic critically depends on the cache configuration (i.e., each layer's VRAM budget and expert split ratio). To this end, we propose a divide-and-conquer algorithm based on fixed-point iteration for adaptive cache configuration. Extensive experiments on popular MoE LLMs demonstrate that MoEpic can save about half of the GPU cost, while lowering the inference latency by about 37.51%-65.73% compared to the baselines.
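The abstract's central claim is that caching only the top segment of each hot expert lets more experts fit in the same VRAM budget, while a cache hit only needs the remaining bottom segment to be fetched. The toy cost model below is a minimal sketch of that arithmetic, not MoEpic's implementation. The class name, the field names, the interpretation of the split ratio as "fraction of an expert held in the top segment", and the single fixed transfer bandwidth are all assumptions made for illustration; the abstract does not define them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitCacheModel:
    """Hypothetical per-layer cost model; every field is an illustrative assumption."""

    expert_bytes: int        # size of one full expert
    vram_budget_bytes: int   # VRAM reserved for this layer's expert cache
    split_ratio: float       # assumed: fraction of an expert in its top segment, in (0, 1]
    bytes_per_second: float  # assumed constant CPU RAM -> VRAM transfer bandwidth

    def cached_experts(self) -> int:
        """Number of experts whose top segment fits in the VRAM budget."""
        top_bytes = int(self.expert_bytes * self.split_ratio)
        return self.vram_budget_bytes // top_bytes

    def load_seconds(self, cached: bool) -> float:
        """Transfer time needed before one expert is fully available on the GPU."""
        # A cached expert only misses its bottom segment; an uncached one misses everything.
        missing_fraction = (1.0 - self.split_ratio) if cached else 1.0
        return self.expert_bytes * missing_fraction / self.bytes_per_second


if __name__ == "__main__":
    # Same budget, two split ratios: whole experts cached vs. only the top half cached.
    for ratio in (1.0, 0.5):
        model = SplitCacheModel(
            expert_bytes=100,
            vram_budget_bytes=400,
            split_ratio=ratio,
            bytes_per_second=100.0,
        )
        print(ratio, model.cached_experts(), model.load_seconds(cached=True))
```

Under these assumptions, halving the split ratio doubles the number of experts with a cache entry, but each hit then still requires half an expert to be transferred. That tension is one plausible reading of why the abstract says performance depends critically on each layer's VRAM budget and expert split ratio.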
