Fugu-MT Paper Translation (Abstract): MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts

Paper overview: MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts

arxiv url: http://arxiv.org/abs/2511.14102v1
Date: Tue, 18 Nov 2025 03:40:19 GMT
Status: Translation complete
Last updated in system: 2025-11-19 16:23:52.914387
Title: MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts
Title (reference translation): MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts
Authors: Wenfeng Wang, Jiacheng Liu, Xiaofeng Hou, Xinfeng Xia, Peng Tang, Mingxuan Zhang, Chao Li, Minyi Guo
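Abstract summary: The proposed MoE-SpeQ is a new inference system that co-designs speculative execution and expert offloading. MoE-SpeQ uses a small on-device draft model to predict the sequence of experts required for future tokens. On the Phi-MoE model, MoE-SpeQ achieves up to a 2.34x speedup over the state-of-the-art offloading framework.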
Reference score (independently computed attention score): 29.437264687850874
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The immense memory requirements of state-of-the-art Mixture-of-Experts (MoE) models present a significant challenge for inference, often exceeding the capacity of a single accelerator. While offloading experts to host memory is a common solution, it introduces a severe I/O bottleneck over the PCIe bus, as the data-dependent nature of expert selection places these synchronous transfers directly on the critical path of execution, crippling performance. This paper argues that the I/O bottleneck can be overcome by trading a small amount of cheap, on-device computation to hide the immense cost of data movement. We present MoE-SpeQ, a new inference system built on a novel co-design of speculative execution and expert offloading. MoE-SpeQ employs a small, on-device draft model to predict the sequence of required experts for future tokens. This foresight enables a runtime orchestrator to prefetch these experts from host memory, effectively overlapping the expensive I/O with useful computation and hiding the latency from the critical path. To maximize performance, an adaptive governor, guided by an Amortization Roofline Model, dynamically tunes the speculation strategy to the underlying hardware. Our evaluation on memory-constrained devices shows that for the Phi-MoE model, MoE-SpeQ achieves at most 2.34x speedup over the state-of-the-art offloading framework. Our work establishes a new, principled approach for managing data-dependent memory access in resource-limited environments, making MoE inference more accessible on commodity hardware.
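Abstract (reference translation): The enormous memory requirements of state-of-the-art Mixture-of-Experts (MoE) models pose a significant challenge for inference and often exceed the capacity of a single accelerator. Offloading experts to host memory is a common solution, but because expert selection is data-dependent, the resulting synchronous transfers sit directly on the critical path of execution, which creates a severe I/O bottleneck on the PCIe bus and degrades performance. This paper argues that the I/O bottleneck can be overcome by spending a small amount of cheap on-device computation to hide the large cost of data movement. The proposed MoE-SpeQ is a new inference system that co-designs speculative execution and expert offloading. MoE-SpeQ uses a small on-device draft model to predict the sequence of experts required for future tokens. This foresight lets a runtime orchestrator prefetch those experts from host memory, overlapping the expensive I/O with useful computation and hiding its latency from the critical path. To maximize performance, an adaptive governor guided by an Amortization Roofline Model dynamically tunes the speculation strategy to the underlying hardware. On the Phi-MoE model, MoE-SpeQ achieves up to a 2.34x speedup over the state-of-the-art offloading framework. The work establishes a new, principled approach to managing data-dependent memory access in resource-limited environments, making MoE inference more accessible on commodity hardware.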
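
The prefetching idea in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the MoE-SpeQ implementation: the names ExpertCache and count_blocking_loads are hypothetical, the LRU eviction policy is an assumption made for illustration, and the quantized draft model, the runtime orchestrator, and the adaptive governor are not modeled. The draft model is replaced by a plain list of predicted expert IDs per step, and a "blocking load" stands in for a synchronous PCIe transfer on the critical path.

```python
from collections import OrderedDict


class ExpertCache:
    """Fixed-capacity LRU cache standing in for accelerator memory (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()

    def __contains__(self, expert_id):
        return expert_id in self.slots

    def load(self, expert_id):
        """Bring an expert on device, or refresh it if it is already resident."""
        if expert_id in self.slots:
            self.slots.move_to_end(expert_id)
            return
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)  # evict the least recently used expert
        self.slots[expert_id] = True


def count_blocking_loads(routing, predictions, capacity):
    """Count expert loads that would stall decoding.

    routing[t]     -- experts the full model actually selects at step t
    predictions[t] -- experts a (hypothetical) draft model guesses for step t + 1
    """
    cache = ExpertCache(capacity)
    blocking = 0
    for step, needed in enumerate(routing):
        for expert_id in needed:
            if expert_id not in cache:
                blocking += 1  # synchronous transfer on the critical path
            cache.load(expert_id)
        # Prefetch for the next step; a real system would overlap this with compute.
        for expert_id in predictions[step]:
            cache.load(expert_id)
    return blocking


routing = [[0, 1], [2, 3], [0, 1], [4, 5]]
no_prefetch = [[], [], [], []]
perfect_prefetch = routing[1:] + [[]]  # the draft guesses every next step correctly

print(count_blocking_loads(routing, no_prefetch, capacity=4))       # 6
print(count_blocking_loads(routing, perfect_prefetch, capacity=4))  # 2
```

In this toy setting, only the first step's experts still stall when predictions are perfect, because nothing can be prefetched before decoding starts. A misprediction costs one blocking load for the missed expert and wastes one cache slot on the wrongly prefetched one.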


Related papers