Fugu-MT Paper Translation (Abstract): WiSP: A Working-Set View of Mixture-of-Experts Serving on Extremely Low-Resource Hardware

Paper abstract: WiSP: A Working-Set View of Mixture-of-Experts Serving on Extremely Low-Resource Hardware

arxiv url: http://arxiv.org/abs/2606.21868v1
Date: Sat, 20 Jun 2026 04:10:34 GMT
Status: Translation complete
System update date: 2026-06-26 02:35:54.048386
Title: WiSP: A Working-Set View of Mixture-of-Experts Serving on Extremely Low-Resource Hardware
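Title (reference translation): WiSP: A Working-Set View of Mixture-of-Experts Serving on Extremely Low-Resource Hardware
Authors: Jiamu Zhang, Liang Wu, Mayank Darbari, Liangjie Hong
Abstract summary: Mixture-of-Experts (MoE) models place most of their parameters in expert layers, yet use only a small fraction of those experts for any token. The paper shows how two streams of memory demand, routed expert weights and the KV cache, compete for limited VRAM. By keeping resident only the experts a workload reuses, WiSP reaches up to 1.95x the decode throughput of static offload.
Reference score (independently computed attention score): 7.113530862077688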
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern Mixture-of-Experts (MoE) models place most of their parameters in expert layers, yet only a small fraction of those experts are used for any token. The unused weights must still be stored where the GPU can reach them. On commodity GPUs the common fix is layer-level CPU offloading, which keeps memory low but streams all of a layer's experts across PCIe on every forward pass, losing much of MoE's sparsity benefit. We cast low-resource MoE serving as a working-set management problem on the GPU: routed expert weights and the key-value (KV) cache are two streams of memory demand competing for limited VRAM. We realize this in WiSP (Working-Set Paging), a routing-aware expert pager that plugs into an unmodified serving engine with byte-identical outputs. Keeping resident only the experts a workload reuses, WiSP reaches up to 1.95x the decode throughput of static offload at the same memory budget when the model does not fit. We also find that prefetching experts from predicted routing helps little in single-stream decode: the bottleneck is PCIe bandwidth, not prediction accuracy. This shifts the question from prefetching to allocation: how should VRAM be split between experts and the KV cache? We answer with MV-WSA (Marginal-Value Working-Set Allocation), which equalizes marginal latency benefit per byte subject to a KV admission floor. MV-WSA runs either as an offline configurator or as an online controller that resizes both pools while serving. In real serving the offline configurator is the only policy we test that does well on both prefill and decode; in trace-driven simulation it stays within a few percent of a per-workflow oracle while fixed splits are about 20% worse. The online controller adds a further 1.20x without changing model outputs.
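Abstract (reference translation): Modern Mixture-of-Experts (MoE) models place most of their parameters in expert layers, yet use only a small fraction of those experts for any token. The unused weights must still be stored where the GPU can reach them. On commodity GPUs the common fix is layer-level CPU offloading, which keeps memory low but streams all of a layer's experts across PCIe on every forward pass. Routed expert weights and the key-value (KV) cache are two streams of memory demand competing for limited VRAM. We realize this in WiSP (Working-Set Paging), a routing-aware expert pager that plugs into an unmodified serving engine with byte-identical outputs. Keeping resident only the experts a workload reuses, WiSP reaches up to 1.95x the decode throughput of static offload at the same memory budget when the model does not fit. Prefetching experts from predicted routing helps little in single-stream decode: the bottleneck is PCIe bandwidth, not prediction accuracy. How should VRAM be split between experts and the KV cache? MV-WSA (Marginal-Value Working-Set Allocation) equalizes marginal latency benefit per byte subject to a KV admission floor. MV-WSA runs either as an offline configurator or as an online controller that resizes both pools while serving. In trace-driven simulation the offline configurator stays within a few percent of a per-workflow oracle, while fixed splits are about 20% worse. The online controller adds a further 1.20x without changing model outputs.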
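
The abstract describes MV-WSA as equalizing marginal latency benefit per byte subject to a KV admission floor, but gives no algorithm. The following is only a minimal sketch of that idea under stated assumptions, not the authors' method: it grants the floor to the KV pool first, then hands out the remaining VRAM in fixed chunks to whichever pool reports the larger marginal gain. The function name `allocate_vram`, the chunk size `step_bytes`, and the two gain callbacks are all hypothetical; in practice the gains would have to be measured or estimated from a workload trace.

```python
from typing import Callable, Tuple


def allocate_vram(
    total_bytes: int,
    kv_floor_bytes: int,
    step_bytes: int,
    expert_gain: Callable[[int], float],
    kv_gain: Callable[[int], float],
) -> Tuple[int, int]:
    """Greedily split a VRAM budget between an expert pool and a KV pool.

    expert_gain(b) and kv_gain(b) are hypothetical estimates of the latency
    saved by growing a pool that currently holds b bytes by one more chunk.
    Returns (expert_bytes, kv_bytes).
    """
    if step_bytes <= 0:
        raise ValueError("step_bytes must be positive")
    if kv_floor_bytes > total_bytes:
        raise ValueError("KV admission floor exceeds the VRAM budget")

    # Assumption: the admission floor is reserved for the KV cache up front.
    expert_bytes, kv_bytes = 0, kv_floor_bytes
    remaining = total_bytes - kv_floor_bytes

    while remaining >= step_bytes:
        # Give the next chunk to the pool with the larger marginal gain.
        # Ties go to the expert pool (an arbitrary choice for this sketch).
        if expert_gain(expert_bytes) >= kv_gain(kv_bytes):
            expert_bytes += step_bytes
        else:
            kv_bytes += step_bytes
        remaining -= step_bytes

    return expert_bytes, kv_bytes
```

When both gain curves show diminishing returns, this greedy loop stops with the two marginal gains roughly equal, which is the equalization condition the abstract names. A budget remainder smaller than `step_bytes` is left unassigned.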

Related papers