Fugu-MT Paper Translation (Summary): ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving

Paper summary: ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving

arxiv url: http://arxiv.org/abs/2604.14626v2
Date: Thu, 23 Apr 2026 01:19:26 GMT
Status: Translation complete
System update date: 2026-04-24 14:40:05.980011
Title: ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
Title (reference translation): ELMoE-3D: Hybrid-Bonding-Based Self-Speculative Decoding That Exploits the Intrinsic Elasticity of MoE
Authors: Yuseon Choi, Jingu Lee, Jungjun Oh, Sunjoo Whang, Byeongcheol Kim, Minsung Kim, Hoi-Jun Yoo, Sangjin Kim
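Abstract summary: This paper proposes ELMoE-3D, a hybrid-bonding-based framework that unifies cache-based acceleration and speculative decoding. On our 3D-stacked hardware, ELMoE-3D achieves an average 6.6x speedup and 4.4x energy efficiency gain over naive MoE serving on xPU across batch sizes 1-16.
Reference score (attention score computed independently by this site): 3.8457393423256363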
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Mixture-of-Experts (MoE) models have become the dominant architecture for large-scale language models, yet on-premises serving remains fundamentally memory-bound as batching turns sparse per-token compute into dense memory activation. Memory-centric architectures (PIM, NMP) improve bandwidth but leave compute underutilized under MoE's low arithmetic intensity at high batch sizes. Speculative decoding (SD) trades idle compute for fewer target invocations, yet verification must load experts even for rejected tokens, severely limiting its benefit in MoE especially at low batch sizes. We propose ELMoE-3D, a hybrid-bonding (HB)-based HW-SW co-designed framework that unifies cache-based acceleration and speculative decoding to offer overall speedup across batch sizes. We identify two intrinsic elasticity axes of MoE, expert and bit, and jointly scale them to construct Elastic Self-Speculative Decoding (Elastic-SD), which serves as both an expert cache and a strongly aligned self-draft model accelerated by high HB bandwidth. Our LSB-augmented bit-sliced architecture exploits inherent redundancy in bit-slice representations to natively support bit-nested execution. On our 3D-stacked hardware, ELMoE-3D achieves an average $6.6\times$ speedup and $4.4\times$ energy efficiency gain over naive MoE serving on xPU across batch sizes 1-16, and delivers $2.2\times$ speedup and $1.4\times$ energy efficiency gain over the best-performing prior accelerator baseline.
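Abstract (reference translation): Mixture-of-Experts (MoE) models have become the dominant architecture for large-scale language models, but on-premises serving remains fundamentally memory-bound, because batching turns sparse per-token compute into dense memory activation. Memory-centric architectures (PIM, NMP) improve bandwidth, but at high batch sizes they leave compute underutilized given MoE's low arithmetic intensity. Speculative decoding (SD) trades idle compute for fewer target-model invocations, but verification must load experts even for rejected tokens, which severely limits its benefit for MoE, especially at low batch sizes. This paper proposes ELMoE-3D, a hybrid-bonding (HB)-based HW-SW co-designed framework that unifies cache-based acceleration and speculative decoding to deliver speedup across batch sizes. The authors identify two intrinsic elasticity axes of MoE, expert and bit, and scale them jointly to construct Elastic Self-Speculative Decoding (Elastic-SD), which serves as both an expert cache and a strongly aligned self-draft model accelerated by high HB bandwidth. Their LSB-augmented bit-sliced architecture exploits the inherent redundancy in bit-slice representations to natively support bit-nested execution. On their 3D-stacked hardware, ELMoE-3D achieves an average $6.6\times$ speedup and $4.4\times$ energy efficiency gain over naive MoE serving on xPU across batch sizes 1-16, and a $2.2\times$ speedup and $1.4\times$ energy efficiency gain over the best-performing prior accelerator baseline.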
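
The "bit" elasticity axis mentioned in the abstract can be illustrated with a generic bit-slicing sketch. This is not the paper's LSB-augmented architecture, whose details the abstract does not give. It only shows, under assumed parameters (unsigned 8-bit weights, a 4-bit MSB slice, and the hypothetical helper names `split_bit_slices`, `draft_weight`, `full_weight`), how a low-precision view of a weight can be nested inside its full-precision representation, so that a draft pass could read only the upper slice while a verification pass reads both.

```python
def split_bit_slices(w: int, total_bits: int = 8, msb_bits: int = 4) -> tuple[int, int]:
    """Split an unsigned integer weight into (MSB slice, LSB slice)."""
    if not 0 <= w < (1 << total_bits):
        raise ValueError("weight out of range for total_bits")
    lsb_bits = total_bits - msb_bits
    msb = w >> lsb_bits                  # upper msb_bits bits
    lsb = w & ((1 << lsb_bits) - 1)      # lower lsb_bits bits
    return msb, lsb


def draft_weight(msb: int, total_bits: int = 8, msb_bits: int = 4) -> int:
    """Low-precision view: MSB slice only, with the LSBs treated as zero."""
    return msb << (total_bits - msb_bits)


def full_weight(msb: int, lsb: int, total_bits: int = 8, msb_bits: int = 4) -> int:
    """Full-precision view: MSB and LSB slices recombined."""
    return (msb << (total_bits - msb_bits)) | lsb


def dot(weights: list[int], activations: list[int]) -> int:
    """Plain integer dot product."""
    return sum(w * a for w, a in zip(weights, activations))


# Illustrative values only; they are not taken from the paper.
weights = [200, 17, 96, 255]
activations = [1, 2, 3, 4]

slices = [split_bit_slices(w) for w in weights]
draft_out = dot([draft_weight(m) for m, _ in slices], activations)
full_out = dot([full_weight(m, l) for m, l in slices], activations)
lsb_correction = dot([l for _, l in slices], activations)
```

Because the full weight equals the draft weight plus the LSB slice, `full_out` equals `draft_out + lsb_correction`: the draft result is a coarse approximation that the remaining bits refine, which is the general idea behind reusing one stored weight set at two precisions.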

Related papers