Fugu-MT Paper Translation (Abstract): Multi-SPIN: Multi-Access Speculative Inference for Cooperative Token Generation at the Edge

Paper overview: Multi-SPIN: Multi-Access Speculative Inference for Cooperative Token Generation at the Edge

arxiv url: http://arxiv.org/abs/2606.04581v1
Date: Wed, 03 Jun 2026 08:16:14 GMT
Status: Translation complete
Last updated in system: 2026-06-04 20:44:18.625639
Title: Multi-SPIN: Multi-Access Speculative Inference for Cooperative Token Generation at the Edge
Title (reference translation): Multi-SPIN: Multi-Access Speculative Inference for Cooperative Token Generation at the Edge
Authors: Haotian Zheng, Zhanwei Wang, Mingyao Cui, Chang Cai, Hongyang Du, Kaibin Huang
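Abstract summary: Speculative inference was introduced as an efficient architecture for accelerating large language models (LLMs). This work proposes its distributed deployment to enable cooperative token generation in a multiuser edge system. Decomposition methods reduce the complex optimizations to tractable sub-problems. The analysis shows that, in the homogeneous case, the optimal bandwidth allocation compensates users with weaker computation and communication capabilities.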
Reference score (independently computed attention score): 47.08741228523925
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Speculative inference (SPIN) was originally developed as an efficient architecture to accelerate Large Language Models (LLMs). In this work, we propose its distributed deployment to enable cooperative token generation in a multiuser edge system; its advantage is to effectively balance computational loads between resource-constrained devices and servers. The resulting architecture, termed Multi-access SPIN (Multi-SPIN), utilizes on-device small language models to generate and upload candidate token drafts, while an edge server operates the LLM to verify them in parallel batches. Given the severe heterogeneity in users' computation and communication capabilities, the draft length emerges as a critical control variable that influences node-level computation loads and multi-access latency, thereby governing the sum token goodput. Consequently, considering frequency-division multiple access, we investigate the problem of multi-access draft control, a joint optimization of draft-length control and bandwidth allocation to maximize sum token goodput. We examine two cases: (1) homogeneous draft lengths across users to facilitate server-side batching, and (2) heterogeneous draft lengths to introduce a new dimension for goodput enhancement. By developing decomposition methods, we reduce these complex optimizations into tractable sub-problems, which allow efficient draft control algorithms to be derived in closed form. Our analysis shows that the optimal bandwidth allocation compensates users with weaker computation-and-communication capabilities in the homogeneous case due to the batching synchronization requirements, whereas its heterogeneous-case counterpart rewards users with higher acceptance rates by relaxing such requirements. Experiments using Llama-2 and Qwen3.5 model pairs across diverse tasks demonstrate that Multi-SPIN improves goodput by up to 88% over heterogeneity-agnostic baselines.
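Abstract (reference translation): Speculative inference (SPIN) was originally developed as an efficient architecture for accelerating Large Language Models (LLMs). This work proposes a distributed deployment of SPIN to enable cooperative token generation in a multiuser edge system; its advantage is that it effectively balances computational load between resource-constrained devices and servers. The resulting architecture, called Multi-access SPIN (Multi-SPIN), uses on-device small language models to generate and upload candidate token drafts, while an edge server runs the LLM to verify them in parallel batches. Given the severe heterogeneity in users' computation and communication capabilities, the draft length emerges as a critical control variable that affects node-level computation load and multi-access latency, and thereby governs the sum token goodput. Accordingly, assuming frequency-division multiple access, the paper studies multi-access draft control: a joint optimization of draft-length control and bandwidth allocation that maximizes sum token goodput. Two cases are examined: (1) homogeneous draft lengths across users, which facilitate server-side batching, and (2) heterogeneous draft lengths, which introduce a new dimension for improving goodput. Decomposition methods reduce these complex optimizations to tractable sub-problems, allowing efficient draft control algorithms to be derived in closed form. The analysis shows that, in the homogeneous case, the optimal bandwidth allocation compensates users with weaker computation and communication capabilities because of the batching synchronization requirement, whereas in the heterogeneous case it rewards users with higher acceptance rates by relaxing that requirement. Experiments with Llama-2 and Qwen3.5 model pairs across diverse tasks show that Multi-SPIN improves goodput by up to 88% over heterogeneity-agnostic baselines.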
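
Illustrative sketch (not from the paper): the abstract identifies the draft length as the control variable that trades device-side computation and upload latency against the number of tokens accepted per verification round. The toy Python model below makes that trade-off concrete for the homogeneous case. The expected-token formula is the standard speculative-decoding result under an i.i.d. acceptance assumption. Everything else is a simplifying assumption made here for illustration: the latency model, the fixed per-token upload times (bandwidth is treated as already allocated), and all function and parameter names. The paper derives closed-form draft control algorithms; the brute-force search here does not reproduce them.

```python
from typing import Sequence


def expected_tokens(alpha: float, draft_len: int) -> float:
    """Expected tokens gained per verification round.

    Assumes each drafted token is accepted independently with
    probability alpha (an illustrative assumption).
    """
    if alpha >= 1.0:
        # Every draft token is accepted, plus one token from the verifier.
        return float(draft_len + 1)
    return (1.0 - alpha ** (draft_len + 1)) / (1.0 - alpha)


def sum_goodput(
    draft_len: int,
    alphas: Sequence[float],
    draft_times: Sequence[float],
    upload_times: Sequence[float],
    verify_time: float,
) -> float:
    """Sum token goodput (tokens per unit time) for one shared draft length.

    Hypothetical latency model: each user spends draft_len * (draft time +
    upload time) per round, and the batch waits for the slowest user before
    the server verifies all drafts in verify_time.
    """
    slowest = max(draft_len * (d + u) for d, u in zip(draft_times, upload_times))
    round_latency = slowest + verify_time
    tokens = sum(expected_tokens(a, draft_len) for a in alphas)
    return tokens / round_latency


def best_homogeneous_draft_len(
    alphas: Sequence[float],
    draft_times: Sequence[float],
    upload_times: Sequence[float],
    verify_time: float,
    max_len: int = 16,
) -> int:
    """Brute-force search over draft lengths 1..max_len."""
    return max(
        range(1, max_len + 1),
        key=lambda n: sum_goodput(n, alphas, draft_times, upload_times, verify_time),
    )


if __name__ == "__main__":
    # Two hypothetical users: one fast with high acceptance, one slow with lower acceptance.
    print(best_homogeneous_draft_len([0.9, 0.6], [0.01, 0.03], [0.005, 0.02], 0.2))
```

In this toy model the slowest user sets the round latency, which is one way to see why, in the homogeneous case, shifting bandwidth toward weaker users can raise the sum goodput.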

List of related papers