Fugu-MT Paper Translation (Abstract): Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing

Paper summary: Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing

arxiv url: http://arxiv.org/abs/2511.04002v1
Date: Thu, 06 Nov 2025 02:55:07 GMT
Status: Translation complete
Last updated in system: 2025-11-07 20:17:53.283278
Title: Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing
Authors: Mingyu Sung, Vikas Palakonda, Suhwan Im, Sunghwan Moon, Il-Min Kim, Sangseok Yun, Jae-Mo Kang
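Abstract summary: Large language models (LLMs) have achieved near-human performance across diverse reasoning tasks. Deploying them on resource-constrained Internet-of-Things (IoT) devices remains impractical because of their massive parameter footprints and memory-intensive autoregressive decoding. This work introduces the first autoregressive-aware split computing framework designed explicitly for deploying LLMs on edge devices.
Reference score (independently computed attention score): 8.705453442427585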
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have achieved near-human performance across diverse reasoning tasks, yet their deployment on resource-constrained Internet-of-Things (IoT) devices remains impractical due to massive parameter footprints and memory-intensive autoregressive decoding. While split computing offers a promising solution by partitioning model execution between edge devices and cloud servers, existing approaches fail to address the unique challenges of autoregressive inference, particularly the iterative token generation process and expanding key-value (KV) cache requirements. This work introduces the first autoregressive-aware split computing framework designed explicitly for LLM deployment on edge devices. Our approach makes three key contributions. First, we develop one-point split compression (OPSC), a mixed-precision quantization scheme that prevents out-of-memory failures by strategically partitioning models into front-end and back-end segments with different precision levels. Second, we propose a two-stage intermediate compression pipeline that combines threshold splitting (TS) and token-wise adaptive bit quantization (TAB-Q) to preserve accuracy-critical activations while dramatically reducing communication overhead. Third, we formulate a unified optimization framework that jointly selects optimal split points, quantization settings, and sequence lengths to satisfy strict memory and latency constraints. Extensive evaluations across diverse LLMs and hardware platforms demonstrate superior performance compared to state-of-the-art quantization methods, including SmoothQuant, OmniQuant, and Atom. The framework achieves a 1.49× inference speedup and a significant reduction in communication overhead while maintaining or improving model accuracy.
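The abstract names the two-stage intermediate compression (TS followed by TAB-Q) but does not spell out how either stage works. The following is only a minimal sketch of one plausible reading, not the authors' method: large-magnitude activations are split off and kept exactly, and each token's remaining values are uniformly quantized at the smallest bit-width that meets an error tolerance. Every name and parameter here (`threshold_split`, `compress_token`, `tau`, `max_err`, `bit_choices`) is hypothetical and chosen for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def threshold_split(row: List[float], tau: float) -> Tuple[List[float], Dict[int, float]]:
    """Assumed TS step: entries with |x| > tau are kept exactly as outliers."""
    outliers = {i: x for i, x in enumerate(row) if abs(x) > tau}
    dense = [0.0 if i in outliers else x for i, x in enumerate(row)]
    return dense, outliers


def quantize_row(dense: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric uniform quantization with a single scale per token."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(x) for x in dense)
    scale = peak / qmax if peak > 0 else 1.0
    return [round(x / scale) for x in dense], scale


def dequantize_row(codes: List[int], scale: float) -> List[float]:
    return [c * scale for c in codes]


def compress_token(row: List[float], tau: float, max_err: float,
                   bit_choices: Sequence[int] = (2, 4, 8)) -> dict:
    """Assumed TAB-Q step: pick the smallest bit-width whose worst-case
    reconstruction error is within max_err; if none qualifies, the
    largest bit-width in bit_choices is used."""
    dense, outliers = threshold_split(row, tau)
    for bits in bit_choices:
        codes, scale = quantize_row(dense, bits)
        recon = dequantize_row(codes, scale)
        if max(abs(a - b) for a, b in zip(dense, recon)) <= max_err:
            break
    return {"bits": bits, "scale": scale, "codes": codes, "outliers": outliers}


def decompress_token(packet: dict) -> List[float]:
    """Receiver side: dequantize, then restore the exact outliers."""
    row = dequantize_row(packet["codes"], packet["scale"])
    for i, x in packet["outliers"].items():
        row[i] = x
    return row
```

In this sketch, a token row such as `[0.1, -0.2, 9.0, 0.05]` with `tau=1.0` would send `9.0` as an exact outlier and quantize the other three values together, so a single large activation does not inflate the quantization scale for the rest of the token.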


Related papers