Fugu-MT Paper Translation (Abstract): Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM

Paper overview: Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM

arxiv url: http://arxiv.org/abs/2604.18655v2
Date: Fri, 24 Apr 2026 17:19:10 GMT
Status: Translation complete
Last updated in system: 2026-04-27 13:34:22.018153
Title: Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM
Title (reference translation): Unlocking the edge deployment and on-device acceleration of a multi-LoRA-enabled, one-for-all foundational LLM
Authors: Sravanth Kodavanti, Sowmya Vajrala, Srinivas Miriyala, Utsav Tiwari, Uttam Kumar, Utkarsh Kumar Mahawar, Achal Pratap Singh, Arya D, Narendra Mutyala, Vikram Nelvoy Rajendiran, Sharan Kumar Allur, Euntaik Lee, Dohyoung Kim, HyeonSu Lee, Gyusung Cho, JungBae Kim
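Abstract summary: We present a hardware-aware framework for efficient on-device inference of a LLaMA-based multilingual foundation model on Samsung Galaxy S24 and S25 devices. The system achieves 4-6x overall improvements in memory and latency while maintaining accuracy across 9 languages and 8 tasks.
Reference score (independently computed attention score): 6.75883098679462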
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Deploying large language models (LLMs) on smartphones poses significant engineering challenges due to stringent constraints on memory, latency, and runtime flexibility. In this work, we present a hardware-aware framework for efficient on-device inference of a LLaMA-based multilingual foundation model supporting multiple use cases on Samsung Galaxy S24 and S25 devices with SM8650 and SM8750 Qualcomm chipsets respectively. Our approach integrates application-specific LoRAs as runtime inputs to a single frozen inference graph, enabling dynamic task switching without recompilation or memory overhead. We further introduce a multi-stream decoding mechanism that concurrently generates stylistic variations - such as formal, polite, or jovial responses - within a single forward pass, reducing latency by up to 6x. To accelerate token generation, we apply Dynamic Self-Speculative Decoding (DS2D), a tree-based strategy that predicts future tokens without requiring a draft model, yielding up to 2.3x speedup in decode time. Combined with quantization to INT4 and architecture-level optimizations, our system achieves 4-6x overall improvements in memory and latency while maintaining accuracy across 9 languages and 8 tasks. These results demonstrate practical feasibility of deploying multi-use-case LLMs on edge devices, advancing the commercial viability of Generative AI in mobile platforms.
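Abstract (reference translation): Deploying large language models (LLMs) on smartphones is a major engineering challenge because of tight constraints on memory, latency, and runtime flexibility. This work proposes a hardware-aware framework for efficient on-device inference of a LLaMA-based multilingual foundation model that supports multiple use cases on Samsung Galaxy S24 and S25 devices, which use the Qualcomm SM8650 and SM8750 chipsets respectively. The approach feeds application-specific LoRAs as runtime inputs into a single frozen inference graph, allowing dynamic task switching without recompilation or memory overhead. It also introduces a multi-stream decoding mechanism that generates stylistic variations (such as formal, polite, or jovial responses) concurrently within a single forward pass, reducing latency by up to 6x. To accelerate token generation, it applies Dynamic Self-Speculative Decoding (DS2D), a tree-based strategy that predicts future tokens without requiring a draft model and yields up to 2.3x speedup in decode time. Combined with INT4 quantization and architecture-level optimizations, the system achieves 4-6x overall improvements in memory and latency while maintaining accuracy across 9 languages and 8 tasks. These results demonstrate the practical feasibility of deploying multi-use-case LLMs on edge devices and advance the commercial viability of generative AI on mobile platforms.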
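The idea of passing LoRAs as runtime inputs to a frozen graph can be illustrated with a toy sketch. This is not the paper's implementation: the function names, the 2x2 weights, and the adapter names "summarize" and "translate" are all hypothetical, and a plain Python closure stands in for a compiled inference graph. The base weights W are fixed once, and the low-rank factors A and B arrive as ordinary call arguments, so switching tasks means passing different arguments rather than rebuilding the layer.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def make_frozen_layer(
    w: Matrix, scale: float = 1.0
) -> Callable[[Vector, Matrix, Matrix], Vector]:
    """Fix the base weights once; LoRA factors are supplied on every call."""

    def layer(x: Vector, lora_a: Matrix, lora_b: Matrix) -> Vector:
        base = matvec(w, x)                         # frozen path: W x
        delta = matvec(lora_b, matvec(lora_a, x))   # low-rank path: B (A x)
        return [b + scale * d for b, d in zip(base, delta)]

    return layer


# Hypothetical 2x2 base weights and two rank-1 adapters (illustrative values only).
W: Matrix = [[1.0, 0.0], [0.0, 1.0]]
ADAPTERS: Dict[str, Dict[str, Matrix]] = {
    "summarize": {"a": [[1.0, 1.0]], "b": [[0.5], [0.0]]},
    "translate": {"a": [[1.0, -1.0]], "b": [[0.0], [2.0]]},
}

layer = make_frozen_layer(W)  # built once, never rebuilt per task
x: Vector = [1.0, 2.0]

# Task switching: same layer object, different adapter inputs.
outputs = {name: layer(x, ad["a"], ad["b"]) for name, ad in ADAPTERS.items()}
```

In this sketch `outputs["summarize"]` and `outputs["translate"]` differ even though `layer` is created only once, which mirrors the abstract's claim of task switching without recompilation. How the paper wires adapters into the on-device graph is not described on this page.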
