Fugu-MT Paper Translation (Abstract): Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge

Paper overview: Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge

arxiv url: http://arxiv.org/abs/2603.29535v1
Date: Tue, 31 Mar 2026 10:17:07 GMT
Status: Translation complete
Last updated in system: 2026-04-01 15:25:03.539477
Title: Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge
Title (reference translation): Quantization with unified adaptive distillation to enable multi-LoRA based one-for-all generative vision models on the edge
Authors: Sowmya Vajrala, Aakash Parmar, Prasanna R, Sravanth Kodavanti, Manjunath Arveti, Srinivas Soumitri Miriyala, Ashok Senapati
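Abstract summary: GenAI features such as image editing, object removal, and prompt-guided image transformation are increasingly integrated into mobile applications. Existing mobile deployment pipelines typically compile a separate model binary for each low-rank adapter. We propose a unified framework that enables multi-task GenAI inference on edge devices using a single shared model.
Reference score (independently computed attention score): 4.632054706878866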
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generative Artificial Intelligence (GenAI) features such as image editing, object removal, and prompt-guided image transformation are increasingly integrated into mobile applications. However, deploying Large Vision Models (LVMs) for such tasks on resource-constrained devices remains challenging due to their high memory and compute requirements. While Low-Rank Adapters (LoRAs) enable parameter-efficient task adaptation, existing mobile deployment pipelines typically compile a separate model binary for each LoRA, plus a copy of the foundation model, resulting in redundant storage and increased runtime overhead. In this work, we present a unified framework for enabling multi-task GenAI inference on edge devices using a single shared model. Our key idea is to treat LoRA weights as runtime inputs rather than embedding them into the compiled model graph, allowing dynamic task switching at runtime without recompilation. Then, to support efficient on-device execution, we introduce QUAD (Quantization with Unified Adaptive Distillation), a quantization-aware training strategy that aligns multiple LoRA adapters under a shared quantization profile. We implement the proposed system with a lightweight runtime stack compatible with mobile NPUs and evaluate it across multiple chipsets. Experimental results demonstrate up to a 6x reduction in memory footprint and up to a 4x improvement in latency, while maintaining high visual quality across multiple GenAI tasks.
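Abstract (reference translation): Generative artificial intelligence (GenAI) features such as image editing, object removal, and prompt-guided image transformation are increasingly integrated into mobile applications. However, deploying Large Vision Models (LVMs) for such tasks on resource-constrained devices remains difficult because of their high memory and compute requirements. Low-Rank Adapters (LoRAs) enable parameter-efficient task adaptation, but existing mobile deployment pipelines typically compile a separate model binary for each LoRA. In this work, we propose a unified framework that enables multi-task GenAI inference on edge devices using a single shared model. Our key idea is to treat LoRA weights as runtime inputs rather than embedding them in the compiled model graph. Next, to support efficient on-device execution, we introduce QUAD (Quantization with Unified Adaptive Distillation), a quantization-aware training strategy that aligns multiple LoRA adapters under a shared quantization profile. The proposed system is implemented with a lightweight runtime stack compatible with mobile NPUs and evaluated on multiple chipsets. Experimental results show up to a 6x reduction in memory footprint and up to a 4x latency improvement while maintaining visual quality across multiple GenAI tasks.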
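
The abstract's key idea, passing LoRA weights as runtime inputs instead of baking them into the compiled graph, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the adapter names (`object_removal`, `style_edit`), the toy weights, and the `scale` parameter are all hypothetical, and the sketch omits quantization and NPU execution entirely. It only shows that one shared base weight can serve several tasks when the low-rank factors A and B arrive as arguments.

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def linear_with_runtime_lora(
    x: List[float],
    base_weight: Matrix,                    # frozen weight shared by all tasks
    lora: Optional[Tuple[Matrix, Matrix]],  # (A, B) supplied on each call
    scale: float = 1.0,                     # hypothetical LoRA scaling factor
) -> List[float]:
    """Compute y = W x + scale * B (A x), with A and B passed as inputs."""
    y = matvec(base_weight, x)
    if lora is not None:
        a, b = lora  # A has shape (rank, in_dim); B has shape (out_dim, rank)
        delta = matvec(b, matvec(a, x))
        y = [yi + scale * di for yi, di in zip(y, delta)]
    return y


# One shared base weight and two hypothetical rank-1 task adapters.
W: Matrix = [[1.0, 0.0], [0.0, 1.0]]
ADAPTERS: Dict[str, Tuple[Matrix, Matrix]] = {
    "object_removal": ([[1.0, 1.0]], [[0.5], [0.0]]),
    "style_edit": ([[1.0, -1.0]], [[0.0], [2.0]]),
}

x = [2.0, 1.0]
out_base = linear_with_runtime_lora(x, W, None)
out_removal = linear_with_runtime_lora(x, W, ADAPTERS["object_removal"])
out_style = linear_with_runtime_lora(x, W, ADAPTERS["style_edit"])
```

Switching tasks here means selecting a different entry of `ADAPTERS` for the next call; the function and the base weight `W` never change, which mirrors the abstract's claim of task switching without recompilation.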


Related papers