Fugu-MT Paper Translation (Abstract): MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale

Paper summary: MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale

arxiv url: http://arxiv.org/abs/2603.15954v1
Date: Mon, 16 Mar 2026 22:10:50 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.012646
Title: MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale
Title (reference translation): MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale
Authors: Hanxian Huang, Igor Fedorov, Andrey Gromov, Bernard Beckerman, Naveen Suda, David Eriksson, Maximilian Balandat, Rylan Conway, Patrick Huber, Chinnadhurai Sankar, Ayushi Dalmia, Zechun Liu, Lemeng Wu, Tarek Elgamal, Adithya Sagar, Vikas Chandra, Raghuraman Krishnamoorthi
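Abstract summary: Real-time AI experiences call for on-device large language models (OD-LLMs) optimized for efficient deployment on resource-constrained hardware. This paper presents a methodology for designing such models using hardware-in-the-loop architecture search under mobile latency constraints.
Reference score (independently computed attention score): 36.89558970450915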
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Real-time AI experiences call for on-device large language models (OD-LLMs) optimized for efficient deployment on resource-constrained hardware. The most useful OD-LLMs produce near-real-time responses and exhibit broad hardware compatibility, maximizing user reach. We present a methodology for designing such models using hardware-in-the-loop architecture search under mobile latency constraints. This system is amenable to industry-scale deployment: it generates models deployable without custom kernels and compatible with standard mobile runtimes like Executorch. Our methodology avoids specialized attention mechanisms and instead uses attention skipping for long-context acceleration. Our approach jointly optimizes model architecture (layers, dimensions) and attention pattern. To efficiently evaluate candidates, we treat each as a pruned version of a pretrained backbone with inherited weights, thereby achieving high accuracy with minimal continued pretraining. We leverage the low cost of latency evaluation in a staged process: learning an accurate latency model first, then searching for the Pareto-frontier across latency and quality. This yields MobileLLM-Flash, a family of foundation models (350M, 650M, 1.4B) for efficient on-device use with strong capabilities, supporting up to 8k context length. MobileLLM-Flash delivers up to 1.8x and 1.6x faster prefill and decode on mobile CPUs with comparable or superior quality. Our analysis of Pareto-frontier design choices offers actionable principles for OD-LLM design.
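Abstract (reference translation): Real-time AI experiences call for on-device large language models (OD-LLMs) optimized for efficient deployment on resource-constrained hardware. The most useful OD-LLMs produce near-real-time responses and exhibit broad hardware compatibility, maximizing user reach. This paper proposes a model design methodology that uses hardware-in-the-loop architecture search under mobile latency constraints. It generates models that are deployable without custom kernels and compatible with standard mobile runtimes such as Executorch. The method avoids specialized attention mechanisms and instead uses attention skipping for long-context acceleration. The approach jointly optimizes model architecture (layers, dimensions) and attention pattern. To evaluate candidates efficiently, each is treated as a pruned version of a pretrained backbone with inherited weights, so that high accuracy is reached with minimal continued pretraining. An accurate latency model is learned first, and the Pareto frontier across latency and quality is then searched. This yields MobileLLM-Flash, a family of foundation models (350M, 650M, 1.4B) for efficient on-device use with strong capabilities, supporting up to 8k context length. MobileLLM-Flash delivers up to 1.8x faster prefill and 1.6x faster decode on mobile CPUs with comparable or superior quality. The analysis of Pareto-frontier design choices offers actionable principles for OD-LLM design.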
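
The abstract describes searching for the Pareto frontier across latency and quality. As a minimal illustration of that one step only, the Python sketch below filters a candidate list down to its non-dominated entries (lower latency is better, higher quality is better). The `Candidate` type, the `pareto_frontier` function, and every number in the toy list are hypothetical and chosen for illustration; they are not taken from the paper, and the paper's actual search procedure and latency model are not reproduced here.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical architecture candidate (illustrative only)."""
    name: str
    latency_ms: float  # predicted or measured latency; lower is better
    quality: float     # quality score; higher is better


def pareto_frontier(cands: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate dominates."""
    # Sort by latency ascending; on latency ties, put higher quality first.
    ordered = sorted(cands, key=lambda c: (c.latency_ms, -c.quality))
    frontier: List[Candidate] = []
    best_quality = float("-inf")
    for c in ordered:
        # A candidate is kept only if it beats the quality of every
        # candidate that is at least as fast.
        if c.quality > best_quality:
            frontier.append(c)
            best_quality = c.quality
    return frontier


# Toy values, made up for this example (not results from the paper).
candidates = [
    Candidate("a", 100.0, 0.50),
    Candidate("b", 120.0, 0.48),  # slower and worse than "a"
    Candidate("c", 150.0, 0.60),
    Candidate("d", 150.0, 0.55),  # same latency as "c", lower quality
    Candidate("e", 200.0, 0.60),  # same quality as "c", slower
    Candidate("f", 250.0, 0.70),
]

frontier_names = [c.name for c in pareto_frontier(candidates)]
print(frontier_names)
```

In a staged process like the one the abstract outlines, the `latency_ms` values would plausibly come from a learned latency model rather than from hand-written numbers, which is what makes it cheap to score many candidates before evaluating quality.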

