Fugu-MT Paper Translation (Abstract): LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference

Paper overview: LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference

arxiv url: http://arxiv.org/abs/2601.02569v1
Date: Mon, 05 Jan 2026 21:47:47 GMT
Status: Translation complete
Last updated in system: 2026-01-07 17:02:12.732087
Title: LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference
Title (reference translation): LoRA-Drop: A Temporal LoRA Decoding Method for Efficient LLM Inference
Authors: Hossein Rajabzadeh, Maryam Dialameh, Chul B. Park, Il-Min Kim, Hyock Ju Kwon
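Abstract summary: LoRA-Drop is a plug-and-play inference framework that accelerates decoding by applying a temporal compute schedule to a fixed subset of intermediate layers. LoRA-Drop requires no routing network, is compatible with standard KV caching, and can reduce the KV-cache footprint by skipping KV updates in droppable layers during LoRA steps and refreshing periodically.
Reference score (independently computed attention score): 4.771191010592548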
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive large language models (LLMs) are bottlenecked by sequential decoding, where each new token typically requires executing all transformer layers. Existing dynamic-depth and layer-skipping methods reduce this cost, but often rely on auxiliary routing mechanisms or incur accuracy degradation when bypassed layers are left uncompensated. We present LoRA-Drop, a plug-and-play inference framework that accelerates decoding by applying a temporal compute schedule to a fixed subset of intermediate layers: on most decoding steps, selected layers reuse the previous-token hidden state and apply a low-rank LoRA correction, while periodic refresh steps execute the full model to prevent drift. LoRA-Drop requires no routing network, is compatible with standard KV caching, and can reduce the KV-cache footprint by skipping KV updates in droppable layers during LoRA steps and refreshing periodically. Across LLaMA2-7B, LLaMA3-8B, Qwen2.5-7B, and Qwen2.5-14B, LoRA-Drop achieves up to 2.6× faster decoding and 45-55% KV-cache reduction while staying within 0.5 percentage points (pp) of baseline accuracy. Evaluations on reasoning (GSM8K, MATH, BBH), code generation (HumanEval, MBPP), and long-context/multilingual benchmarks (LongBench, XNLI, XCOPA) identify a consistent safe zone of scheduling configurations that preserves quality while delivering substantial efficiency gains, providing a simple path toward adaptive-capacity inference in LLMs. Code is available at https://github.com/hosseinbv/LoRA-Drop.git.
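Abstract (reference translation): Autoregressive large language models (LLMs) are bottlenecked by sequential decoding, where each new token typically requires executing all transformer layers. Existing dynamic-depth and layer-skipping methods reduce this cost, but often rely on auxiliary routing mechanisms or suffer accuracy degradation when bypassed layers are left uncompensated. We present LoRA-Drop, a plug-and-play inference framework that accelerates decoding by applying a temporal compute schedule to a fixed subset of intermediate layers. LoRA-Drop requires no routing network, is compatible with standard KV caching, and can reduce the KV-cache footprint by skipping KV updates in droppable layers during LoRA steps and refreshing periodically. Across LLaMA2-7B, LLaMA3-8B, Qwen2.5-7B, and Qwen2.5-14B, it achieves up to 2.6× faster decoding and 45-55% KV-cache reduction while staying within 0.5 percentage points (pp) of baseline accuracy. Evaluations on reasoning (GSM8K, MATH, BBH), code generation (HumanEval, MBPP), and long-context/multilingual benchmarks (LongBench, XNLI, XCOPA) identify a consistent safe zone of scheduling configurations that preserves quality while delivering substantial efficiency gains, providing a simple path toward adaptive-capacity inference in LLMs. The code is available at https://github.com/hosseinbv/LoRA-Drop.git.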
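
The temporal compute schedule described in the abstract can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not the authors' implementation: the names `ToyLayer`, `decode_step`, `simulate`, and `refresh_period` are hypothetical, the "full" layer is a trivial stand-in for attention plus MLP, and the rule "cached previous-token output plus a low-rank term B(A h)" is one plausible reading of "reuse the previous-token hidden state and apply a low-rank LoRA correction". KV-cache handling is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def is_refresh_step(step: int, refresh_period: int) -> bool:
    """Assumed schedule: every `refresh_period`-th decoding step is a refresh."""
    return step % refresh_period == 0


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_correction(h: Vector, a: Matrix, b: Matrix) -> Vector:
    """Low-rank update B(A h), with A of shape (r, d) and B of shape (d, r)."""
    return matvec(b, matvec(a, h))


@dataclass
class ToyLayer:
    droppable: bool
    # Hypothetical rank-1 LoRA factors for hidden size d = 2.
    lora_a: Matrix = field(default_factory=lambda: [[0.01, 0.01]])
    lora_b: Matrix = field(default_factory=lambda: [[1.0], [1.0]])
    cached_out: Optional[Vector] = None  # output produced for the previous token
    full_calls: int = 0

    def full(self, h: Vector) -> Vector:
        """Stand-in for the expensive full layer computation."""
        self.full_calls += 1
        return [1.1 * x for x in h]


def decode_step(layers: List[ToyLayer], h: Vector, step: int,
                refresh_period: int) -> Vector:
    """Run one decoding step under the assumed refresh/LoRA schedule."""
    refresh = is_refresh_step(step, refresh_period)
    for layer in layers:
        if refresh or not layer.droppable or layer.cached_out is None:
            h = layer.full(h)  # refresh step or non-droppable layer: full path
        else:
            # LoRA step: reuse the previous token's output, add a cheap correction.
            delta = lora_correction(h, layer.lora_a, layer.lora_b)
            h = [c + d for c, d in zip(layer.cached_out, delta)]
        layer.cached_out = h  # assumption: the next token reuses this output
    return h


def simulate(num_steps: int, refresh_period: int) -> List[int]:
    """Count how often each of four toy layers runs its full path."""
    layers = [ToyLayer(False), ToyLayer(True), ToyLayer(True), ToyLayer(False)]
    for step in range(num_steps):
        token_embedding = [1.0, float(step)]  # made-up input per token
        decode_step(layers, token_embedding, step, refresh_period)
    return [layer.full_calls for layer in layers]
```

In this toy setup, `simulate(8, 4)` runs the two droppable middle layers in full only on the refresh steps (steps 0 and 4), while the first and last layers run in full on every step. The refresh period and the choice of droppable layers are the scheduling knobs that the abstract's "safe zone" refers to; the values used here are arbitrary.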

Related papers