Fugu-MT Paper Translation (Abstract): Budgeted LoRA: Distillation as Structured Compute Allocation for Efficient Inference

Paper summary: Budgeted LoRA: Distillation as Structured Compute Allocation for Efficient Inference

arxiv url: http://arxiv.org/abs/2605.04341v1
Date: Tue, 05 May 2026 22:59:14 GMT
Status: Translation complete
Last updated in system: 2026-05-07 18:41:07.569282
Title: Budgeted LoRA: Distillation as Structured Compute Allocation for Efficient Inference
Authors: Mohammed Sabry, Anya Belz
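Abstract summary: Budgeted LoRA is a framework that treats model compression as a structured compute allocation problem. At a moderate budget it matches standard LoRA perplexity with a 1.74x compressed-module speedup. At an aggressive budget it achieves a 4.05x speedup with moderate perplexity degradation, and it preserves higher accuracy on function-style in-context learning probes.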
Reference score (independently computed attention score): 6.886536285117155
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study distillation for large language models under explicit compute constraints, with the goal of producing student models that are not only cheaper to train, but structurally efficient at inference time. While prior approaches to parameter-efficient distillation, such as LoRA, reduce adaptation cost, they leave the dense backbone unchanged and therefore fail to deliver meaningful inference savings. We propose Budgeted LoRA, a distillation framework that treats model compression as a structured compute allocation problem. Instead of using a fixed student architecture, we introduce a global compute budget that sets the final target fraction of dense computation retained. Under this constraint, the model redistributes capacity across dense and low-rank pathways via (i) module-level dense retention coefficients, (ii) adaptive low-rank allocation, and (iii) post-training compression that selectively removes, approximates, or preserves dense components. This formulation yields a family of students controlled by a single budget dial. Empirically, Budgeted LoRA matches standard LoRA perplexity at a moderate budget with a 1.74x compressed-module speedup; at an aggressive budget it achieves a 4.05x speedup with moderate perplexity degradation, and it preserves higher accuracy on function-style in-context learning probes. These results suggest that, under compute-constrained distillation, retaining behavior is less about matching perplexity or removing more parameters than it is about controlling how dense computation is transferred to low-rank pathways.
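The abstract describes a global compute budget that fixes the target fraction of dense computation retained, together with module-level dense retention coefficients and a post-training step that removes, approximates, or preserves dense components. The abstract gives no formulas, so the following is only an illustrative sketch of how such a budget could be accounted for, not the authors' method. Every name here (`Module`, `alpha`, `retained_dense_fraction`, `budget_penalty`, `compression_action`) and both thresholds are hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """One linear module with a dense path and a low-rank path (hypothetical)."""
    name: str
    d_in: int
    d_out: int
    alpha: float  # assumed dense retention coefficient in [0, 1]
    rank: int     # assumed rank of the low-rank pathway


def dense_cost(m: Module) -> int:
    # Multiply-accumulates per token for the dense weight matrix.
    return m.d_in * m.d_out


def low_rank_cost(m: Module) -> int:
    # Multiply-accumulates per token for a rank-r factorization (two thin matrices).
    return m.rank * (m.d_in + m.d_out)


def retained_dense_fraction(modules: list[Module]) -> float:
    # Cost-weighted share of dense computation kept across all modules.
    total = sum(dense_cost(m) for m in modules)
    kept = sum(m.alpha * dense_cost(m) for m in modules)
    return kept / total


def budget_penalty(modules: list[Module], budget: float) -> float:
    # Hinge penalty (an assumption): zero once the retained fraction is within budget.
    return max(0.0, retained_dense_fraction(modules) - budget)


def compression_action(alpha: float, drop_below: float = 0.1,
                       keep_above: float = 0.9) -> str:
    # Assumed thresholding rule for the post-training step.
    if alpha < drop_below:
        return "remove"
    if alpha > keep_above:
        return "preserve"
    return "approximate"


if __name__ == "__main__":
    student = [
        Module("attn.q_proj", d_in=4, d_out=4, alpha=1.0, rank=2),
        Module("mlp.up_proj", d_in=4, d_out=12, alpha=0.0, rank=2),
    ]
    print(retained_dense_fraction(student))
    print([compression_action(m.alpha) for m in student])
```

In this sketch a single `budget` value plays the role of the "budget dial": lowering it pushes the cost-weighted sum of retention coefficients down, and the threshold rule then decides per module whether the dense path is dropped, approximated, or kept.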

Related papers