Fugu-MT Paper Translation (Summary): Compute Where it Counts: Self Optimizing Language Models

Paper summary: Compute Where it Counts: Self Optimizing Language Models

arxiv url: http://arxiv.org/abs/2605.10875v1
Date: Mon, 11 May 2026 17:27:15 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:51.029584
Title: Compute Where it Counts: Self Optimizing Language Models
Authors: Yash Akhauri, Mohamed S. Abdelfattah
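Abstract summary: We study dynamic budget allocation for autoregressive decoding. We train the policy with group-relative policy optimization on teacher-forced episodes. Our reward trades off language-model quality against soft penalties that encourage episode-average budget usage to match a requested target.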
Reference score (independently computed attention score): 10.058821474955177
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Efficient LLM inference research has largely focused on reducing the cost of each decoding step (e.g., using quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can over-compute on easy steps and under-compute on hard ones. We study dynamic budget allocation for autoregressive decoding: learning how much computation to spend per token from within a single model. Self-Optimizing Language Models (SOL) pair a frozen LLM with a lightweight policy network that reads the LLM hidden state and selects a discrete efficiency action at each decode step. Actions can jointly control (i) token-level attention sparsity, (ii) structured activation pruning in the MLP, and (iii) activation quantization bit-width, while leaving the base model weights unchanged. We train the policy with group-relative policy optimization on teacher-forced episodes: the token sequence is fixed, while we sample multiple compute schedules (i.e., "counterfactual" schedules that vary only the efficiency actions for the same token path) and compare their likelihoods under the same supervision. Our reward trades off language-model quality against soft penalties that encourage episode-average budget usage to match a requested target. Across model variants and compute regimes, SOL improves quality at matched budget over static allocation and strong random schedule search, offering a complementary axis for inference-efficiency optimization. SOL discovers a better quality-efficiency pareto-front across all our experiments and improves MMLU accuracy by up to 7.3% over uniform budget allocation strategies.
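Abstract (reference translation): Research on efficient LLM inference has largely focused on reducing the cost of each decoding step (e.g., through quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can over-compute on easy steps and under-compute on hard ones. We study dynamic budget allocation for autoregressive decoding, learning from within a single model how much computation to spend per token. Self-Optimizing Language Models (SOL) pair a frozen LLM with a lightweight policy network that reads the LLM hidden state and selects a discrete efficiency action at each decode step. Actions can jointly control (i) token-level attention sparsity, (ii) structured activation pruning in the MLP, and (iii) activation quantization bit-width, while the base model weights remain unchanged. The policy is trained with group-relative policy optimization on teacher-forced episodes: the token sequence is fixed, while multiple compute schedules are sampled (i.e., "counterfactual" schedules that differ only in the efficiency actions for the same token path) and their likelihoods are compared under the same supervision. The reward trades off language-model quality against soft penalties that encourage episode-average budget usage to match a requested target. Across model variants and compute regimes, SOL improves quality at matched budget over static allocation and strong random schedule search, offering a complementary axis for inference-efficiency optimization. SOL discovers a better quality-efficiency Pareto front across all experiments and improves MMLU accuracy by up to 7.3% over uniform budget allocation strategies.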
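
The training signal described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the absolute-gap penalty, the `penalty_weight` parameter, the cost scale (fraction of full compute), and all function names are assumptions made for illustration. It shows only the two ideas the abstract states: a reward that combines likelihood on a fixed token sequence with a soft budget penalty, and advantages computed relative to a group of schedules sampled for the same tokens.

```python
from statistics import mean, pstdev


def schedule_reward(token_logprobs, step_costs, target_budget, penalty_weight=1.0):
    """Quality term minus a soft penalty on the gap to the requested budget.

    token_logprobs: log-likelihoods of the fixed (teacher-forced) tokens
    step_costs: per-step compute cost as a fraction of full compute (assumed scale)
    """
    quality = mean(token_logprobs)
    # Assumed penalty form: absolute gap between episode-average cost and target.
    budget_gap = abs(mean(step_costs) - target_budget)
    return quality - penalty_weight * budget_gap


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of schedules for the same token path."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical compute schedules for the same two-token sequence:
# (per-token log-likelihoods, per-step compute costs).
group = [
    ([-1.0, -2.0], [0.5, 0.5]),  # on budget, lower likelihood
    ([-1.0, -1.0], [0.5, 0.5]),  # on budget, higher likelihood
    ([-0.5, -0.5], [1.0, 1.0]),  # best likelihood, but over budget
]
rewards = [schedule_reward(lp, cost, target_budget=0.5) for lp, cost in group]
advantages = group_relative_advantages(rewards)
```

In this toy group the over-budget schedule ends up with the same reward as the on-budget schedule with higher likelihood, so the penalty cancels its likelihood gain. In a policy-gradient update, these advantages would weight the log-probabilities of the sampled efficiency actions.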
