Fugu-MT Paper Translation (Abstract): Learning Adaptive LLM Decoding

Paper overview: Learning Adaptive LLM Decoding

arxiv url: http://arxiv.org/abs/2603.09065v1
Date: Tue, 10 Mar 2026 01:15:26 GMT
Status: Translation complete
Last updated in system: 2026-03-11 15:25:23.922911
Title: Learning Adaptive LLM Decoding
Title (reference translation): Learning Adaptive LLM Decoding
Authors: Chloe H. Su, Zhe Ye, Samuel Tenka, Aidan Yang, Soonho Kong, Udaya Ghai,
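Abstract summary: We learn adaptive decoding policies that dynamically select sampling strategies at inference time based on available compute resources. We introduce lightweight decoding adapters trained with reinforcement learning and verifiable terminal rewards. Experiments show that the learned adapters improve the accuracy-budget tradeoff.
Reference score (independently computed attention score): 6.643962667713069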
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Decoding from large language models (LLMs) typically relies on fixed sampling hyperparameters (e.g., temperature, top-p), despite substantial variation in task difficulty and uncertainty across prompts and individual decoding steps. We propose to learn adaptive decoding policies that dynamically select sampling strategies at inference time, conditioned on available compute resources. Rather than fine-tuning the language model itself, we introduce lightweight decoding adapters trained with reinforcement learning and verifiable terminal rewards (e.g. correctness on math and coding tasks). At the sequence level, we frame decoding as a contextual bandit problem: a policy selects a decoding strategy (e.g. greedy, top-k, min-p) for each prompt, conditioned on the prompt embedding and a parallel sampling budget. At the token level, we model decoding as a partially observable Markov decision process (POMDP), where a policy selects sampling actions at each token step based on internal model features and the remaining token budget. Experiments on the MATH and CodeContests benchmarks show that the learned adapters improve the accuracy-budget tradeoff: on MATH, the token-level adapter improves Pass@1 accuracy by up to 10.2% over the best static baseline under a fixed token budget, while the sequence-level adapter yields 2-3% gains under fixed parallel sampling. Ablation analyses support the contribution of both sequence- and token-level adaptation.
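Abstract (reference translation): Decoding from large language models (LLMs) typically relies on fixed sampling hyperparameters (e.g., temperature, top-p), even though task difficulty and uncertainty vary substantially across prompts and individual decoding steps. This paper describes adaptive decoding policies that dynamically select sampling strategies at inference time, conditioned on available compute resources. Instead of fine-tuning the language model itself, it introduces lightweight decoding adapters trained with reinforcement learning and verifiable terminal rewards (e.g., correctness on math and coding tasks). At the sequence level, a policy selects a decoding strategy (e.g., greedy, top-k, min-p) for each prompt, conditioned on the prompt embedding and a parallel sampling budget. At the token level, decoding is modeled as a partially observable Markov decision process (POMDP), in which a policy selects a sampling action at each token step based on internal model features and the remaining token budget. Experiments on the MATH and CodeContests benchmarks show that the learned adapters improve the accuracy-budget tradeoff: on MATH, the token-level adapter improves Pass@1 accuracy by up to 10.2% over the best static baseline under a fixed token budget, while the sequence-level adapter yields gains of 2-3% under fixed parallel sampling. Ablation analyses support the contribution of both sequence-level and token-level adaptation.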
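
The sequence-level framing in the abstract (a policy that picks one decoding strategy per prompt and is trained from a terminal correctness reward) can be illustrated with a small contextual-bandit sketch. This is not the authors' implementation: the linear softmax policy, the REINFORCE update, the three-feature context, the constant baseline, and the `toy_reward` function are all assumptions made for illustration. Only the strategy names (greedy, top-k, min-p) and the idea of a verifiable 0/1 reward come from the abstract.

```python
import math
import random

# Strategy names taken from the abstract; everything else here is hypothetical.
STRATEGIES = ["greedy", "top-k", "min-p"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class LinearSoftmaxPolicy:
    """Hypothetical adapter: one weight vector per strategy over context features."""

    def __init__(self, n_features, n_actions, lr=0.1):
        self.weights = [[0.0] * n_features for _ in range(n_actions)]
        self.lr = lr

    def probs(self, context):
        scores = [sum(w * x for w, x in zip(row, context)) for row in self.weights]
        return softmax(scores)

    def sample(self, context, rng):
        p = self.probs(context)
        return rng.choices(range(len(p)), weights=p, k=1)[0]

    def update(self, context, action, reward, baseline=0.0):
        # REINFORCE step: grad of log pi(action | context) for a softmax-linear
        # policy is (indicator - prob) * context, scaled by the advantage.
        p = self.probs(context)
        advantage = reward - baseline
        for a, row in enumerate(self.weights):
            indicator = 1.0 if a == action else 0.0
            for i, x in enumerate(context):
                row[i] += self.lr * advantage * (indicator - p[a]) * x


def toy_reward(context, action, rng):
    """Stand-in for a verifiable terminal reward (1 = correct, 0 = incorrect).

    Toy assumption: "easy" prompts (context[0] == 1) are best served by greedy,
    "hard" prompts by min-p. A real reward would come from checking the output.
    """
    best = 0 if context[0] == 1.0 else 2
    p_correct = 0.9 if action == best else 0.2
    return 1.0 if rng.random() < p_correct else 0.0


def train(steps=3000, seed=0):
    rng = random.Random(seed)
    policy = LinearSoftmaxPolicy(n_features=3, n_actions=len(STRATEGIES))
    for _ in range(steps):
        easy = rng.random() < 0.5
        # Context = [is_easy, is_hard, normalized sampling budget]; a stand-in
        # for the prompt embedding and parallel sampling budget.
        context = [1.0 if easy else 0.0, 0.0 if easy else 1.0, rng.random()]
        action = policy.sample(context, rng)
        reward = toy_reward(context, action, rng)
        policy.update(context, action, reward, baseline=0.5)
    return policy


if __name__ == "__main__":
    trained = train()
    for name, ctx in [("easy", [1.0, 0.0, 0.5]), ("hard", [0.0, 1.0, 0.5])]:
        probs = trained.probs(ctx)
        print(name, {s: round(p, 3) for s, p in zip(STRATEGIES, probs)})
```

In this toy setup the policy only ever sees the chosen strategy's reward, which is what makes it a bandit problem rather than supervised classification. The token-level POMDP setting described in the abstract would instead require a per-step policy and is not covered by this sketch.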

Related papers