Fugu-MT Paper Translation (Summary): Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach

Paper summary: Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach

arxiv url: http://arxiv.org/abs/2511.04393v1
Date: Thu, 06 Nov 2025 14:21:22 GMT
Status: Translation complete
Last updated in system: 2025-11-07 20:17:53.448237
Title: Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach
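Title (reference translation): Post-Trained LLMs as Better Decision-Making Agents: A Regret-Minimization Approach
Authors: Chanwoo Park, Ziyang Chen, Asuman Ozdaglar, Kaiqing Zhang
Abstract summary: Iterative Regret-Minimization Fine-Tuning (Iterative RMFT) is a post-training procedure that distills low-regret decision trajectories back into the base model. Its reliance on model-generated reasoning avoids rigid output engineering and provides more flexible, natural-language training signals. Iterative RMFT improves the DM performance of LLMs across a wide range of models.
Reference score (independently computed attention metric): 37.78174504569736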
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large language models (LLMs) are increasingly deployed as "agents" for decision-making (DM) in interactive and dynamic environments. Yet, since they were not originally designed for DM, recent studies show that LLMs can struggle even in basic online DM problems, failing to achieve low regret or an effective exploration-exploitation tradeoff. To address this, we introduce Iterative Regret-Minimization Fine-Tuning (Iterative RMFT), a post-training procedure that repeatedly distills low-regret decision trajectories back into the base model. At each iteration, the model rolls out multiple decision trajectories, selects the k-lowest regret ones, and fine-tunes itself on them. Unlike prior methods that (a) distill action sequences from known DM algorithms or (b) rely on manually crafted chain-of-thought templates, our approach leverages the regret metric to elicit the model's own DM ability and reasoning rationales. This reliance on model-generated reasoning avoids rigid output engineering and provides more flexible, natural-language training signals. Empirical results show that Iterative RMFT improves LLMs' DM performance across diverse models - from Transformers with numerical input/output, to open-weight LLMs, and advanced closed-weight models like GPT-4o mini. Its flexibility in output and reasoning formats enables generalization across tasks with varying horizons, action spaces, reward processes, and natural-language contexts. Finally, we provide theoretical insight showing that a single-layer Transformer under this paradigm can act as a no-regret learner in a simplified setting. Overall, Iterative RMFT offers a principled and general post-training framework for enhancing LLMs' decision-making capabilities.
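Abstract (reference translation): Large language models (LLMs) are increasingly deployed as "agents" for decision-making (DM) in interactive and dynamic environments. However, because they were not originally designed for DM, recent studies show that LLMs can struggle even on basic online DM problems and fail to achieve low regret or an effective exploration-exploitation tradeoff. This work therefore proposes Iterative Regret-Minimization Fine-Tuning (Iterative RMFT), a post-training method that repeatedly distills low-regret decision trajectories back into the base model. At each iteration, the model rolls out multiple decision trajectories, selects the k trajectories with the lowest regret, and fine-tunes itself on them. Unlike prior methods that (a) distill action sequences from known DM algorithms or (b) rely on manually crafted chain-of-thought templates, this approach uses the regret metric to elicit the model's own DM ability and reasoning rationales. This reliance on model-generated reasoning avoids rigid output engineering and provides more flexible, natural-language training signals. Experimental results show that Iterative RMFT improves the DM performance of LLMs across a variety of models, from Transformers with numerical input/output, to open-weight LLMs, to advanced closed-weight models such as GPT-4o mini. The flexibility of its output and reasoning formats enables generalization across tasks with varying horizons, action spaces, reward processes, and natural-language contexts. Finally, the paper provides theoretical insight showing that a single-layer Transformer under this paradigm can act as a no-regret learner in a simplified setting. Overall, Iterative RMFT offers a principled and general post-training framework for enhancing the decision-making capabilities of LLMs.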
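
The loop described in the abstract (roll out several trajectories, keep the k with the lowest regret, fine-tune on them, repeat) can be sketched on a toy multi-armed bandit. This is a minimal illustration under stated assumptions, not the authors' implementation: the "model" is a plain probability vector over arms instead of an LLM, regret is pseudo-regret computed from known arm means, and "fine-tuning" is a smoothed maximum-likelihood refit. All names (`pseudo_regret`, `select_k_lowest_regret`, `rollout`, `fit_to_trajectories`, `iterative_rmft`) are hypothetical.

```python
import random
from typing import Callable, List, Sequence

Policy = List[float]    # toy stand-in for the model: probability of pulling each arm
Trajectory = List[int]  # sequence of chosen arm indices


def pseudo_regret(actions: Sequence[int], arm_means: Sequence[float]) -> float:
    """Sum over steps of (best arm mean - chosen arm mean). Assumes known means."""
    best = max(arm_means)
    return sum(best - arm_means[a] for a in actions)


def select_k_lowest_regret(trajs: Sequence[Trajectory],
                           arm_means: Sequence[float], k: int) -> List[Trajectory]:
    """Keep the k trajectories with the smallest pseudo-regret."""
    return sorted(trajs, key=lambda t: pseudo_regret(t, arm_means))[:k]


def rollout(policy: Policy, horizon: int, rng: random.Random) -> Trajectory:
    """Sample one decision trajectory from the toy policy."""
    return rng.choices(range(len(policy)), weights=policy, k=horizon)


def fit_to_trajectories(policy: Policy, selected: Sequence[Trajectory],
                        smoothing: float = 1.0) -> Policy:
    """Toy 'fine-tuning': refit arm probabilities to the selected trajectories."""
    counts = [smoothing] * len(policy)
    for traj in selected:
        for a in traj:
            counts[a] += 1
    total = sum(counts)
    return [c / total for c in counts]


def iterative_rmft(policy: Policy,
                   rollout_fn: Callable[[Policy], Trajectory],
                   fine_tune_fn: Callable[[Policy, Sequence[Trajectory]], Policy],
                   arm_means: Sequence[float],
                   n_iters: int, n_rollouts: int, k: int) -> Policy:
    """Repeat: roll out, select the k lowest-regret trajectories, fine-tune on them."""
    for _ in range(n_iters):
        trajs = [rollout_fn(policy) for _ in range(n_rollouts)]
        selected = select_k_lowest_regret(trajs, arm_means, k)
        policy = fine_tune_fn(policy, selected)
    return policy


if __name__ == "__main__":
    rng = random.Random(0)
    means = [0.2, 0.5, 0.8]  # hypothetical Bernoulli arm means
    final = iterative_rmft([1 / 3] * 3, lambda p: rollout(p, 20, rng),
                           fit_to_trajectories, means,
                           n_iters=10, n_rollouts=64, k=8)
    print(final)
```

In this toy setting the selected trajectories contain more pulls of the best arm than a typical rollout, so each refit shifts probability mass toward it. In the paper's setting the trajectories would also carry the model's own natural-language reasoning, which this sketch omits.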

Related papers