Fugu-MT Paper Translation (Abstract): When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training

Paper summary: When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training

arxiv url: http://arxiv.org/abs/2509.24923v1
Date: Mon, 29 Sep 2025 15:25:42 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:20.091815
Title: When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training
Title (reference translation): Emergent Exploitation Bias in Meta-Bandit LLM Training
Authors: Sanxing Chen, Xiaoyin Chen, Yukun Huang, Roy Xie, Bhuwan Dhingra
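Abstract summary: Large language models (LLMs) often explore suboptimally in sequential decision-making. Recent work has sought to improve this capability through supervised fine-tuning (SFT) or reinforcement learning (RL), reducing regret on the classic multi-armed bandit task. This work investigates both paradigms by training LLMs with SFT and with RL under a range of reward signals. The resulting agents outperform pre-trained models and achieve performance comparable to Upper Confidence Bound (UCB) and Thompson Sampling.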
Reference score (independently computed attention score): 26.66184262287797
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Large Language Models (LLMs) hold promise to become autonomous agents, they often explore suboptimally in sequential decision-making. Recent work has sought to enhance this capability via supervised fine-tuning (SFT) or reinforcement learning (RL), improving regret on the classic multi-armed bandit task. However, it remains unclear how these learning methods shape exploration strategies and how well they generalize. We investigate both paradigms by training LLMs with SFT on expert trajectories and RL with a range of tailored reward signals including a strategic, regret-shaped reward to reduce variance, and an algorithmic reward that enables oracle imitation. The resulting agents outperform pre-trained models and achieve performance comparable to Upper Confidence Bound (UCB) and Thompson Sampling, with robust generalization to 6x longer horizons and across bandit families. Behavioral analysis reveals that gains often stem from more sophisticated but greedier exploitation: RL/SFT agents are more prone to early catastrophic failure than pre-trained models, prematurely abandoning exploration. Furthermore, agents trained to imitate UCB learn to outperform their teacher by adopting more exploitative variants. Our findings clarify when each training paradigm is preferable and advocate tailored reward design and evaluation beyond average regret to promote robust exploratory behavior.
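Abstract (reference translation): While large language models (LLMs) hold promise to become autonomous agents, they often explore suboptimally in sequential decision-making. Recent work has sought to improve this capability through supervised fine-tuning (SFT) or reinforcement learning (RL), reducing regret on the classic multi-armed bandit task. However, it remains unclear how these learning methods shape exploration strategies and how well they generalize. The authors investigate both paradigms, training LLMs with SFT on expert trajectories and with RL under a range of tailored reward signals, including a strategic, regret-shaped reward to reduce variance and an algorithmic reward that enables oracle imitation. The resulting agents outperform pre-trained models and achieve performance comparable to Upper Confidence Bound (UCB) and Thompson Sampling, with robust generalization to 6x longer horizons and across bandit families. RL/SFT agents are more prone to early catastrophic failure than pre-trained models, abandoning exploration prematurely. Furthermore, agents trained to imitate UCB learn to outperform their teacher by adopting more exploitative variants. The findings clarify when each training paradigm is preferable and advocate tailored reward design and evaluation beyond average regret to promote robust exploratory behavior.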
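
The abstract compares agents against UCB and Thompson Sampling on multi-armed bandits and argues for evaluation beyond average regret. The following is a minimal Python sketch of that comparison, not the paper's setup: the Bernoulli arms, the arm means, the horizon, and every function name (`run_bandit`, `greedy`, `ucb1`, `thompson`, `summarize`) are assumptions made for illustration. It reports the worst-episode regret alongside the mean, since a greedy policy can look fine on average while occasionally locking onto a bad arm.

```python
import math
import random


def run_bandit(policy, means, horizon, rng):
    """Play one Bernoulli bandit episode; return cumulative pseudo-regret."""
    k = len(means)
    pulls = [0] * k  # times each arm was pulled
    wins = [0] * k   # reward-1 outcomes per arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        arm = policy(t, pulls, wins, rng)
        reward = 1 if rng.random() < means[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
        regret += best - means[arm]  # gap to the best arm's true mean
    return regret


def first_unpulled(pulls):
    """Return the index of an arm never pulled so far, or None."""
    for arm, n in enumerate(pulls):
        if n == 0:
            return arm
    return None


def greedy(t, pulls, wins, rng):
    # Try each arm once, then always exploit the best empirical mean.
    arm = first_unpulled(pulls)
    if arm is not None:
        return arm
    return max(range(len(pulls)), key=lambda a: wins[a] / pulls[a])


def ucb1(t, pulls, wins, rng):
    # Empirical mean plus the standard UCB1 bonus sqrt(2 ln t / n).
    arm = first_unpulled(pulls)
    if arm is not None:
        return arm
    return max(
        range(len(pulls)),
        key=lambda a: wins[a] / pulls[a] + math.sqrt(2 * math.log(t) / pulls[a]),
    )


def thompson(t, pulls, wins, rng):
    # Beta(1, 1) prior per arm; sample a plausible mean and act greedily on it.
    samples = [
        rng.betavariate(1 + wins[a], 1 + pulls[a] - wins[a])
        for a in range(len(pulls))
    ]
    return max(range(len(pulls)), key=lambda a: samples[a])


def summarize(policy, means, horizon, episodes, seed=0):
    """Return (mean regret, worst-episode regret) over independent episodes."""
    rng = random.Random(seed)
    regrets = [run_bandit(policy, means, horizon, rng) for _ in range(episodes)]
    return sum(regrets) / episodes, max(regrets)


if __name__ == "__main__":
    arm_means = [0.3, 0.5, 0.7]  # hypothetical arm success probabilities
    for name, policy in [("greedy", greedy), ("ucb1", ucb1), ("thompson", thompson)]:
        mean_regret, worst = summarize(policy, arm_means, horizon=300, episodes=200)
        print(f"{name:9s} mean={mean_regret:7.2f} worst={worst:7.2f}")
```

Comparing the `mean` and `worst` columns across policies is one simple way to surface the kind of early failure the abstract describes, which an average alone can hide.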

Related papers