Fugu-MT Paper Translation (Abstract): Towards Monotonic Improvement in In-Context Reinforcement Learning

Paper overview: Towards Monotonic Improvement in In-Context Reinforcement Learning

arxiv url: http://arxiv.org/abs/2509.23209v1
Date: Sat, 27 Sep 2025 09:42:19 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.105968
Title: Towards Monotonic Improvement in In-Context Reinforcement Learning
Authors: Wenhao Zhang, Shao Zhang, Xihuai Wang, Yang Li, Ying Wen
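Abstract summary: In-Context Reinforcement Learning (ICRL) has emerged as a promising paradigm for developing agents that can rapidly adapt to new tasks. Recent approaches train large sequence models on monotonic policy improvement data from online RL, aiming for continued improvement in test-time performance. The authors propose two methods for estimating Context Value at both training and test time.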
Reference score (independently computed attention score): 18.67894044930047
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In-Context Reinforcement Learning (ICRL) has emerged as a promising paradigm for developing agents that can rapidly adapt to new tasks by leveraging past experiences as context, without updating their parameters. Recent approaches train large sequence models on monotonic policy improvement data from online RL, aiming for continued improvement in test-time performance. However, our experimental analysis reveals a critical flaw: at test time, these models fail to show the continued improvement exhibited in the training data. Theoretically, we identify this phenomenon as Contextual Ambiguity, where the model's own stochastic actions can generate an interaction history that misleadingly resembles that of a sub-optimal policy from the training data, initiating a vicious cycle of poor action selection. To resolve Contextual Ambiguity, we introduce Context Value into the training phase and propose Context Value Informed ICRL (CV-ICRL). CV-ICRL uses Context Value as an explicit signal representing the ideal performance theoretically achievable by a policy given the current context. As the context expands, Context Value can incorporate more task-relevant information, and therefore the ideal performance should be non-decreasing. We prove that Context Value tightens the lower bound on the performance gap relative to an ideal, monotonically improving policy. We further propose two methods for estimating Context Value at both training and testing time. Experiments conducted on the Dark Room and Minigrid testbeds demonstrate that CV-ICRL effectively mitigates performance degradation and improves overall ICRL abilities across various tasks and environments. The source code and data of this paper are available at https://github.com/Bluixe/towards_monotonic_improvement .
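The abstract's claim that an ideal-performance signal should be non-decreasing as the context grows can be illustrated with a toy sketch. The snippet below is not the paper's Context Value estimator (the abstract does not describe either of the two proposed estimators); it only assumes, for illustration, that the context is a list of episode returns and uses the best return seen so far as a crude stand-in signal. The function name `running_best_return` and the sample returns are hypothetical.

```python
from itertools import accumulate


def running_best_return(episode_returns):
    """Best episode return observed so far, one value per context length.

    Illustrative stand-in only: a running maximum is non-decreasing by
    construction, which mirrors the property the abstract attributes to
    Context Value. It is NOT the estimator used in CV-ICRL.
    """
    return list(accumulate(episode_returns, max))


# Hypothetical per-episode returns; note the dips at episodes 3 and 5.
returns = [0.0, 1.0, 0.0, 2.0, 1.0]
signal = running_best_return(returns)
print(signal)  # [0.0, 1.0, 1.0, 2.0, 2.0]
```

Even though the raw returns dip, the signal never decreases, so a model conditioned on it would not read a noisy episode as evidence that the underlying policy has regressed.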


List of related papers