Fugu-MT Paper Translation (Abstract): EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning

Paper overview: EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning

arxiv url: http://arxiv.org/abs/2606.17680v1
Date: Tue, 16 Jun 2026 08:48:09 GMT
Status: Translation complete
Last updated in system: 2026-06-17 17:15:32.359239
Title: EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning
Title (reference translation): EnvRL: Learning from Environment Dynamics in Agentic Reinforcement Learning
Authors: Zhitong Wang, Songze Li, Hao Peng, Shuzheng Si, Yi Wang, Maosong Sun, Juanzi Li
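Abstract summary: Reinforcement learning (RL) has emerged as a powerful paradigm for training large language models (LLMs) as agents. This paper proposes EnvRL, a framework that incorporates environment dynamics learning into agentic RL.
Reference score (independently computed attention measure): 78.62829041672663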
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuitively, relying on outcome rewards alone overlooks the rich environment dynamics information contained in rollout interaction trajectories. We argue that the interaction experience inherently serves as an implicit supervision signal, reveals the underlying transition mechanisms of the environment, and enables the agent to construct a more accurate internal model of the environment. Therefore, in this work, we investigate how to leverage this additional signal to improve policy learning. Specifically, we propose EnvRL, a framework that incorporates environment dynamics learning into agentic RL via two auxiliary objectives: state prediction and inverse dynamics. By jointly optimizing these with the primary RL objective, we encourage the agent to internalize environment dynamics from its own interaction experience. Extensive experiments on two long-horizon agentic benchmarks demonstrate that EnvRL achieves significant improvements in success rate over RL-only baselines; for example, when trained with GRPO, it lifts Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld and from 56.8% to 67.0% on WebShop.
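Abstract (reference translation): Reinforcement learning (RL) has emerged as a powerful paradigm for training large language models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuitively, this overlooks the rich environment dynamics information contained in rollout interaction trajectories. Interaction experience inherently serves as an implicit supervision signal, reveals the underlying transition mechanisms of the environment, and enables the agent to build a more accurate internal model of the environment. This work therefore examines how to use this additional signal to improve policy learning. Specifically, it proposes EnvRL, a framework that incorporates environment dynamics learning into agentic RL. Jointly optimizing with the primary RL objective encourages the agent to internalize environment dynamics from its own interaction experience. Extensive experiments on two long-horizon agentic benchmarks show that EnvRL improves over RL-only baselines; for example, when trained with GRPO, it lifts Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld and from 56.8% to 67.0% on WebShop.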
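
The abstract describes jointly optimizing the primary RL objective with two auxiliary objectives, state prediction and inverse dynamics, but does not give the loss form. The sketch below is one plausible reading, not the authors' implementation: it assumes each auxiliary objective is an average negative log-likelihood over target tokens (the next observation for state prediction, the taken action for inverse dynamics) and that the terms are combined as a weighted sum. The names `mean_nll`, `AuxWeights`, and `joint_loss`, and the default weights of 0.1, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood over the target tokens (0.0 if empty)."""
    if not token_logprobs:
        return 0.0
    return -sum(token_logprobs) / len(token_logprobs)


@dataclass(frozen=True)
class AuxWeights:
    # Hypothetical coefficients: the abstract does not say how the terms are weighted.
    state_prediction: float = 0.1
    inverse_dynamics: float = 0.1


def joint_loss(
    rl_loss: float,
    next_obs_logprobs: Sequence[float],
    action_logprobs: Sequence[float],
    weights: Optional[AuxWeights] = None,
) -> float:
    """Weighted sum of the RL loss and two auxiliary dynamics losses.

    next_obs_logprobs: model log-probs of the observed next-state tokens,
        conditioned on the history and the action (state prediction).
    action_logprobs: model log-probs of the taken-action tokens, conditioned
        on the states before and after the action (inverse dynamics).
    """
    w = weights if weights is not None else AuxWeights()
    state_prediction_loss = mean_nll(next_obs_logprobs)
    inverse_dynamics_loss = mean_nll(action_logprobs)
    return (
        rl_loss
        + w.state_prediction * state_prediction_loss
        + w.inverse_dynamics * inverse_dynamics_loss
    )


if __name__ == "__main__":
    # Toy numbers only; a real setup would take these from rollout trajectories.
    print(joint_loss(1.0, [-1.0, -3.0], [-2.0]))
```

In this sketch the rollout trajectories already collected for RL supply both auxiliary targets, so no extra environment interaction is needed; whether the paper computes or weights the terms this way is an assumption.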
