Fugu-MT paper translation (abstract): Self-evolving LLM agents with in-distribution Optimization

Paper overview: Self-evolving LLM agents with in-distribution Optimization

arxiv url: http://arxiv.org/abs/2606.07367v1
Date: Fri, 05 Jun 2026 15:09:52 GMT
Status: Translation complete
Last updated in system: 2026-06-08 14:33:29.810043
Title: Self-evolving LLM agents with in-distribution Optimization
Title (reference translation): Self-evolving LLM agents with in-distribution optimization
Authors: Yudi Zhang, Meng Fang, Zhenfang Chen, Mykola Pechenizkiy,
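Abstract summary: Large language models (LLMs) have recently emerged as powerful controllers for interactive agents in complex environments. This paper proposes Q-Evolve, a self-evolving framework for LLM agents that unifies automatic process-reward labeling and policy learning. The method is evaluated on AlfWorld, WebShop, and ScienceWorld, where Q-Evolve outperforms strong baselines in sample efficiency, robustness, and overall task performance.
Reference score (independently computed attention score): 60.05867547965365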
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large Language Models (LLMs) have recently emerged as powerful controllers for interactive agents in complex environments, yet training them to perform reliable long-horizon decision making remains a fundamental challenge. A key difficulty lies in credit assignment: agents often receive delayed rewards only at the end of episodes. In this paper, we propose Q-Evolve, a self-evolving framework for LLM agents that unifies automatic process-reward labeling and policy learning within a principled in-distribution reinforcement learning paradigm. In each evolving iteration, our method learns an in-distribution critic from a hybrid off-policy dataset that combines expert demonstrations with agent-generated trajectories, stabilizing Bellman backups in sparse-reward settings via a weighted Implicit Q-Learning objective. The learned value function is then used to derive step-wise process rewards through advantage estimation, enabling dense and reliable supervision without environment backtracking or human annotation. Leveraging these signals, we perform behavior-proximal policy optimization that evolves the agent over the data used for process reward labeling, allowing iterative self-improvement without exacerbating distribution shift. We evaluate our method on AlfWorld, WebShop, and ScienceWorld, showing Q-Evolve outperforms strong baselines in sample efficiency, robustness, and overall task performance. Our results demonstrate that stable agent self-evolution is achievable through the co-evolution of process-level supervision and policy, both grounded within a shared in-distribution learning loop.
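Abstract (reference translation): Large language models (LLMs) have recently emerged as powerful controllers for interactive agents in complex environments, but training them to make reliable long-horizon decisions remains a fundamental challenge. Agents often receive delayed rewards only at the end of an episode. This paper proposes Q-Evolve, a self-evolving framework for LLM agents. In each iteration, the method learns an in-distribution critic from a hybrid off-policy dataset that combines expert demonstrations with agent-generated trajectories, stabilizing Bellman backups in sparse-reward settings through a weighted Implicit Q-Learning objective. The learned value function is then used to derive step-wise process rewards through advantage estimation, enabling dense and reliable supervision without environment backtracking or human annotation. Using these signals, the method performs behavior-proximal policy optimization, evolving the agent over the data used for process-reward labeling and allowing iterative self-improvement without exacerbating distribution shift. The method is evaluated on AlfWorld, WebShop, and ScienceWorld, where Q-Evolve outperforms strong baselines in sample efficiency, robustness, and overall task performance. These results show that stable agent self-evolution is achievable through the co-evolution of process-level supervision and policy.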
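
The critic and process-reward steps described in the abstract can be illustrated with a small tabular sketch. This is not the authors' implementation: it uses the standard Implicit Q-Learning recipe (expectile regression for V, a TD backup through V for Q) on a hypothetical four-transition dataset, and the per-sample `weight` field is only an assumed stand-in for the paper's weighting scheme, which the abstract does not specify. All names (`DEMO_TRANSITIONS`, `fit_iql_critic`, `process_rewards`) are invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: (state, action, reward, next_state, done, weight).
# The weight is an assumed stand-in for the paper's unspecified weighting.
DEMO_TRANSITIONS = [
    (0, "good", 0.0, 1, False, 2.0),    # expert demonstration, step 1
    (1, "good", 1.0, None, True, 2.0),  # expert demonstration, success
    (0, "good", 0.0, 1, False, 1.0),    # agent rollout, step 1
    (1, "bad", 0.0, None, True, 1.0),   # agent rollout, failure
]


def fit_iql_critic(transitions, tau=0.9, gamma=0.99, lr=0.05, epochs=2000):
    """Fit tabular Q and V with IQL-style updates on a fixed dataset."""
    q = defaultdict(float)  # Q(s, a), keyed by (state, action)
    v = defaultdict(float)  # V(s), keyed by state
    for _ in range(epochs):
        for s, a, r, s_next, done, w in transitions:
            # V-step: expectile regression of V(s) toward Q(s, a).
            # Positive errors get weight tau, negative errors 1 - tau.
            diff = q[(s, a)] - v[s]
            expectile_w = tau if diff > 0 else 1.0 - tau
            v[s] += lr * w * expectile_w * diff
            # Q-step: TD backup through V(s_next) only, so the target never
            # queries actions that are absent from the dataset.
            target = r if done else r + gamma * v[s_next]
            q[(s, a)] += lr * w * (target - q[(s, a)])
    return q, v


def process_rewards(trajectory, q, v):
    """Step-wise process rewards as advantages A(s, a) = Q(s, a) - V(s)."""
    return [q[(s, a)] - v[s] for s, a in trajectory]
```

Calling `fit_iql_critic(DEMO_TRANSITIONS)` and then `process_rewards([(0, "good"), (1, "good")], q, v)` gives a near-zero advantage for the first step, where the data contains only one action, and a positive advantage for the successful final step, while `(1, "bad")` receives a negative one. The sparse terminal reward is thereby turned into a per-step signal using only logged transitions, which is the role the abstract assigns to the learned value function.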
