Fugu-MT Paper Translation (Abstract): EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems

Paper abstract: EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems

arxiv url: http://arxiv.org/abs/2510.13220v1
Date: Wed, 15 Oct 2025 07:16:28 GMT
Status: Translation complete
Last updated in system: 2025-10-16 20:13:28.540289
Title: EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems
Authors: Yufei He, Juncheng Liu, Yue Liu, Yibo Li, Tri Cao, Zhiyuan Hu, Xinxing Xu, Bryan Hooi
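Abstract summary: A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time. EvoTest is an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients.
Reference score (independently computed attention measure): 59.66823584073748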
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time, often behaving like "clever but clueless interns" in novel environments. This severely limits their practical utility. To systematically measure and drive progress on this challenge, we first introduce the Jericho Test-Time Learning (J-TTL) benchmark. J-TTL is a new evaluation setup where an agent must play the same game for several consecutive episodes, attempting to improve its performance from one episode to the next. On J-TTL, we find that existing adaptation methods like reflection, memory, or reinforcement learning struggle. To address the challenges posed by our benchmark, we present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients, by evolving the entire agentic system after every episode. EvoTest has two roles: the Actor Agent, which plays the game, and the Evolver Agent, which analyzes the episode transcript to propose a revised configuration for the next run. This configuration rewrites the prompt, updates memory by logging effective state-action choices, tunes hyperparameters, and learns the tool-use routines. On our J-TTL benchmark, EvoTest consistently increases performance, outperforming not only reflection and memory-only baselines but also more complex online fine-tuning methods. Notably, our method is the only one capable of winning two games (Detective and Library), while all baselines fail to win any.
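The episode-by-episode loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy game table `GAME`, the `AgentConfig` fields, and the functions `actor_play`, `evolver_propose`, and `run_test_time_learning` are all hypothetical names chosen for illustration. In the paper both roles are agents acting in Jericho games; here they are replaced by deterministic stubs, and only prompt rewriting and memory logging are modeled (hyperparameter tuning and tool-use routines are left out).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Toy stand-in for a text game (assumption, not a Jericho game):
# state -> (action that earns a point, next state).
GAME = {
    "start": ("open door", "hall"),
    "hall": ("take key", "vault"),
    "vault": ("unlock chest", "won"),
}
ACTIONS = ["look", "open door", "take key", "unlock chest"]

Transcript = List[Tuple[str, str, int]]  # (state, action, reward)


@dataclass
class AgentConfig:
    """Hypothetical configuration that is rewritten between episodes."""
    prompt: str = "Play the game."
    memory: Dict[str, str] = field(default_factory=dict)  # state -> effective action


def actor_play(config: AgentConfig, max_steps: int = 6) -> Transcript:
    """Actor stub: reuse remembered actions, otherwise try actions in a fixed order."""
    state, transcript = "start", []
    tries: Dict[str, int] = {}
    for _ in range(max_steps):
        if state == "won":
            break
        if state in config.memory:
            action = config.memory[state]
        else:
            n = tries.get(state, 0)
            action = ACTIONS[n % len(ACTIONS)]
            tries[state] = n + 1
        good_action, next_state = GAME[state]
        reward = 1 if action == good_action else 0
        transcript.append((state, action, reward))
        if reward:
            state = next_state
    return transcript


def evolver_propose(config: AgentConfig, transcript: Transcript) -> AgentConfig:
    """Evolver stub: log rewarded state-action pairs and rewrite the prompt."""
    memory = dict(config.memory)
    for state, action, reward in transcript:
        if reward > 0:
            memory[state] = action
    hints = "; ".join(f"in {s}, {a}" for s, a in memory.items())
    prompt = f"Play the game. Known good moves: {hints}." if hints else "Play the game."
    return AgentConfig(prompt=prompt, memory=memory)


def run_test_time_learning(episodes: int = 3) -> List[int]:
    """Play the same game repeatedly, evolving the configuration after each episode."""
    config, scores = AgentConfig(), []
    for _ in range(episodes):
        transcript = actor_play(config)
        scores.append(sum(reward for _, _, reward in transcript))
        config = evolver_propose(config, transcript)
    return scores
```

With these toy settings, `run_test_time_learning(3)` returns `[2, 3, 3]`: the first episode spends most of its six steps exploring, while later episodes replay the logged moves and reach the winning state. No gradients or weight updates are involved; only the configuration changes between episodes.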
