Fugu-MT Paper Translation (Overview): ECHO: Terminal Agents Learn World Models for Free

Paper overview: ECHO: Terminal Agents Learn World Models for Free

arxiv url: http://arxiv.org/abs/2605.24517v1
Date: Sat, 23 May 2026 11:08:46 GMT
Status: Translation complete
System update date: 2026-05-26 19:50:18.153791
Title: ECHO: Terminal Agents Learn World Models for Free
Title (reference translation): ECHO: Terminal Agents Learn World Models for Free
Authors: Vaishnavi Shrivastava, Piero Kauffmann, Ahmed Awadallah, Dimitris Papailiopoulos
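Abstract summary: ECHO (Environment Cross-Entropy Hybrid Objective) is a hybrid objective that combines the standard policy-gradient loss on action tokens with an auxiliary loss that trains the policy to predict environment observation tokens. ECHO produces policies that predict terminal dynamics more accurately, even on trajectories they did not generate.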
Reference score (independently computed attention score): 13.305830192059625
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: CLI agents are the closest thing language models have to an embodied setting: the model emits commands, the terminal executes them, and the returned stream -- stdout, errors, files, logs, and traces -- records the consequences. We argue that this stream is a supervision signal, but standard agent RL discards it: GRPO-style training updates action tokens with sparse outcome-level rewards while ignoring environment responses already in the rollout. Failed rollouts provide little policy-gradient signal despite containing rich evidence about how the environment responds. We introduce ECHO (Environment Cross-entropy Hybrid Objective), a hybrid objective that combines the standard policy-gradient loss on action tokens with an auxiliary loss that trains the policy to predict environment observation tokens resulting from its own actions. ECHO reuses the same forward pass as GRPO, requires no additional rollouts, and turns terminal feedback into dense supervision for all rollouts. ECHO doubles GRPO pass@1 on TerminalBench-2.0: Qwen3-8B improves from 2.70% to 5.17%, and Qwen3-14B from 5.17% to 10.79%. ECHO also produces policies that better predict terminal dynamics, even on trajectories they did not generate: across held-out rollouts, it sharply reduces environment-token cross-entropy while GRPO alone barely changes it. From base Qwen3-8B, ECHO matches expert-SFT-then-GRPO performance on held-out terminal tasks without expert demonstrations, and recovers roughly half of the expert-SFT initialization benefit on TerminalBench-2.0. In some settings, the environment prediction loss alone enables verifier-free self-improvement, allowing policies to improve on unseen OOD tasks by learning only from environment interactions. Together, these results suggest that environment observations are not merely context for future actions, but a dense, on-policy supervision signal already present in every rollout.
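Abstract (reference translation): The model emits commands, the terminal executes them, and the returned stream (stdout, errors, files, logs, and traces) records the consequences. GRPO-style training updates action tokens with sparse outcome-level rewards while ignoring the environment responses already in the rollout. Failed rollouts provide little policy-gradient signal despite containing rich evidence about how the environment responds. ECHO (Environment Cross-Entropy Hybrid Objective) is a hybrid objective that combines the standard policy-gradient loss on action tokens with an auxiliary loss that trains the policy to predict the environment observation tokens resulting from its own actions. ECHO reuses the same forward pass as GRPO and requires no additional rollouts. ECHO doubles GRPO pass@1 on TerminalBench-2.0: Qwen3-8B improves from 2.70% to 5.17%, and Qwen3-14B from 5.17% to 10.79%. ECHO also produces policies that better predict terminal dynamics, even on trajectories they did not generate. From base Qwen3-8B, ECHO matches expert-SFT-then-GRPO performance on held-out terminal tasks without expert demonstrations, and recovers roughly half of the expert-SFT initialization benefit on TerminalBench-2.0. In some settings, the environment prediction loss alone enables verifier-free self-improvement, allowing policies to improve on unseen OOD tasks by learning only from environment interactions. These results suggest that environment observations are not merely context for future actions, but a dense, on-policy supervision signal already present in every rollout.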
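
The hybrid objective described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `env_weight` mixing coefficient, and the plain mean over tokens are assumptions made for illustration, and the sketch omits the importance ratios, clipping, and KL terms that practical GRPO training typically uses. It only shows how one set of per-token log-probabilities from a single forward pass could feed both a policy-gradient term on action tokens and a cross-entropy term on environment observation tokens.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: rewards standardized within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def hybrid_loss(token_logps, is_action, advantages, env_weight=0.1):
    """Combine a policy-gradient term and an environment cross-entropy term.

    token_logps[i][t]: log-prob the policy assigns to token t of rollout i
    is_action[i][t]:   True for tokens the agent emitted, False for terminal output
    advantages[i]:     one scalar advantage per rollout
    env_weight:        hypothetical mixing coefficient (not given in the abstract)
    """
    pg_terms, env_terms = [], []
    for logps, mask, adv in zip(token_logps, is_action, advantages):
        for lp, act in zip(logps, mask):
            if act:
                # Policy-gradient surrogate on action tokens.
                pg_terms.append(-adv * lp)
            else:
                # Next-token cross-entropy on environment observation tokens.
                env_terms.append(-lp)
    pg = mean(pg_terms) if pg_terms else 0.0
    env = mean(env_terms) if env_terms else 0.0
    return pg + env_weight * env, pg, env


if __name__ == "__main__":
    # Two rollouts that both failed: identical rewards give zero advantages.
    adv = group_advantages([0.0, 0.0])
    logps = [[math.log(0.5), math.log(0.25)], [math.log(0.5), math.log(0.5)]]
    mask = [[True, False], [True, False]]
    total, pg, env = hybrid_loss(logps, mask, adv, env_weight=1.0)
    print(f"policy-gradient term={pg:.4f} env term={env:.4f} total={total:.4f}")
```

In this toy group every rollout fails, so the standardized advantages are zero and the policy-gradient term vanishes, while the observation-token term still contributes a nonzero loss. That matches the abstract's point that failed rollouts carry little policy-gradient signal but still contain evidence about how the environment responds.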
