Fugu-MT Paper Translation (Summary): Interaction as Intelligence Part II: Asynchronous Human-Agent Rollout for Long-Horizon Task Training

Paper overview: Interaction as Intelligence Part II: Asynchronous Human-Agent Rollout for Long-Horizon Task Training

arxiv url: http://arxiv.org/abs/2510.27630v2
Date: Mon, 03 Nov 2025 10:53:11 GMT
Status: Translation complete
Last updated in system: 2025-11-04 14:12:28.037165
Title: Interaction as Intelligence Part II: Asynchronous Human-Agent Rollout for Long-Horizon Task Training
Authors: Dayuan Fu, Yunze Wu, Xiaojie Cai, Lyumanshan Ye, Shijie Xia, Zhen Huang, Weiye Si, Tianze Xu, Jie Sun, Keyu Li, Mohan Jiang, Junfei Wang, Qishuo Hua, Pengrui Lu, Yang Xiao, Pengfei Liu
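Abstract summary: We introduce Apollo, a sampling framework that integrates asynchronous human guidance with action-level data filtering. Experiments show that Apollo achieves more than a 50% improvement over the untrained baseline and a 28% improvement over a variant trained without human interaction.
Reference score (independently computed attention score): 29.758745480975943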
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Model (LLM) agents have recently shown strong potential in domains such as automated coding, deep research, and graphical user interface manipulation. However, training them to succeed on long-horizon, domain-specialized tasks remains challenging. Current methods primarily fall into two categories. The first relies on dense human annotations through behavior cloning, which is prohibitively expensive for long-horizon tasks that can take days or months. The second depends on outcome-driven sampling, which often collapses due to the rarity of valid positive trajectories on domain-specialized tasks. We introduce Apollo, a sampling framework that integrates asynchronous human guidance with action-level data filtering. Instead of requiring annotators to shadow every step, Apollo allows them to intervene only when the agent drifts from a promising trajectory, by providing prior knowledge, strategic advice, etc. This lightweight design makes it possible to sustain interactions for over 30 hours and produces valuable trajectories at a lower cost. Apollo then applies supervision control to filter out sub-optimal actions and prevent error propagation. Together, these components enable reliable and effective data collection in long-horizon environments. To demonstrate the effectiveness of Apollo, we evaluate it using InnovatorBench. Our experiments show that when applied to train the GLM-4.5 model on InnovatorBench, Apollo achieves more than a 50% improvement over the untrained baseline and a 28% improvement over a variant trained without human interaction. These results highlight the critical role of human-in-the-loop sampling and the robustness of Apollo's design in handling long-horizon, domain-specialized tasks.
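The two ideas named in the abstract, asynchronous human guidance and action-level filtering, can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the mechanism, so the names `Step`, `rollout`, `filter_actions`, `guidance_queue`, and `is_acceptable` are all hypothetical, and the policy and environment are stand-in callables.

```python
from dataclasses import dataclass
from queue import Queue, Empty
from typing import Callable, List, Optional


@dataclass
class Step:
    observation: str
    action: str
    guidance: Optional[str] = None  # human note seen by the agent at this step, if any
    keep: bool = True               # False = excluded from training (hypothetical flag)


def rollout(
    policy: Callable[[str, Optional[str]], str],
    env_step: Callable[[str], str],
    guidance_queue: "Queue[str]",
    max_steps: int,
) -> List[Step]:
    """Run the agent; a human may drop guidance into the queue at any time."""
    steps: List[Step] = []
    observation = "start"
    for _ in range(max_steps):
        # Non-blocking read: the agent never waits for the human (assumed design).
        try:
            guidance: Optional[str] = guidance_queue.get_nowait()
        except Empty:
            guidance = None
        action = policy(observation, guidance)
        steps.append(Step(observation, action, guidance))
        observation = env_step(action)
    return steps


def filter_actions(
    steps: List[Step], is_acceptable: Callable[[Step], bool]
) -> List[Step]:
    """Mark each action individually instead of accepting or rejecting the whole trajectory."""
    for step in steps:
        step.keep = is_acceptable(step)
    return steps
```

Under these assumptions, only steps with `keep=True` would contribute to training, so one poor action does not force the whole long trajectory to be discarded.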


Related papers