Fugu-MT Paper Translation (Abstract): Test-Time Deep Thinking to Explore Implicit Rules

Paper abstract: Test-Time Deep Thinking to Explore Implicit Rules

arxiv url: http://arxiv.org/abs/2605.24828v2
Date: Sun, 31 May 2026 15:58:08 GMT
Status: Translation complete
Last updated in system: 2026-06-02 18:24:16.336229
Title: Test-Time Deep Thinking to Explore Implicit Rules
Authors: Wentong Chen, Xin Cong, Zhong Zhang, Yaxi Lu, Siyuan Zhao, Yesai Wu, Qinyu Luo, Haotian Chen, Yankai Lin, Zhiyuan Liu, Maosong Sun
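Abstract summary: Test-Time Exploration (TTExplore) is a framework in which a thinker component analyzes interaction history to infer implicit rules and guide an actor. In experiments on five text-based embodied tasks, TTExplore equipped with Exp-Thinker improves baseline agent performance by an average of 14 to 19 points.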
Reference score (independently computed attention score): 80.74526536918196
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the continuous advancement of Large Language Models (LLMs), intelligent agents are becoming increasingly vital. However, these agents often fail in environments governed by implicit rules--hidden constraints that cannot be observed directly and must be inferred through interaction. This causes agents to fall into repetitive trial-and-error loops, ultimately leading to task failure. To address this challenge, we propose Test-Time Exploration (TTExplore), a framework where a thinker component analyzes interaction history to infer these implicit rules and guide an actor. Effective exploration in this setting critically depends on the reasoning ability of the thinker. However, evaluating deep reasoning trajectories is inherently unstable and difficult, which poses a major obstacle to effective training. To overcome this issue, we introduce a novel and stable reinforcement learning pipeline. The core idea is to use accurate task-level scores as indirect rewards to bypass the difficulty of evaluating intermediate reasoning, and to retain only a single thinking node per trajectory to alleviate reward sparsity. Using this pipeline, we train a specialized 7B model, Exp-Thinker. Experiments on five text-based embodied tasks show that TTExplore equipped with Exp-Thinker improves baseline agent performance by an average of $14$-$19$ points, demonstrating the effectiveness of explicitly reasoning about implicit rules.
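
The thinker/actor split described in the abstract can be illustrated with a toy loop. The sketch below is not the authors' implementation: the environment (`ToyEnv`), its hidden rule, and the hand-written `thinker` heuristic are all hypothetical stand-ins, and a real thinker would be a trained model such as Exp-Thinker rather than a rule. It only shows how inferring an implicit rule from the interaction history can break a repetitive trial-and-error loop.

```python
from dataclasses import dataclass


@dataclass
class ToyEnv:
    """Hypothetical environment with one implicit rule:
    'open door' only works after 'pull lever'. The rule is never stated."""
    lever_pulled: bool = False
    done: bool = False

    def step(self, action: str) -> str:
        if action == "pull lever":
            self.lever_pulled = True
            return "The lever clicks."
        if action == "open door":
            if self.lever_pulled:
                self.done = True
                return "The door opens."
            return "Nothing happens."
        return "Unknown action."


def thinker(history):
    """Stand-in for the thinker: read the interaction history and return a hint.
    Assumption: a fixed heuristic replaces the learned reasoning model."""
    failed_opens = sum(
        1 for action, obs in history
        if action == "open door" and obs == "Nothing happens."
    )
    lever_tried = any(action == "pull lever" for action, _ in history)
    if failed_opens >= 2 and not lever_tried:
        # Inferred rule: the door seems to need a precondition.
        return "pull lever"
    return None


def actor(hint):
    """Stand-in for the actor: follow the hint, else repeat the default action."""
    return hint if hint is not None else "open door"


def run_episode(use_thinker: bool, max_steps: int = 6):
    """Return (success, steps_used) for one episode."""
    env = ToyEnv()
    history = []
    for step in range(1, max_steps + 1):
        hint = thinker(history) if use_thinker else None
        action = actor(hint)
        observation = env.step(action)
        history.append((action, observation))
        if env.done:
            return True, step
    return False, max_steps


if __name__ == "__main__":
    print("with thinker:   ", run_episode(use_thinker=True))
    print("without thinker:", run_episode(use_thinker=False))
```

Without the thinker, the actor repeats the same failing action until the step budget runs out; with it, two failures are enough to trigger a hint that satisfies the hidden precondition.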
