Fugu-MT Paper Translation (Abstract): Intrinsic Credit Assignment for Long Horizon Interaction

Paper overview: Intrinsic Credit Assignment for Long Horizon Interaction

arxiv url: http://arxiv.org/abs/2602.12342v1
Date: Thu, 12 Feb 2026 19:00:54 GMT
Status: Translation complete
Last updated in system: 2026-02-16 23:37:53.724266
Title: Intrinsic Credit Assignment for Long Horizon Interaction
Title (reference translation): Intrinsic Credit Assignment for Long-Horizon Interaction
Authors: Ilze Amanda Auzina, Joschka Strüber, Sergio Hernández-Gutiérrez, Shashwat Goel, Ameya Prabhu, Matthias Bethge
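Abstract summary: ΔBelief-RL teaches information-seeking capabilities that consistently outperform purely outcome-based rewards in reinforcement learning. The work introduces a scalable training strategy for navigating uncertainty over long horizons by assigning credit to intermediate actions through intrinsic belief rewards.
Reference score (independently computed attention metric): 20.67253382614053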
License: http://creativecommons.org/licenses/by/4.0/
Abstract: How can we train agents to navigate uncertainty over long horizons? In this work, we propose ΔBelief-RL, which leverages a language model's own intrinsic beliefs to reward intermediate progress. Our method utilizes the change in the probability an agent assigns to the target solution for credit assignment. By training on synthetic interaction data, ΔBelief-RL teaches information-seeking capabilities that consistently outperform purely outcome-based rewards for Reinforcement Learning, with improvements generalizing to out-of-distribution applications ranging from customer service to personalization. Notably, the performance continues to improve as we scale test-time interactions beyond the training horizon, with interaction efficiency increasing even on Pass@k metrics. Overall, our work introduces a scalable training strategy for navigating uncertainty over long horizons by enabling credit assignment to intermediate actions via intrinsic ΔBelief rewards.
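Abstract (reference translation): How can we train agents to navigate uncertainty over long horizons? This work proposes ΔBelief-RL, which leverages a language model's own intrinsic beliefs. The proposed method uses the change in the probability that the agent assigns to the target solution to assign credit. By training on synthetic interaction data, ΔBelief-RL teaches information-seeking capabilities that consistently outperform purely outcome-based rewards in reinforcement learning, with gains that generalize to out-of-distribution applications ranging from customer service to personalization. Notably, performance continues to improve as test-time interactions are scaled beyond the training horizon, and interaction efficiency improves even on Pass@k metrics. Overall, the work introduces a scalable training strategy for navigating uncertainty over long horizons by enabling credit assignment to intermediate actions through intrinsic ΔBelief rewards.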
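Illustration (not from the paper): the abstract describes the reward only as "the change in the probability an agent assigns to the target solution". A minimal Python sketch of that idea follows, under stated assumptions: the per-turn reward is taken as the plain difference of probabilities, the belief values are supplied by the caller, and the function name `delta_belief_rewards` is hypothetical. The abstract does not say whether raw probabilities or log-probabilities are used, how the beliefs are elicited from the language model, or how these rewards are combined with the final outcome reward.

```python
from typing import List


def delta_belief_rewards(beliefs: List[float]) -> List[float]:
    """Return one reward per interaction turn (hypothetical helper).

    beliefs[0] is the probability assigned to the target solution before
    any interaction; beliefs[t] is the probability after turn t.
    Assumption: reward_t = beliefs[t] - beliefs[t - 1].
    """
    for p in beliefs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"belief {p!r} is not a probability in [0, 1]")
    # Pair each belief with its successor and take the difference.
    return [after - before for before, after in zip(beliefs, beliefs[1:])]


# Made-up belief trajectory over four turns; the third turn lowers the belief.
example_beliefs = [0.05, 0.10, 0.40, 0.35, 0.90]
example_rewards = delta_belief_rewards(example_beliefs)
```

Under this assumed form the per-turn rewards telescope: their sum equals the final belief minus the initial one, so a turn that raises the probability of the target is rewarded and a turn that lowers it is penalized, without waiting for the episode outcome.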

Related papers