Fugu-MT Paper Translation (Abstract): Is Temporal Difference Learning the Gold Standard for Stitching in RL?

Paper summary: Is Temporal Difference Learning the Gold Standard for Stitching in RL?

arxiv url: http://arxiv.org/abs/2510.21995v1
Date: Fri, 24 Oct 2025 20:00:14 GMT
Status: Translation complete
Last updated in system: 2025-10-28 15:28:14.726304
Title: Is Temporal Difference Learning the Gold Standard for Stitching in RL?
Authors: Michał Bortkiewicz, Władysław Pałucki, Mateusz Ostaszewski, Benjamin Eysenbach
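Abstract summary: This paper examines whether the conventional wisdom about stitching actually holds in settings where function approximation is used. The authors show empirically that Monte Carlo (MC) methods can also achieve experience stitching. They find that increasing critic capacity effectively reduces the generalization gap for both MC and TD methods.
Reference score (independently computed measure of attention): 27.801632071235897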
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning (RL) promises to solve long-horizon tasks even when training data contains only short fragments of the behaviors. This experience stitching capability is often viewed as the purview of temporal difference (TD) methods. However, outside of small tabular settings, trajectories never intersect, calling into question this conventional wisdom. Moreover, the common belief is that Monte Carlo (MC) methods should not be able to recombine experience, yet it remains unclear whether function approximation could result in a form of implicit stitching. The goal of this paper is to empirically study whether the conventional wisdom about stitching actually holds in settings where function approximation is used. We empirically demonstrate that Monte Carlo (MC) methods can also achieve experience stitching. While TD methods do achieve slightly stronger capabilities than MC methods (in line with conventional wisdom), that gap is significantly smaller than the gap between small and large neural networks (even on quite simple tasks). We find that increasing critic capacity effectively reduces the generalization gap for both the MC and TD methods. These results suggest that the traditional TD inductive bias for stitching may be less necessary in the era of large models for RL and, in some cases, may offer diminishing returns. Additionally, our results suggest that stitching, a form of generalization unique to the RL setting, might be achieved not through specialized algorithms (temporal difference learning) but rather through the same recipe that has provided generalization in other machine learning settings (via scale). Project website: https://michalbortkiewicz.github.io/golden-standard/
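
The tabular intuition behind the conventional wisdom mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's experimental setup: the three-state dataset, the reward of 1 at the goal, the discount factor, and the helper names `mc_values` and `td_values` are all hypothetical choices made for illustration. Two short fragments share state "B"; MC averages the returns seen inside each fragment, while TD(0) bootstraps through the shared state and so propagates value back to "A".

```python
# Hypothetical toy dataset: two short fragments that intersect at state "B".
# Each transition is (state, reward, next_state, done).
GAMMA = 0.9
trajectories = [
    [("A", 0.0, "B", False)],                         # fragment 1: A -> B, then the data ends
    [("B", 0.0, "C", False), ("C", 1.0, "G", True)],  # fragment 2: B -> C -> goal G
]


def mc_values(trajs, gamma):
    """Average the discounted return observed from each state within its own fragment."""
    returns = {}
    for traj in trajs:
        g = 0.0
        for state, reward, _next_state, _done in reversed(traj):
            g = reward + gamma * g
            returns.setdefault(state, []).append(g)
    return {s: sum(v) / len(v) for s, v in returns.items()}


def td_values(trajs, gamma, sweeps=50):
    """Tabular TD(0) with step size 1 (valid here because transitions are deterministic)."""
    values = {}
    for _ in range(sweeps):
        for traj in trajs:
            for state, reward, next_state, done in traj:
                bootstrap = 0.0 if done else values.get(next_state, 0.0)
                values[state] = reward + gamma * bootstrap
    return values


mc = mc_values(trajectories, GAMMA)
td = td_values(trajectories, GAMMA)
print("MC value of A:", mc["A"])  # return seen in fragment 1 only
print("TD value of A:", td["A"])  # value propagated through the shared state B
```

In this sketch MC assigns "A" a value of zero because its fragment never reaches the goal, whereas TD assigns it roughly 0.81 (two discounted steps from the reward). The abstract's point is that with function approximation, where trajectories rarely intersect exactly, this clean separation may no longer hold.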

Related papers