Fugu-MT Paper Translation (Abstract): Direction-Conditioned Policies via Compositional Subgoal Scoring for Online Goal-Conditioned Reinforcement Learning

Paper overview: Direction-Conditioned Policies via Compositional Subgoal Scoring for Online Goal-Conditioned Reinforcement Learning

arxiv url: http://arxiv.org/abs/2606.16515v1
Date: Mon, 15 Jun 2026 10:18:30 GMT
Status: Translation complete
Last updated in system: 2026-06-16 16:21:34.447015
Title: Direction-Conditioned Policies via Compositional Subgoal Scoring for Online Goal-Conditioned Reinforcement Learning
Authors: Swaminathan S K, Damiya Gondha, Theyanesh Eswaramoorthy Rajahkrishnan, Aritra Hazra,
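Abstract summary: Hamilton-Jacobi-Bellman theory implies that the optimal goal-conditioned action depends on the goal only through the gradient of the goal-reaching distance at the current state. Direction-Conditioned Policies (DCP) is a fully online method that decomposes goal-reaching into two components sharing one InfoNCE representation.
Reference score (independently computed attention measure): 1.5282767384702272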
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Hamilton-Jacobi-Bellman theory implies that the optimal goal-conditioned action depends on the goal only through the gradient of the goal-reaching distance at the current state, yet standard online GCRL still conditions the actor on the raw goal -- a signal that is geometrically uninformative when the goal is far from the data distribution. We propose Direction-Conditioned Policies (DCP), a fully online method that decomposes goal-reaching into two components sharing one InfoNCE representation $ψ$: a subgoal-scoring step that selects a visited state $z_t$ aligned with the final goal $g$ in $ψ_g$, and a direction-conditioned actor that consumes the unit direction $d_t$ and magnitude $r_t$ from $ψ(s_t)$ to $ψ(z_t)$. The two components train jointly, factor cleanly at deployment (subgoal scoring is removed, while direction conditioning remains with $g$ in place of $z_t$), and admit independent modification at the same $(d_t,r_t)$ interface. We prove three results. First, direction sufficiency under HJB: the optimal action under control-affine dynamics depends on the goal only through the value gradient. Second, a quantitative bound showing that, under mild conditions on the learned representation and assuming the scoring rule returns an on-path $z_t$, the actor's conditioning input at training and at deployment coincide up to representation error and geodesic slack. Third, a controllable-subspace characterization of when directional conditioning fails. Across nine environments, DCP improves over Contrastive RL on most final metrics, with the largest gains on manipulation and obstacle-interaction tasks; a qualitative analysis of the learned $ψ$-distance landscape shows the contrastive representation behaves as an online quasimetric encoding environment topology, and the single failure case (AntSoccer) localizes to a learned-gradient pathology that the theory anticipates.
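The $(d_t, r_t)$ interface described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the scoring rule, so the cosine-alignment score below is a hypothetical stand-in, and the function names `direction_and_magnitude`, `select_subgoal`, and `actor_input` are assumptions made for illustration; embeddings are plain Python lists standing in for outputs of the learned representation $ψ$.

```python
import math

def direction_and_magnitude(psi_from, psi_to, eps=1e-8):
    """Return (unit direction d, magnitude r) of the vector psi_from -> psi_to."""
    delta = [b - a for a, b in zip(psi_from, psi_to)]
    r = math.sqrt(sum(x * x for x in delta))
    d = [x / max(r, eps) for x in delta]  # eps guards against division by zero
    return d, r

def select_subgoal(psi_s, psi_visited, psi_g):
    """Pick the index of the visited embedding best aligned with the goal.

    ASSUMPTION: cosine alignment between the direction to a candidate and
    the direction to the goal. The paper's actual scoring rule is not given
    in the abstract.
    """
    d_goal, _ = direction_and_magnitude(psi_s, psi_g)

    def score(psi_z):
        d_z, _ = direction_and_magnitude(psi_s, psi_z)
        return sum(a * b for a, b in zip(d_z, d_goal))

    return max(range(len(psi_visited)), key=lambda i: score(psi_visited[i]))

def actor_input(psi_s, psi_target):
    """Conditioning vector for the actor: unit direction followed by magnitude."""
    d, r = direction_and_magnitude(psi_s, psi_target)
    return d + [r]

# Toy embeddings (hypothetical values, not taken from the paper).
psi_s = [0.0, 0.0]
psi_visited = [[1.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
psi_g = [0.0, 5.0]

# Training: condition on the selected visited subgoal z_t.
z_index = select_subgoal(psi_s, psi_visited, psi_g)
train_input = actor_input(psi_s, psi_visited[z_index])

# Deployment: subgoal scoring is dropped and g takes the place of z_t.
deploy_input = actor_input(psi_s, psi_g)
```

In this toy case the selected subgoal lies on the straight line toward the goal, so the training and deployment inputs share the same unit direction and differ only in magnitude, which is the kind of train/deploy agreement the second theoretical result in the abstract bounds.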

Related papers