Fugu-MT paper translation (abstract): Inverting the Bellman Equation: From $Q$-Values to World Models

Paper abstract: Inverting the Bellman Equation: From $Q$-Values to World Models

arxiv url: http://arxiv.org/abs/2606.21173v1
Date: Fri, 19 Jun 2026 07:26:14 GMT
Status: retrieving information
Last updated in system: 2026-06-23 11:20:49.420502
Title: Inverting the Bellman Equation: From $Q$-Values to World Models
Title (reference translation): Inverting the Bellman Equation: From $Q$-Values to World Models
Authors: Alistair Letcher, Mattie Fellows, Alexander D. Goldie, Jonathan Richens, Jakob N. Foerster, Oliver Richardson
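Abstract summary: We prove that value-based agents trained on a sufficiently rich set of reward functions implicitly encode a unique and accurate world model. We find that policies trained exclusively on a \texttt{Reacher} agent's implicit world model are quasi-optimal on out-of-distribution, velocity-based goals despite position-only training.
Reference score (independently computed attention measure): 57.827849584133425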
License:
Abstract: Model-based and model-free reinforcement learning are traditionally viewed as separate paradigms: instead of learning a model of the transition kernel $P$, model-free agents typically estimate value functions tied to a specific policy and reward. In this paper, we challenge this dichotomy by proving that value-based agents trained on a sufficiently rich set of reward functions, e.g. using goal-conditioned RL, implicitly encode a unique and accurate world model. To extract this model in practice, we introduce \textit{$P$-learning}, an inverse analogue to $Q$-learning that samples from an agent's $Q$-values, policies and rewards to decode its internal model of the environment. We then provide sufficient conditions on the type and number of goals for which agents encode the true kernel $P$, covering both stochastic and deterministic MDPs over finite or continuous state spaces. Even when our assumptions are violated, we empirically demonstrate that agents trained on a handful of reward functions encode accurate dynamics in $\texttt{Reacher}$, $\texttt{MountainCar}$ and stochastic variants of $\texttt{FourRooms}$. Surprisingly, we find that policies trained exclusively on a \texttt{Reacher} agent's implicit world model are quasi-optimal on out-of-distribution, velocity-based goals despite position-only training -- suggesting that agents contain hidden generalisation capabilities and providing a new lens into the connection between model-based, model-free, and goal-conditioned RL.
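The idea of reading a transition kernel out of $Q$-values can be illustrated in the simplest tabular case. The sketch below is not the paper's $P$-learning procedure, which the abstract describes as sampling-based; it is an exact linear-algebra toy under assumptions introduced here: a 3-state, 2-action MDP, one fixed known policy shared by all goals, one indicator reward per state, and a known discount. All names (`P_TRUE`, `evaluate_q`, `recover_kernel`, and so on) are hypothetical. For each pair $(s, a)$ it solves the Bellman equation $Q_g(s,a) = r_g(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_g(s')$ for $P(\cdot \mid s,a)$, with $V_g(s') = Q_g(s', \pi(s'))$.

```python
from fractions import Fraction as F

GAMMA = F(9, 10)      # discount factor, assumed known to the decoder
N_S, N_A = 3, 2       # toy sizes (assumption): 3 states, 2 actions
POLICY = [0, 1, 0]    # one fixed deterministic policy: state -> action

# Ground-truth kernel P_TRUE[s][a][s'], used only to build the "agent".
P_TRUE = [
    [[F(1, 2), F(1, 2), F(0)], [F(0), F(1), F(0)]],
    [[F(0), F(1, 4), F(3, 4)], [F(1), F(0), F(0)]],
    [[F(1, 3), F(1, 3), F(1, 3)], [F(0), F(0), F(1)]],
]


def reward(goal, s, a):
    """Goal-conditioned reward: 1 whenever the current state is the goal."""
    return F(1) if s == goal else F(0)


def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A square, invertible)."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        pivot_val = M[col][col]
        M[col] = [x / pivot_val for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def evaluate_q(goal):
    """Exact Q^pi for one goal: solve (I - gamma P_pi) V = r_pi, then back up."""
    A = [[F(int(s == t)) - GAMMA * P_TRUE[s][POLICY[s]][t] for t in range(N_S)]
         for s in range(N_S)]
    b = [reward(goal, s, POLICY[s]) for s in range(N_S)]
    V = solve(A, b)
    return [[reward(goal, s, a)
             + GAMMA * sum(P_TRUE[s][a][t] * V[t] for t in range(N_S))
             for a in range(N_A)] for s in range(N_S)]


def recover_kernel(q_tables, policy, reward_fn, gamma):
    """Invert the Bellman equation using only Q-values, the policy and rewards.

    Assumes as many goals as states and an invertible matrix of state values.
    """
    n_goals, n_s, n_a = len(q_tables), len(q_tables[0]), len(q_tables[0][0])
    # V[g][t] = V_g(t) = Q_g(t, pi(t)); one row per goal.
    V = [[q_tables[g][t][policy[t]] for t in range(n_s)] for g in range(n_goals)]
    p_hat = []
    for s in range(n_s):
        row = []
        for a in range(n_a):
            # sum_t P(t|s,a) V_g(t) = (Q_g(s,a) - r_g(s,a)) / gamma for every goal g
            targets = [(q_tables[g][s][a] - reward_fn(g, s, a)) / gamma
                       for g in range(n_goals)]
            row.append(solve(V, targets))
        p_hat.append(row)
    return p_hat


Q_TABLES = [evaluate_q(g) for g in range(N_S)]
P_HAT = recover_kernel(Q_TABLES, POLICY, reward, GAMMA)
print(P_HAT == P_TRUE)
```

With one indicator reward per state and a shared policy, the matrix of state values is the transpose of $(I - \gamma P_\pi)^{-1}$ and is therefore invertible, which is why this toy setting pins down $P$ uniquely. With fewer or less varied rewards the linear system becomes underdetermined. This is only a loose analogue of the "sufficiently rich set of reward functions" in the abstract, not a statement of the paper's conditions.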
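Abstract (reference translation): Model-based and model-free reinforcement learning are traditionally regarded as separate paradigms: rather than learning a model of the transition kernel $P$, model-free agents usually estimate value functions tied to a specific policy and reward. In this paper, we challenge that dichotomy by proving that value-based agents trained on a sufficiently rich set of reward functions, for example with goal-conditioned RL, implicitly encode a unique and accurate world model. To extract this model in practice, we introduce \textit{$P$-learning}, an inverse analogue of $Q$-learning that samples from an agent's $Q$-values, policies and rewards to decode its internal model of the environment. We then give sufficient conditions on the type and number of goals under which agents encode the true kernel $P$, covering both stochastic and deterministic MDPs over finite or continuous state spaces. Even when our assumptions are violated, we show empirically that agents trained on a handful of reward functions encode accurate dynamics in $\texttt{Reacher}$, $\texttt{MountainCar}$ and stochastic variants of $\texttt{FourRooms}$. Surprisingly, policies trained exclusively on a \texttt{Reacher} agent's implicit world model are quasi-optimal on out-of-distribution, velocity-based goals despite position-only training.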

List of related papers