Foresee then Evaluate: Decomposing Value Estimation with Latent Future
Prediction
- URL: http://arxiv.org/abs/2103.02225v1
- Date: Wed, 3 Mar 2021 07:28:56 GMT
- Title: Foresee then Evaluate: Decomposing Value Estimation with Latent Future
Prediction
- Authors: Hongyao Tang, Jianye Hao, Guangyong Chen, Pengfei Chen, Chen Chen,
Yaodong Yang, Luo Zhang, Wulong Liu, Zhaopeng Meng
- Abstract summary: The value function is the central notion of Reinforcement Learning (RL).
We propose Value Decomposition with Future Prediction (VDFP).
We analytically decompose the value function into a latent future dynamics part and a policy-independent trajectory return part, inducing a way to model latent dynamics and returns separately in value estimation.
- Score: 37.06232589005015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The value function is the central notion of Reinforcement Learning (RL). Value
estimation, especially with function approximation, can be challenging since it
involves the stochasticity of environmental dynamics and reward signals that
can be sparse and delayed in some cases. A typical model-free RL algorithm
usually estimates the values of a policy with Temporal Difference (TD) or Monte
Carlo (MC) methods directly from rewards, without explicitly taking dynamics
into consideration. In this paper, we propose Value Decomposition with Future
Prediction (VDFP), providing an explicit two-step understanding of the value
estimation process: 1) first foresee the latent future, and 2) then evaluate
it. We analytically decompose the value function into a latent future dynamics
part and a policy-independent trajectory return part, inducing a way to model
latent dynamics and returns separately in value estimation. Further, we derive
a practical deep RL algorithm, consisting of a convolutional model to learn
a compact trajectory representation from past experiences, a conditional
variational auto-encoder to predict the latent future dynamics, and a convex
return model that evaluates the trajectory representation. In experiments, we
empirically demonstrate the effectiveness of our approach for both off-policy
and on-policy RL in several OpenAI Gym continuous control tasks as well as a
few challenging variants with delayed reward.
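To make the two-step view concrete, below is a minimal, illustrative sketch of the decomposition described in the abstract: a convolutional encoder that compresses a (state, action) segment into a compact trajectory representation, a conditional variational auto-encoder that "foresees" that representation from the current state, and a return model that "evaluates" it. The PyTorch framing, module names, and network sizes are assumptions made for illustration; in particular, the plain-MLP return head stands in for the paper's convex return model, and none of this is the authors' released implementation.

```python
# Minimal, illustrative sketch of a VDFP-style value estimate.
# Assumptions: PyTorch, MLP return head instead of the paper's convex return model.
import torch
import torch.nn as nn


class TrajectoryEncoder(nn.Module):
    """Compress a length-T (state, action) segment into a compact representation z."""
    def __init__(self, obs_act_dim: int, z_dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(obs_act_dim, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(64, z_dim)

    def forward(self, segment):                      # segment: (B, T, obs_act_dim)
        h = self.conv(segment.transpose(1, 2)).squeeze(-1)
        return self.fc(h)                             # (B, z_dim)


class LatentFuturePredictor(nn.Module):
    """Conditional VAE: 'foresee' the latent future representation from the state."""
    def __init__(self, obs_dim: int, z_dim: int = 64, h_dim: int = 128):
        super().__init__()
        self.post = nn.Sequential(nn.Linear(obs_dim + z_dim, h_dim), nn.ReLU(),
                                  nn.Linear(h_dim, 2 * z_dim))   # q(u | s, z)
        self.prior = nn.Sequential(nn.Linear(obs_dim, h_dim), nn.ReLU(),
                                   nn.Linear(h_dim, 2 * z_dim))  # p(u | s)
        self.dec = nn.Sequential(nn.Linear(obs_dim + z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, z_dim))        # z_hat = dec(s, u)

    @staticmethod
    def _sample(params):
        mu, logvar = params.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp(), mu, logvar

    def forward(self, s, z_target):                   # training path (reconstruction)
        u, mu, logvar = self._sample(self.post(torch.cat([s, z_target], dim=-1)))
        return self.dec(torch.cat([s, u], dim=-1)), mu, logvar

    def foresee(self, s):                             # inference path (prediction)
        u, _, _ = self._sample(self.prior(s))
        return self.dec(torch.cat([s, u], dim=-1))


class ReturnModel(nn.Module):
    """'Evaluate': map a trajectory representation to a scalar return."""
    def __init__(self, z_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, z):
        return self.net(z)


@torch.no_grad()
def estimate_value(state, predictor, return_model, n_samples: int = 8):
    """V(s): sample latent futures, evaluate each, and average. state: (1, obs_dim)."""
    z_hat = predictor.foresee(state.repeat(n_samples, 1))
    return return_model(z_hat).mean()
```

At value-estimation time, several latent futures are sampled from the conditional prior and their predicted returns are averaged, which mirrors the "foresee then evaluate" reading of the value estimate.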
Related papers
- Q-value Regularized Transformer for Offline Reinforcement Learning [70.13643741130899]
We propose a Q-value regularized Transformer (QT) to enhance the state of the art in offline reinforcement learning (RL).
QT learns an action-value function and integrates a term maximizing action-values into the training loss of Conditional Sequence Modeling (CSM).
Empirical evaluations on D4RL benchmark datasets demonstrate the superiority of QT over traditional DP and CSM methods.
arXiv Detail & Related papers (2024-05-27T12:12:39Z)
- A Bayesian Approach to Robust Inverse Reinforcement Learning [54.24816623644148]
We consider a Bayesian approach to offline model-based inverse reinforcement learning (IRL).
The proposed framework differs from existing offline model-based IRL approaches by performing simultaneous estimation of the expert's reward function and subjective model of environment dynamics.
Our analysis reveals a novel insight that the estimated policy exhibits robust performance when the expert is believed to have a highly accurate model of the environment.
arXiv Detail & Related papers (2023-09-15T17:37:09Z)
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
- Evaluating Pedestrian Trajectory Prediction Methods with Respect to Autonomous Driving [0.9217021281095907]
In this paper, we assess the state of the art in pedestrian trajectory prediction within the context of generating single trajectories.
The evaluation is conducted on the widely used ETH/UCY dataset, where the Average Displacement Error (ADE) and the Final Displacement Error (FDE) are reported.
arXiv Detail & Related papers (2023-08-09T19:21:50Z)
- Model-Based Offline Reinforcement Learning with Pessimism-Modulated Dynamics Belief [3.0036519884678894]
Model-based offline reinforcement learning (RL) aims to find a highly rewarding policy by leveraging a previously collected static dataset and a dynamics model.
In this work, we maintain a belief distribution over dynamics, and evaluate/optimize policy through biased sampling from the belief.
We show that the biased sampling naturally induces an updated dynamics belief with a policy-dependent reweighting factor, termed Pessimism-Modulated Dynamics Belief.
arXiv Detail & Related papers (2022-10-13T03:14:36Z)
- Value Gradient weighted Model-Based Reinforcement Learning [28.366157882991565]
Model-based reinforcement learning (MBRL) is a sample-efficient technique to obtain control policies.
VaGraM is a novel method for value-aware model learning.
arXiv Detail & Related papers (2022-04-04T13:28:31Z)
- Robust and Adaptive Temporal-Difference Learning Using An Ensemble of Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
arXiv Detail & Related papers (2021-12-01T23:15:09Z)
- Generative Temporal Difference Learning for Infinite-Horizon Prediction [101.59882753763888]
We introduce the $\gamma$-model, a predictive model of environment dynamics with an infinite probabilistic horizon.
We discuss how its training reflects an inescapable tradeoff between training-time and testing-time compounding errors.
arXiv Detail & Related papers (2020-10-27T17:54:12Z)
- Value-driven Hindsight Modelling [68.658900923595]
Value estimation is a critical component of the reinforcement learning (RL) paradigm.
Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function.
We develop an approach for representation learning in RL that sits in between these two extremes.
This provides tractable prediction targets that are directly relevant for a task, and can thus accelerate learning the value function.
arXiv Detail & Related papers (2020-02-19T18:10:20Z)