A Unified Off-Policy Evaluation Approach for General Value Function
- URL: http://arxiv.org/abs/2107.02711v1
- Date: Tue, 6 Jul 2021 16:20:34 GMT
- Title: A Unified Off-Policy Evaluation Approach for General Value Function
- Authors: Tengyu Xu, Zhuoran Yang, Zhaoran Wang, Yingbin Liang
- Abstract summary: General Value Function (GVF) is a powerful tool to represent both predictive and retrospective knowledge in reinforcement learning (RL).
In this paper, we propose a new algorithm called GenTD for off-policy GVF evaluation.
We show that GenTD learns multiple interrelated multi-dimensional GVFs as efficiently as a single canonical scalar value function.
- Score: 131.45028999325797
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: General Value Function (GVF) is a powerful tool to represent both the predictive and retrospective knowledge in reinforcement learning (RL).
In practice, often multiple interrelated GVFs need to be evaluated jointly with
pre-collected off-policy samples. In the literature, the gradient temporal
difference (GTD) learning method has been adopted to evaluate GVFs in the
off-policy setting, but such an approach may suffer from a large estimation
error even if the function approximation class is sufficiently expressive.
Moreover, no previous work has formally established a convergence
guarantee to the ground-truth GVFs under function approximation settings.
In this paper, we address both issues through the lens of a class of GVFs with
causal filtering, which covers a wide range of RL applications such as reward
variance, value gradient, cost in anomaly detection, stationary distribution
gradient, etc. We propose a new algorithm called GenTD for off-policy GVF
evaluation and show that GenTD learns multiple interrelated multi-dimensional
GVFs as efficiently as a single canonical scalar value function. We further
show that unlike GTD, the GVFs learned by GenTD are guaranteed to converge to
the ground-truth GVFs as long as the function approximation power is
sufficiently large. To the best of our knowledge, GenTD is the first off-policy
GVF evaluation algorithm with a global optimality guarantee.
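Since this page carries only the abstract, the following is a minimal, hypothetical sketch of the baseline setting the paper addresses: off-policy evaluation of a single scalar value function with linear features and per-decision importance weighting. The feature map phi, the policies pi and mu, and the transition data are placeholder assumptions; this is not the paper's GenTD algorithm, which jointly evaluates a vector of interrelated, multi-dimensional GVFs.

```python
# Illustrative sketch only (not GenTD): off-policy TD(0) evaluation of a
# scalar value function with linear features and importance weighting.
# phi, pi, mu, and transitions are hypothetical placeholders.
import numpy as np

def off_policy_td0(transitions, phi, pi, mu, dim, gamma=0.99, alpha=0.05):
    """Estimate w with V(s) ~= phi(s) @ w for target policy pi,
    from samples (s, a, r, s') collected by behavior policy mu."""
    w = np.zeros(dim)
    for s, a, r, s_next in transitions:
        rho = pi(a, s) / mu(a, s)              # importance ratio pi(a|s)/mu(a|s)
        td_error = r + gamma * phi(s_next) @ w - phi(s) @ w
        w += alpha * rho * td_error * phi(s)   # semi-gradient TD(0) update
    return w
```

This plain semi-gradient update is the kind of scheme that can be unstable off-policy, which is why GTD-style corrections exist and why the abstract contrasts GenTD with GTD.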
Related papers
- Adaptive Exploration for Data-Efficient General Value Function Evaluations [40.156127789708265]
General Value Functions (GVFs) represent predictive knowledge in reinforcement learning.
GVFExplorer learns a single behavior policy that efficiently collects data for evaluating multiple GVFs in parallel.
arXiv Detail & Related papers (2024-05-13T15:24:27Z)
- Greedy based Value Representation for Optimal Coordination in Multi-agent Reinforcement Learning [64.05646120624287]
We derive the expression of the joint Q value function of LVD and MVD.
To ensure optimal consistency, the optimal node is required to be the unique STN.
Our method outperforms state-of-the-art baselines in experiments on various benchmarks.
arXiv Detail & Related papers (2022-11-22T08:14:50Z)
- Robust and Adaptive Temporal-Difference Learning Using An Ensemble of Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
arXiv Detail & Related papers (2021-12-01T23:15:09Z)
- A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems.
The paper provides a general sample complexity analysis of the "vanilla" PG method.
arXiv Detail & Related papers (2021-07-23T19:38:17Z)
- Affordance as general value function: A computational model [8.34897697233928]
General value functions (GVFs) are long-term predictive summaries of the outcomes of agents following specific policies in the environment.
We show that GVFs realize affordance prediction as a form of direct perception.
We demonstrate that GVFs provide the right framework for learning affordances in real-world applications.
arXiv Detail & Related papers (2020-10-27T13:42:58Z)
- When Will Generative Adversarial Imitation Learning Algorithms Attain Global Convergence [56.40794592158596]
We study generative adversarial imitation learning (GAIL) under general MDPs and for nonlinear reward function classes.
This is the first systematic theoretical study of GAIL for global convergence.
arXiv Detail & Related papers (2020-06-24T06:24:37Z)
- Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension [124.7752517531109]
We establish a provably efficient reinforcement learning algorithm with general value function approximation.
We show that our algorithm achieves a regret bound of $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$ where $d$ is a complexity measure.
Our theory generalizes recent progress on RL with linear value function approximation and does not make explicit assumptions on the model of the environment.
arXiv Detail & Related papers (2020-05-21T17:36:09Z)
- Conditional Deep Gaussian Processes: multi-fidelity kernel learning [6.599344783327053]
We propose the conditional DGP model in which the latent GPs are directly supported by the fixed lower-fidelity data.
Experiments with synthetic and high-dimensional data show performance comparable to other multi-fidelity regression methods.
We conclude that, with the low-fidelity data and the hierarchical DGP structure, the effective kernel encodes the inductive bias for the true function.
arXiv Detail & Related papers (2020-02-07T14:56:11Z)