Robust and Adaptive Temporal-Difference Learning Using An Ensemble of
Gaussian Processes
- URL: http://arxiv.org/abs/2112.00882v1
- Date: Wed, 1 Dec 2021 23:15:09 GMT
- Title: Robust and Adaptive Temporal-Difference Learning Using An Ensemble of
Gaussian Processes
- Authors: Qin Lu and Georgios B. Giannakis
- Abstract summary: The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
- Score: 70.80716221080118
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Value function approximation is a crucial module for policy evaluation in
reinforcement learning when the state space is large or continuous. The present
paper takes a generative perspective on policy evaluation via
temporal-difference (TD) learning, where a Gaussian process (GP) prior is
presumed on the sought value function, and instantaneous rewards are
probabilistically generated based on value function evaluations at two
consecutive states. Capitalizing on a random feature-based approximant of the
GP prior, an online scalable (OS) approach, termed OS-GPTD, is developed to
estimate the value function for a given policy by observing a sequence of
state-reward pairs. To benchmark the performance of OS-GPTD even in an
adversarial setting, where the modeling assumptions are violated, complementary
worst-case analyses are performed by upper-bounding the cumulative Bellman
error as well as the long-term reward prediction error, relative to their
counterparts from a fixed value function estimator with the entire state-reward
trajectory in hindsight. Moreover, to alleviate the limited expressiveness
associated with a single fixed kernel, a weighted ensemble (E) of GP priors is
employed to yield an alternative scheme, termed OS-EGPTD, that can jointly
infer the value function, and select interactively the EGP kernel on-the-fly.
Finally, performances of the novel OS-(E)GPTD schemes are evaluated on two
benchmark problems.
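For intuition, here is a minimal, hypothetical sketch of the random-feature construction behind an OS-GPTD-style estimator (illustrative only, not the authors' code): an RBF-kernel GP prior on the value function is approximated with random Fourier features, the GPTD observation model $r_t = v(s_t) - \gamma v(s_{t+1}) + n_t$ then becomes linear-Gaussian in the feature weights, and the posterior is updated online with a Kalman-style recursion whose per-step cost depends only on the number of features, not on the number of visited states.
```python
import numpy as np

# Hypothetical sketch of a random-feature GP-TD value estimator (not the paper's code).
class RFGPTD:
    def __init__(self, state_dim, n_features=100, lengthscale=1.0,
                 gamma=0.99, noise_var=0.1, prior_var=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Spectral samples and phases of the random Fourier features for an RBF kernel.
        self.W = rng.normal(scale=1.0 / lengthscale, size=(n_features, state_dim))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        self.gamma = gamma
        self.noise_var = noise_var
        # Gaussian posterior over feature weights: mean m and covariance P.
        self.m = np.zeros(n_features)
        self.P = prior_var * np.eye(n_features)

    def features(self, s):
        return np.sqrt(2.0 / len(self.b)) * np.cos(self.W @ np.asarray(s) + self.b)

    def value(self, s):
        return float(self.features(s) @ self.m)

    def update(self, s, r, s_next):
        # GPTD observation: r = v(s) - gamma * v(s') + noise, linear in the weights.
        h = self.features(s) - self.gamma * self.features(s_next)
        Ph = self.P @ h
        gain = Ph / (h @ Ph + self.noise_var)      # Kalman-style gain
        self.m = self.m + gain * (r - h @ self.m)
        self.P = self.P - np.outer(gain, Ph)
```
A weighted ensemble in the spirit of OS-EGPTD would maintain several such models with different kernels (e.g., different lengthscales) and reweight them online according to how well each predicts the incoming rewards.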
Related papers
- On the Global Convergence of Policy Gradient in Average Reward Markov
Decision Processes [50.68789924454235]
We present the first finite-time global convergence analysis of policy gradient in the context of average reward Markov decision processes (MDPs).
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations.
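The translation from the iterate rate to regret is a harmonic-sum bound: assuming the optimality gap after $t$ iterations scales as $C/t$ (with $C$ the constant hidden in the $O\left(\frac{1}{T}\right)$ rate), $\mathrm{Regret}(T) = \sum_{t=1}^{T}\bigl(J(\pi^\star) - J(\pi_t)\bigr) \le \sum_{t=1}^{T} \frac{C}{t} \le C\,(1 + \log T) = O(\log T)$.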
arXiv Detail & Related papers (2024-03-11T15:25:03Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Optimal Estimation of Off-Policy Policy Gradient via Double Fitted Iteration [39.250754806600135]
Policy gradient (PG) estimation becomes a challenge when we are not allowed to sample with the target policy.
Conventional methods for off-policy PG estimation often suffer from significant bias or exponentially large variance.
In this paper, we propose the double Fitted PG estimation (FPG) algorithm.
arXiv Detail & Related papers (2022-01-31T20:23:52Z)
- Incremental Ensemble Gaussian Processes [53.3291389385672]
We propose an incremental ensemble (IE-) GP framework, where an EGP meta-learner employs an ensemble of GP learners, each having a unique kernel belonging to a prescribed kernel dictionary.
With each GP expert leveraging the random feature-based approximation to perform online prediction and model update with scalability, the EGP meta-learner capitalizes on data-adaptive weights to synthesize the per-expert predictions.
The novel IE-GP is generalized to accommodate time-varying functions by modeling structured dynamics at the EGP meta-learner and within each GP learner.
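As a rough illustration of the data-adaptive weighting idea (hypothetical names and simplifications, not the paper's implementation): each GP expert reports a Gaussian predictive distribution for the newly observed target, the meta-learner reweights experts multiplicatively by their predictive likelihoods, and the synthesized prediction is the moment-matched mixture of the per-expert predictions.
```python
import numpy as np

# Hypothetical ensemble-of-GP-experts weighting sketch (simplified).
def gaussian_loglik(y, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)

def update_log_weights(log_w, y, pred_means, pred_vars):
    """Multiplicative (Bayes-rule-style) weight update from per-expert likelihoods."""
    log_w = log_w + np.array([gaussian_loglik(y, m, v)
                              for m, v in zip(pred_means, pred_vars)])
    log_w = log_w - np.max(log_w)          # stabilize before exponentiating
    w = np.exp(log_w)
    w = w / w.sum()
    return np.log(w), w

def ensemble_prediction(w, pred_means, pred_vars):
    """Moment-matched mean/variance of the weighted mixture of Gaussian predictions."""
    pred_means = np.asarray(pred_means)
    pred_vars = np.asarray(pred_vars)
    mean = float(np.dot(w, pred_means))
    var = float(np.dot(w, pred_vars + (pred_means - mean) ** 2))
    return mean, var
```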
arXiv Detail & Related papers (2021-10-13T15:11:25Z)
- A Unified Off-Policy Evaluation Approach for General Value Function [131.45028999325797]
The general value function (GVF) is a powerful tool to represent both predictive and retrospective knowledge in reinforcement learning (RL).
In this paper, we propose a new algorithm called GenTD for off-policy GVFs evaluation.
We show that GenTD learns multiple interrelated multi-dimensional GVFs as efficiently as a single canonical scalar value function.
arXiv Detail & Related papers (2021-07-06T16:20:34Z)
- Adversarial Robustness Guarantees for Gaussian Processes [22.403365399119107]
Gaussian processes (GPs) enable principled computation of model uncertainty, making them attractive for safety-critical applications.
We present a framework to analyse adversarial robustness of GPs, defined as invariance of the model's decision to bounded perturbations.
We develop a branch-and-bound scheme to refine the bounds and show, for any $\epsilon > 0$, that our algorithm is guaranteed to converge to values $\epsilon$-close to the actual values in finitely many iterations.
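The flavor of such an $\epsilon$-convergence guarantee can be seen in a generic branch-and-bound loop; the toy sketch below (a 1-D Lipschitz function, not the paper's GP-specific bound computation) keeps splitting the interval with the loosest lower bound until the certified gap between lower and upper bounds on the minimum drops below $\epsilon$.
```python
import heapq
import math

# Toy branch-and-bound (illustrative only): certify the minimum of a 1-D
# Lipschitz function to within eps by refining per-interval lower/upper bounds.
def branch_and_bound_min(f, lo, hi, lipschitz, eps=1e-3):
    def interval_bounds(a, b):
        mid = 0.5 * (a + b)
        fm = f(mid)
        # Lipschitz lower bound and trivial upper bound on min over [a, b].
        return fm - lipschitz * (b - a) / 2.0, fm

    lower, upper = interval_bounds(lo, hi)
    heap = [(lower, lo, hi)]              # min-heap keyed by interval lower bound
    best_upper = upper
    while heap and best_upper - heap[0][0] > eps:
        _, a, b = heapq.heappop(heap)
        mid = 0.5 * (a + b)
        for a2, b2 in ((a, mid), (mid, b)):
            l2, u2 = interval_bounds(a2, b2)
            best_upper = min(best_upper, u2)
            if l2 < best_upper:           # prune intervals that cannot improve
                heapq.heappush(heap, (l2, a2, b2))
    global_lower = min(heap[0][0], best_upper) if heap else best_upper
    return global_lower, best_upper       # eps-close certified bounds on the minimum

# Example: eps-close bounds on the minimum of sin over [0, 6] (Lipschitz constant 1).
print(branch_and_bound_min(math.sin, 0.0, 6.0, 1.0, eps=1e-4))
```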
arXiv Detail & Related papers (2021-04-07T15:14:56Z)
- Foresee then Evaluate: Decomposing Value Estimation with Latent Future Prediction [37.06232589005015]
The value function is the central notion of reinforcement learning (RL).
We propose Value Decomposition with Future Prediction (VDFP).
We analytically decompose the value function into a latent future dynamics part and a policy-independent trajectory return part, inducing a way to model latent dynamics and returns separately in value estimation.
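A minimal sketch of that separation (hypothetical modules and linear models chosen purely for brevity, not the authors' architecture): one model maps a state to a latent summary of the future trajectory observed under the policy, a second, policy-independent model maps latent summaries to returns, and the value estimate is their composition.
```python
import numpy as np

# Hypothetical two-part value estimator in the spirit of "latent future + return model".
def ridge_fit(X, Y, reg=1e-3):
    """Least-squares fit of Y ≈ X @ W with a small ridge term."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ Y)

def fit_decomposed_value(states, future_summaries, returns):
    # states:           (N, d_s) visited states
    # future_summaries: (N, d_z) latent summaries of the futures observed from them
    # returns:          (N,)     discounted returns of those futures
    W_dyn = ridge_fit(states, future_summaries)                        # policy-dependent part
    W_ret = ridge_fit(future_summaries, np.asarray(returns)[:, None])  # return-model part
    def value(s):
        z = np.asarray(s, dtype=float) @ W_dyn
        return (z @ W_ret).item()
    return value
```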
arXiv Detail & Related papers (2021-03-03T07:28:56Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
- Sequential Gaussian Processes for Online Learning of Nonstationary Functions [9.997259201098602]
We propose a sequential Monte Carlo algorithm to fit infinite mixtures of GPs that capture non-stationary behavior while allowing for online, distributed inference.
Our approach empirically improves performance over state-of-the-art methods for online GP estimation in the presence of non-stationarity in time-series data.
arXiv Detail & Related papers (2019-05-24T02:29:49Z)