Robust and Adaptive Temporal-Difference Learning Using An Ensemble of
Gaussian Processes
- URL: http://arxiv.org/abs/2112.00882v1
- Date: Wed, 1 Dec 2021 23:15:09 GMT
- Title: Robust and Adaptive Temporal-Difference Learning Using An Ensemble of
Gaussian Processes
- Authors: Qin Lu and Georgios B. Giannakis
- Abstract summary: The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
- Score: 70.80716221080118
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Value function approximation is a crucial module for policy evaluation in
reinforcement learning when the state space is large or continuous. The present
paper takes a generative perspective on policy evaluation via
temporal-difference (TD) learning, where a Gaussian process (GP) prior is
presumed on the sought value function, and instantaneous rewards are
probabilistically generated based on value function evaluations at two
consecutive states. Capitalizing on a random feature-based approximant of the
GP prior, an online scalable (OS) approach, termed OS-GPTD, is developed to
estimate the value function for a given policy by observing a sequence of
state-reward pairs. To benchmark the performance of OS-GPTD even in an
adversarial setting, where the modeling assumptions are violated, complementary
worst-case analyses are performed by upper-bounding the cumulative Bellman
error as well as the long-term reward prediction error, relative to their
counterparts from a fixed value function estimator with the entire state-reward
trajectory in hindsight. Moreover, to alleviate the limited expressiveness
associated with a single fixed kernel, a weighted ensemble (E) of GP priors is
employed to yield an alternative scheme, termed OS-EGPTD, that can jointly
infer the value function, and select interactively the EGP kernel on-the-fly.
Finally, performances of the novel OS-(E)GPTD schemes are evaluated on two
benchmark problems.
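For intuition, here is a minimal, hypothetical sketch of the random-feature construction behind an OS-GPTD-style estimator (illustrative only, not the authors' code): an RBF-kernel GP prior on the value function is approximated with random Fourier features, the GPTD observation model $r_t = v(s_t) - \gamma v(s_{t+1}) + n_t$ then becomes linear-Gaussian in the feature weights, and the posterior is updated online with a Kalman-style recursion whose per-step cost depends only on the number of features, not on the number of visited states.
```python
import numpy as np

# Hypothetical sketch of a random-feature GP-TD value estimator (not the paper's code).
class RFGPTD:
    def __init__(self, state_dim, n_features=100, lengthscale=1.0,
                 gamma=0.99, noise_var=0.1, prior_var=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Spectral samples and phases of the random Fourier features for an RBF kernel.
        self.W = rng.normal(scale=1.0 / lengthscale, size=(n_features, state_dim))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        self.gamma = gamma
        self.noise_var = noise_var
        # Gaussian posterior over feature weights: mean m and covariance P.
        self.m = np.zeros(n_features)
        self.P = prior_var * np.eye(n_features)

    def features(self, s):
        return np.sqrt(2.0 / len(self.b)) * np.cos(self.W @ np.asarray(s) + self.b)

    def value(self, s):
        return float(self.features(s) @ self.m)

    def update(self, s, r, s_next):
        # GPTD observation: r = v(s) - gamma * v(s') + noise, linear in the weights.
        h = self.features(s) - self.gamma * self.features(s_next)
        Ph = self.P @ h
        gain = Ph / (h @ Ph + self.noise_var)      # Kalman-style gain
        self.m = self.m + gain * (r - h @ self.m)
        self.P = self.P - np.outer(gain, Ph)
```
A weighted ensemble in the spirit of OS-EGPTD would maintain several such models with different kernels (e.g., different lengthscales) and reweight them online according to how well each predicts the incoming rewards.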
Related papers
- On the Global Convergence of Policy Gradient in Average Reward Markov
Decision Processes [50.68789924454235]
We present the first finite-time global convergence analysis of policy gradient in the context of average reward Markov decision processes (MDPs).
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations.
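The translation from the iterate rate to regret is a harmonic-sum bound: assuming the optimality gap after $t$ iterations scales as $C/t$ (with $C$ the constant hidden in the $O\left(\frac{1}{T}\right)$ rate), $\mathrm{Regret}(T) = \sum_{t=1}^{T}\bigl(J(\pi^\star) - J(\pi_t)\bigr) \le \sum_{t=1}^{T} \frac{C}{t} \le C\,(1 + \log T) = O(\log T)$.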
arXiv Detail & Related papers (2024-03-11T15:25:03Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Optimal Estimation of Off-Policy Policy Gradient via Double Fitted Iteration [39.250754806600135]
Policy gradient (PG) estimation becomes a challenge when we are not allowed to sample with the target policy.
Conventional methods for off-policy PG estimation often suffer from significant bias or exponentially large variance.
In this paper, we propose the double Fitted PG estimation (FPG) algorithm.
arXiv Detail & Related papers (2022-01-31T20:23:52Z)
- Incremental Ensemble Gaussian Processes [53.3291389385672]
We propose an incremental ensemble (IE-) GP framework, where an EGP meta-learner employs an ensemble of GP learners, each having a unique kernel belonging to a prescribed kernel dictionary.
With each GP expert leveraging the random feature-based approximation to perform online prediction and model update with scalability, the EGP meta-learner capitalizes on data-adaptive weights to synthesize the per-expert predictions.
The novel IE-GP is generalized to accommodate time-varying functions by modeling structured dynamics at the EGP meta-learner and within each GP learner.
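As a rough illustration of the data-adaptive weighting idea (hypothetical names and simplifications, not the paper's implementation): each GP expert reports a Gaussian predictive distribution for the newly observed target, the meta-learner reweights experts multiplicatively by their predictive likelihoods, and the synthesized prediction is the moment-matched mixture of the per-expert predictions.
```python
import numpy as np

# Hypothetical ensemble-of-GP-experts weighting sketch (simplified).
def gaussian_loglik(y, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)

def update_log_weights(log_w, y, pred_means, pred_vars):
    """Multiplicative (Bayes-rule-style) weight update from per-expert likelihoods."""
    log_w = log_w + np.array([gaussian_loglik(y, m, v)
                              for m, v in zip(pred_means, pred_vars)])
    log_w = log_w - np.max(log_w)          # stabilize before exponentiating
    w = np.exp(log_w)
    w = w / w.sum()
    return np.log(w), w

def ensemble_prediction(w, pred_means, pred_vars):
    """Moment-matched mean/variance of the weighted mixture of Gaussian predictions."""
    pred_means = np.asarray(pred_means)
    pred_vars = np.asarray(pred_vars)
    mean = float(np.dot(w, pred_means))
    var = float(np.dot(w, pred_vars + (pred_means - mean) ** 2))
    return mean, var
```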
arXiv Detail & Related papers (2021-10-13T15:11:25Z)
- A Unified Off-Policy Evaluation Approach for General Value Function [131.45028999325797]
The general value function (GVF) is a powerful tool to represent both predictive and retrospective knowledge in reinforcement learning (RL).
In this paper, we propose a new algorithm called GenTD for off-policy GVFs evaluation.
We show that GenTD learns multiple interrelated multi-dimensional GVFs as efficiently as a single canonical scalar value function.
arXiv Detail & Related papers (2021-07-06T16:20:34Z)
- Adversarial Robustness Guarantees for Gaussian Processes [22.403365399119107]
Gaussian processes (GPs) enable principled computation of model uncertainty, making them attractive for safety-critical applications.
We present a framework to analyse adversarial robustness of GPs, defined as invariance of the model's decision to bounded perturbations.
We develop a branch-and-bound scheme to refine the bounds and show, for any $\epsilon > 0$, that our algorithm is guaranteed to converge to values $\epsilon$-close to the actual values in finitely many iterations.
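The flavor of such an $\epsilon$-convergence guarantee can be seen in a generic branch-and-bound loop; the toy sketch below (a 1-D Lipschitz function, not the paper's GP-specific bound computation) keeps splitting the interval with the loosest lower bound until the certified gap between lower and upper bounds on the minimum drops below $\epsilon$.
```python
import heapq
import math

# Toy branch-and-bound (illustrative only): certify the minimum of a 1-D
# Lipschitz function to within eps by refining per-interval lower/upper bounds.
def branch_and_bound_min(f, lo, hi, lipschitz, eps=1e-3):
    def interval_bounds(a, b):
        mid = 0.5 * (a + b)
        fm = f(mid)
        # Lipschitz lower bound and trivial upper bound on min over [a, b].
        return fm - lipschitz * (b - a) / 2.0, fm

    lower, upper = interval_bounds(lo, hi)
    heap = [(lower, lo, hi)]              # min-heap keyed by interval lower bound
    best_upper = upper
    while heap and best_upper - heap[0][0] > eps:
        _, a, b = heapq.heappop(heap)
        mid = 0.5 * (a + b)
        for a2, b2 in ((a, mid), (mid, b)):
            l2, u2 = interval_bounds(a2, b2)
            best_upper = min(best_upper, u2)
            if l2 < best_upper:           # prune intervals that cannot improve
                heapq.heappush(heap, (l2, a2, b2))
    global_lower = min(heap[0][0], best_upper) if heap else best_upper
    return global_lower, best_upper       # eps-close certified bounds on the minimum

# Example: eps-close bounds on the minimum of sin over [0, 6] (Lipschitz constant 1).
print(branch_and_bound_min(math.sin, 0.0, 6.0, 1.0, eps=1e-4))
```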
arXiv Detail & Related papers (2021-04-07T15:14:56Z)
- Foresee then Evaluate: Decomposing Value Estimation with Latent Future Prediction [37.06232589005015]
The value function is the central notion of reinforcement learning (RL).
We propose Value Decomposition with Future Prediction (VDFP).
We analytically decompose the value function into a latent future dynamics part and a policy-independent trajectory return part, inducing a way to model latent dynamics and returns separately in value estimation.
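A minimal sketch of that separation (hypothetical modules and linear models chosen purely for brevity, not the authors' architecture): one model maps a state to a latent summary of the future trajectory observed under the policy, a second, policy-independent model maps latent summaries to returns, and the value estimate is their composition.
```python
import numpy as np

# Hypothetical two-part value estimator in the spirit of "latent future + return model".
def ridge_fit(X, Y, reg=1e-3):
    """Least-squares fit of Y ≈ X @ W with a small ridge term."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ Y)

def fit_decomposed_value(states, future_summaries, returns):
    # states:           (N, d_s) visited states
    # future_summaries: (N, d_z) latent summaries of the futures observed from them
    # returns:          (N,)     discounted returns of those futures
    W_dyn = ridge_fit(states, future_summaries)                        # policy-dependent part
    W_ret = ridge_fit(future_summaries, np.asarray(returns)[:, None])  # return-model part
    def value(s):
        z = np.asarray(s, dtype=float) @ W_dyn
        return (z @ W_ret).item()
    return value
```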
arXiv Detail & Related papers (2021-03-03T07:28:56Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
- Sequential Gaussian Processes for Online Learning of Nonstationary Functions [9.997259201098602]
We propose a sequential Monte Carlo algorithm to fit infinite mixtures of GPs that capture non-stationary behavior while allowing for online, distributed inference.
Our approach empirically improves performance over state-of-the-art methods for online GP estimation in the presence of non-stationarity in time-series data.
arXiv Detail & Related papers (2019-05-24T02:29:49Z)