A Generalized Projected Bellman Error for Off-policy Value Estimation in Reinforcement Learning
- URL: http://arxiv.org/abs/2104.13844v1
- Date: Wed, 28 Apr 2021 15:50:34 GMT
- Title: A Generalized Projected Bellman Error for Off-policy Value Estimation in Reinforcement Learning
- Authors: Andrew Patterson, Adam White, Sina Ghiassian, Martha White
- Abstract summary: We introduce a new generalized PBE that extends the linear PBE to the nonlinear setting.
We derive an easy-to-use, but sound, algorithm to minimize the generalized objective; the algorithm is more stable across runs.
- Score: 37.6353054242066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many reinforcement learning algorithms rely on value estimation. However, the
most widely used algorithms -- namely temporal difference algorithms -- can
diverge under both off-policy sampling and nonlinear function approximation.
Many algorithms have been developed for off-policy value estimation which are
sound under linear function approximation, based on the linear mean-squared
projected Bellman error (PBE). Extending these methods to the nonlinear case
has been largely unsuccessful. Recently, several methods have been introduced
that approximate a different objective, called the mean-squared Bellman error
(BE), which naturally facilitates nonlinear approximation. In this work, we
build on these insights and introduce a new generalized PBE that extends the
linear PBE to the nonlinear setting. We show how this generalized objective
unifies previous work, including previous theory, and obtain new bounds for the
value error of the solutions of the generalized objective. We derive an
easy-to-use, but sound, algorithm to minimize the generalized objective; the
algorithm is more stable across runs, is less sensitive to hyperparameters, and performs
favorably across four control domains with neural network function
approximation.
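For reference, a standard way to write the linear PBE that this work generalizes. This is a sketch in assumed notation (Pi_d the projection onto the span of the features under state weighting d, v_w = Xw the linear value estimate, x the feature vector, delta the TD error), not notation taken from the paper:

```latex
% Linear mean-squared projected Bellman error (standard form; notation assumed).
\overline{\mathrm{PBE}}(w)
  = \left\lVert \Pi_d \left( T^{\pi} v_w - v_w \right) \right\rVert_d^{2}
  = \mathbb{E}\!\left[ \delta\, x \right]^{\top}
    \mathbb{E}\!\left[ x x^{\top} \right]^{-1}
    \mathbb{E}\!\left[ \delta\, x \right]
```

Roughly, the generalized objective replaces the projection onto the linear span of the features with a projection onto a more general function class.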
Related papers
- Computationally Efficient RL under Linear Bellman Completeness for Deterministic Dynamics [39.07258580928359]
We study computationally and statistically efficient Reinforcement Learning algorithms for the linear Bellman Complete setting.
This setting uses linear function approximation to capture value functions and unifies existing models like linear Markov Decision Processes (MDPs) and Linear Quadratic Regulators (LQRs).
Our work provides a computationally efficient algorithm for the linear Bellman complete setting that works for MDPs with large action spaces, random initial states, and random rewards but relies on the underlying dynamics to be deterministic.
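As context, a common statement of the linear Bellman completeness condition, sketched in assumed notation (phi the state-action feature map, P the transition kernel, r the reward function):

```latex
% Linear Bellman completeness (standard definition; notation assumed):
% for every weight vector \theta there exists some \theta' such that, for all (s, a),
\phi(s, a)^{\top} \theta'
  = r(s, a)
  + \mathbb{E}_{s' \sim P(\cdot \mid s, a)}
      \left[ \max_{a'} \phi(s', a')^{\top} \theta \right]
```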
arXiv Detail & Related papers (2024-06-17T17:52:38Z)
- Linear Bellman Completeness Suffices for Efficient Online Reinforcement Learning with Few Actions [29.69428894587431]
It is assumed that linear Bellman completeness holds, which ensures that these regression problems are well-specified.
We give the first computationally efficient algorithm for RL under linear Bellman completeness when the number of actions is any constant.
arXiv Detail & Related papers (2024-06-17T15:24:49Z)
- Regularized Q-Learning with Linear Function Approximation [3.10770247120758]
We consider a single-loop algorithm for minimizing the projected Bellman error with finite-time convergence guarantees.
We show that, under certain assumptions, the proposed algorithm converges to a stationary point in the presence of Markovian noise.
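As a rough illustration, a minimal sketch of what a regularized Q-learning step with linear function approximation can look like; the function name, the l2 regularizer, and the semi-gradient form below are assumptions for illustration, not the paper's exact single-loop algorithm:

```python
import numpy as np

def regularized_q_update(theta, phi_sa, reward, phi_next_all,
                         alpha=0.1, gamma=0.99, lam=0.01):
    """One step of a hypothetical l2-regularized Q-learning update
    with linear function approximation (illustrative sketch only).

    theta:        (d,) weights of the linear Q-function
    phi_sa:       (d,) features of the current state-action pair
    phi_next_all: (num_actions, d) features of every action at the next state
    lam:          strength of the assumed l2 regularizer
    """
    q_sa = phi_sa @ theta                 # current Q estimate
    q_next = phi_next_all @ theta         # Q-values of all next actions
    td_error = reward + gamma * np.max(q_next) - q_sa
    # Semi-gradient step on the TD error, shrunk toward zero by the regularizer.
    return theta + alpha * (td_error * phi_sa - lam * theta)
```

The regularizer biases the fixed point but, in analyses of this kind, is what makes convergence guarantees under Markovian noise tractable.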
arXiv Detail & Related papers (2024-01-26T20:45:40Z)
- A Stochastic Bundle Method for Interpolating Networks [18.313879914379008]
We propose a novel method for training deep neural networks that are capable of driving the empirical loss to zero.
At each iteration, our method constructs a maximum of linear approximations of the learning objective, known as a bundle.
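In standard bundle-method notation (assumed here, not taken from the paper), the bundle is the pointwise maximum of linearizations of the objective f at past iterates w_i:

```latex
% Bundle model of the objective f (standard form; notation assumed):
% the pointwise maximum of first-order approximations at past iterates w_i.
f_{\mathcal{B}}(w)
  = \max_{i \in \mathcal{B}}
    \left\{ f(w_i) + \nabla f(w_i)^{\top} (w - w_i) \right\}
```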
arXiv Detail & Related papers (2022-01-29T23:02:30Z)
- Learning to Estimate Without Bias [57.82628598276623]
The Gauss-Markov theorem states that the weighted least squares estimator is a linear minimum variance unbiased estimator (MVUE) in linear models.
In this paper, we take a first step towards extending this result to nonlinear settings via deep learning with bias constraints, yielding a bias-constrained estimator (BCE).
A second motivation for the BCE is in applications where multiple estimates of the same unknown are averaged for improved performance.
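One plausible way to write such a bias-constrained objective, as a sketch only (the squared-bias penalty and the weight lambda are assumptions, not the paper's exact loss):

```latex
% Sketch of a bias-constrained estimation objective (assumed form):
% mean-squared error plus a penalty on the squared conditional bias,
% with \hat{y}_{\theta}(x) a neural-network estimator and \lambda > 0.
\min_{\theta}\;
  \mathbb{E}\!\left[ \lVert \hat{y}_{\theta}(x) - y \rVert^{2} \right]
  + \lambda\, \mathbb{E}\!\left[ \big\lVert
      \mathbb{E}\!\left[ \hat{y}_{\theta}(x) \mid y \right] - y
    \big\rVert^{2} \right]
```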
arXiv Detail & Related papers (2021-10-24T10:23:51Z)
- Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, together with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
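For context, value estimation in the average-reward setting is typically built on the differential TD error (a standard form; notation assumed, with the bar-R term an estimate of the average reward):

```latex
% Differential TD error for average-reward value estimation
% (standard form; notation assumed).
\delta_t = R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}) - \hat{v}(S_t)
```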
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
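For comparison, the squared Bellman error as it is commonly implemented in deep RL, which the proposed convex loss is positioned against (a standard form; notation assumed):

```latex
% Squared Bellman error as commonly used for control
% (standard form; notation assumed).
\mathbb{E}_{(s, a, r, s')}\!\left[
  \left( r + \gamma \max_{a'} Q_{\theta}(s', a') - Q_{\theta}(s, a) \right)^{2}
\right]
```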
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
The proposed algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Approximation Algorithms for Sparse Principal Component Analysis [57.5357874512594]
Principal component analysis (PCA) is a widely used dimension reduction technique in machine learning and statistics.
Various approaches to obtain sparse principal direction loadings have been proposed, which are termed Sparse Principal Component Analysis.
We present thresholding as a provably accurate, polynomial-time approximation algorithm for the SPCA problem.
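As a rough illustration of the thresholding idea, a minimal sketch (the function name and the exact truncation rule are assumptions; the paper's algorithm and its approximation guarantees differ in detail):

```python
import numpy as np

def sparse_pc_by_thresholding(cov, k):
    """Sketch of a thresholding heuristic for sparse PCA.

    cov: (d, d) symmetric covariance matrix
    k:   desired number of nonzero loadings
    """
    _, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    v = eigvecs[:, -1].copy()            # leading (dense) eigenvector
    small = np.argsort(np.abs(v))[:-k]   # indices of the d-k smallest loadings
    v[small] = 0.0                       # threshold them to zero
    return v / np.linalg.norm(v)         # unit-norm sparse direction

# Example usage on a sample covariance (illustrative only):
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
print(sparse_pc_by_thresholding(X.T @ X / 100, k=3))
```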
arXiv Detail & Related papers (2020-06-23T04:25:36Z)
- The estimation error of general first order methods [12.472245917779754]
We consider two families of estimation problems: high-dimensional regression and low-dimensional matrix estimation.
We derive lower bounds on the error that hold in the high-dimensional asymptotics in which both the number of observations and the number of parameters diverge.
These lower bounds are optimal in the sense that there exist algorithms whose estimation error matches the lower bounds up to asymptotically negligible terms.
arXiv Detail & Related papers (2020-02-28T18:13:47Z)