The Reinforce Policy Gradient Algorithm Revisited
- URL: http://arxiv.org/abs/2310.05000v1
- Date: Sun, 8 Oct 2023 04:05:13 GMT
- Title: The Reinforce Policy Gradient Algorithm Revisited
- Authors: Shalabh Bhatnagar
- Abstract summary: We revisit the Reinforce policy gradient algorithm from the literature.
We propose a major enhancement to the basic algorithm.
We provide a proof of convergence for this new algorithm.
- Score: 7.894349646617293
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We revisit the Reinforce policy gradient algorithm from the literature. Note
that this algorithm typically works with cost returns obtained over random
length episodes obtained from either termination upon reaching a goal state (as
with episodic tasks) or from instants of visit to a prescribed recurrent state
(in the case of continuing tasks). We propose a major enhancement to the basic
algorithm. We estimate the policy gradient using a function measurement over a
perturbed parameter by appealing to a class of random search approaches. This
has advantages in the case of systems with infinite state and action spaces as
it relax some of the regularity requirements that would otherwise be needed for
proving convergence of the Reinforce algorithm. Nonetheless, we observe that
even though we estimate the gradient of the performance objective using the
performance objective itself (and not via the sample gradient), the algorithm
converges to a neighborhood of a local minimum. We also provide a proof of
convergence for this new algorithm.
Related papers
- A Cubic-regularized Policy Newton Algorithm for Reinforcement Learning [9.628032156001073]
We propose two policy Newton algorithms that incorporate cubic regularization.
Both algorithms employ the likelihood ratio method to form estimates of the gradient and Hessian of the value function.
In particular, the sample complexity of our algorithms to find an $epsilon$-SOSP is $O(epsilon-3.5)$, which is an improvement over the state-of-the-art sample complexity of $O(epsilon-4.5)$.
arXiv Detail & Related papers (2023-04-21T13:43:06Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Structural Estimation of Markov Decision Processes in High-Dimensional
State Space with Finite-Time Guarantees [39.287388288477096]
We consider the task of estimating a structural model of dynamic decisions by a human agent based upon the observable history of implemented actions and visited states.
This problem has an inherent nested structure: in the inner problem, an optimal policy for a given reward function is identified while in the outer problem, a measure of fit is maximized.
We propose a single-loop estimation algorithm with finite time guarantees that is equipped to deal with high-dimensional state spaces.
arXiv Detail & Related papers (2022-10-04T00:11:38Z) - Random-reshuffled SARAH does not need a full gradient computations [61.85897464405715]
The StochAstic Recursive grAdientritHm (SARAH) algorithm is a variance reduced variant of the Gradient Descent (SGD) algorithm.
In this paper, we remove the necessity of a full gradient.
The aggregated gradients serve as an estimate of a full gradient in the SARAH algorithm.
arXiv Detail & Related papers (2021-11-26T06:00:44Z) - Variance Penalized On-Policy and Off-Policy Actor-Critic [60.06593931848165]
We propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return.
Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.
arXiv Detail & Related papers (2021-02-03T10:06:16Z) - Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z) - An Asymptotically Optimal Primal-Dual Incremental Algorithm for
Contextual Linear Bandits [129.1029690825929]
We introduce a novel algorithm improving over the state-of-the-art along multiple dimensions.
We establish minimax optimality for any learning horizon in the special case of non-contextual linear bandits.
arXiv Detail & Related papers (2020-10-23T09:12:47Z) - Policy Gradient for Continuing Tasks in Non-stationary Markov Decision
Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
We compute unbiased navigation gradients of the value function which we use as ascent directions to update the policy.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
arXiv Detail & Related papers (2020-10-16T15:15:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.