A Cubic-regularized Policy Newton Algorithm for Reinforcement Learning
- URL: http://arxiv.org/abs/2304.10951v1
- Date: Fri, 21 Apr 2023 13:43:06 GMT
- Title: A Cubic-regularized Policy Newton Algorithm for Reinforcement Learning
- Authors: Mizhaan Prajit Maniyar, Akash Mondal, Prashanth L.A., Shalabh
Bhatnagar
- Abstract summary: We propose two policy Newton algorithms that incorporate cubic regularization.
Both algorithms employ the likelihood ratio method to form estimates of the gradient and Hessian of the value function.
In particular, the sample complexity of our algorithms to find an $\epsilon$-SOSP is $O(\epsilon^{-3.5})$, which is an improvement over the state-of-the-art sample complexity of $O(\epsilon^{-4.5})$.
- Score: 9.628032156001073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of control in the setting of reinforcement learning
(RL), where model information is not available. Policy gradient algorithms are
a popular solution approach for this problem and are usually shown to converge
to a stationary point of the value function. In this paper, we propose two
policy Newton algorithms that incorporate cubic regularization. Both algorithms
employ the likelihood ratio method to form estimates of the gradient and
Hessian of the value function using sample trajectories. The first algorithm
requires an exact solution of the cubic regularized problem in each iteration,
while the second algorithm employs an efficient gradient descent-based
approximation to the cubic regularized problem. We establish convergence of our
proposed algorithms to a second-order stationary point (SOSP) of the value
function, which results in the avoidance of traps in the form of saddle points.
In particular, the sample complexity of our algorithms to find an
$\epsilon$-SOSP is $O(\epsilon^{-3.5})$, which is an improvement over the
state-of-the-art sample complexity of $O(\epsilon^{-4.5})$.
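A minimal sketch of the update the abstract describes, assuming likelihood-ratio (REINFORCE-style) estimators for the gradient and Hessian and a gradient-descent inner solver for the cubic subproblem (mirroring the paper's second, inexact variant); all names, shapes, and hyperparameters below are illustrative assumptions, not the authors' code:

```python
# One cubic-regularized policy Newton update (illustrative sketch). We frame
# the problem as minimization, so returns should be read as negated.
import numpy as np

def lr_estimates(returns, scores, score_hessians):
    """Likelihood-ratio estimates from m trajectories:
    grad = E[R * g], hess = E[R * (g g^T + H_logp)], where
    g = grad log p(tau; theta) and H_logp = Hessian of log p(tau; theta)."""
    m = len(returns)
    grad = sum(R * g for R, g in zip(returns, scores)) / m
    hess = sum(R * (np.outer(g, g) + H)
               for R, g, H in zip(returns, scores, score_hessians)) / m
    return grad, 0.5 * (hess + hess.T)  # symmetrize against sampling noise

def cubic_subproblem_gd(grad, hess, M, n_steps=200, lr=1e-2):
    """Approximately minimize the cubic model
    m(s) = grad^T s + 0.5 s^T hess s + (M/6) ||s||^3 by gradient descent."""
    s = np.zeros_like(grad)
    for _ in range(n_steps):
        # Gradient of the cubic model; note d/ds (M/6)||s||^3 = (M/2)||s|| s.
        model_grad = grad + hess @ s + 0.5 * M * np.linalg.norm(s) * s
        s -= lr * model_grad
    return s

def policy_newton_step(theta, returns, scores, score_hessians, M=10.0):
    grad, hess = lr_estimates(returns, scores, score_hessians)
    return theta + cubic_subproblem_gd(grad, hess, M)
```

The cubic term $(M/6)\|s\|^3$ is what distinguishes this from a plain Newton step: it keeps the subproblem well-posed around saddle points, which is how the algorithms escape them and reach an SOSP.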
Related papers
- Stochastic Policy Gradient Methods: Improved Sample Complexity for
Fisher-non-degenerate Policies [19.779044926914704]
We develop improved global convergence guarantees for a general class of Fisher-non-degenerate parameterized policies.
In this work, we propose a Normalized Policy Gradient method with Implicit Gradient Transport (N-PG-IGT) and derive a $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$ sample complexity for this method.
We further improve this complexity to $\tilde{\mathcal{O}}(\varepsilon^{-2})$ by considering a Hessian-Aided Recursive Policy Gradient method.
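A rough sketch of what a normalized update with implicit gradient transport can look like, under the assumption that the momentum buffer averages stochastic gradients evaluated at an extrapolated point; the function names and step sizes are illustrative guesses, not the paper's algorithm verbatim:

```python
# One N-PG-IGT-style step (assumed form, for illustration only).
import numpy as np

def n_pg_igt_step(theta, theta_prev, d, grad_est, eta=0.1, alpha=0.01):
    """grad_est: callable returning a stochastic policy gradient at a point."""
    # Extrapolation point used by implicit gradient transport.
    z = theta + ((1.0 - eta) / eta) * (theta - theta_prev)
    d = (1.0 - eta) * d + eta * grad_est(z)  # momentum with transported gradient
    # Move along the *normalized* momentum direction (gradient ascent).
    theta_next = theta + alpha * d / (np.linalg.norm(d) + 1e-12)
    return theta_next, d
```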
arXiv Detail & Related papers (2023-02-03T13:50:23Z)
- Stochastic Dimension-reduced Second-order Methods for Policy Optimization [11.19708535159457]
We propose several new second-order algorithms for policy optimization that only require gradient and Hessian-vector product in each iteration.
Specifically, we propose a dimension-reduced second-order method (DR-SOPO) which repeatedly solves a projected two-dimensional trust region subproblem.
We show that DR-SOPO obtains an $\mathcal{O}(\epsilon^{-3.5})$ complexity for reaching an approximate first-order stationary condition.
In addition, we present an enhanced algorithm (DVR-SOPO) which further improves this complexity.
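A hedged sketch of a projected two-dimensional trust-region step of the kind the summary describes: restrict the quadratic model to the span of the gradient and one extra direction (obtained here via assumed Hessian-vector products), then solve the tiny 2-D problem by brute force; the subspace choice and inner solver are illustrative assumptions:

```python
# Projected 2-D trust-region subproblem (illustrative sketch).
import numpy as np

def two_dim_tr_step(grad, hvp, extra_dir, radius=0.1, grid=50):
    """hvp: callable v -> Hessian-vector product at the current point."""
    # Orthonormal basis for span{grad, extra_dir}.
    q1 = grad / (np.linalg.norm(grad) + 1e-12)
    v = extra_dir - (extra_dir @ q1) * q1
    q2 = v / (np.linalg.norm(v) + 1e-12)
    Q = np.stack([q1, q2], axis=1)                   # d x 2 basis
    g2 = Q.T @ grad                                  # reduced gradient (2,)
    H2 = Q.T @ np.stack([hvp(q1), hvp(q2)], axis=1)  # reduced Hessian (2x2)
    # Brute-force the 2-D trust-region problem over a polar grid.
    best, best_val = np.zeros(2), 0.0
    for r in np.linspace(0.0, radius, grid):
        for t in np.linspace(0.0, 2 * np.pi, grid, endpoint=False):
            s = r * np.array([np.cos(t), np.sin(t)])
            val = g2 @ s + 0.5 * s @ (H2 @ s)
            if val < best_val:
                best, best_val = s, val
    return Q @ best                                  # lift back to full space
```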
arXiv Detail & Related papers (2023-01-28T12:09:58Z)
- PROMPT: Parallel Iterative Algorithm for $\ell_{p}$ norm linear regression via Majorization Minimization with an application to semi-supervised graph learning [0.0]
We consider the problem of $\ell_p$ norm linear regression, which has several applications such as sparse recovery, data clustering, and semi-supervised learning.
We propose an iterative algorithm: Parallel IteRative AlgOrithM for $\ell_P$ norm regression via MajorizaTion Minimization (PROMPT).
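To illustrate the majorization-minimization principle PROMPT builds on, the sketch below runs a standard IRLS-style MM iteration for $\ell_p$ regression; PROMPT's actual parallel update differs, and the iteration count and safeguard below are arbitrary choices:

```python
# MM (IRLS-style) iteration for min_x ||A x - b||_p^p (illustrative sketch).
import numpy as np

def lp_regression_mm(A, b, p=1.5, n_iter=100, eps=1e-8):
    x = np.linalg.lstsq(A, b, rcond=None)[0]  # least-squares warm start
    for _ in range(n_iter):
        r = A @ x - b
        # A quadratic majorizer of |r|^p at the current residuals yields
        # weights |r|^(p-2); eps guards against division by zero.
        w = np.maximum(np.abs(r), eps) ** (p - 2.0)
        Aw = A * w[:, None]                   # row-scaled design matrix
        x = np.linalg.solve(A.T @ Aw, Aw.T @ b)  # weighted least squares
    return x
```

Each iteration only solves a weighted least-squares problem, which is what makes MM schemes of this kind easy to parallelize.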
arXiv Detail & Related papers (2021-10-23T10:19:11Z)
- An algorithmic view of $\ell_2$ regularization and some path-following algorithms [7.6146285961466]
We establish an equivalence between the $\ell_2$-regularized solution path for a convex loss function and the solution of an ordinary differential equation (ODE).
This equivalence reveals that the solution path can be viewed as the flow of a hybrid of gradient descent and Newton method applied to the empirical loss.
New path-following algorithms based on homotopy methods and numerical ODE solvers are proposed to numerically approximate the solution path.
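A hedged sketch of this idea for the ridge (quadratic-loss) special case: differentiating the stationarity condition $\nabla L(\beta) + \lambda \beta = 0$ in $\lambda$ gives the path ODE $(\nabla^2 L + \lambda I)\,\beta'(\lambda) = -\beta(\lambda)$, which an off-the-shelf ODE solver can integrate; the solver choice and endpoints here are illustrative:

```python
# Numerically follow the l2-regularized ridge path (illustrative sketch).
import numpy as np
from scipy.integrate import solve_ivp

def ridge_path(X, y, lam_hi=100.0, lam_lo=0.1):
    H = X.T @ X                                  # constant Hessian for ridge
    d = X.shape[1]
    # Exact solution at heavy regularization as the initial condition.
    beta0 = np.linalg.solve(H + lam_hi * np.eye(d), X.T @ y)

    def rhs(lam, beta):
        # beta'(lambda) = -(H + lambda I)^{-1} beta
        return np.linalg.solve(H + lam * np.eye(d), -beta)

    # Integrate from heavy to light regularization.
    sol = solve_ivp(rhs, (lam_hi, lam_lo), beta0, dense_output=True)
    return sol                                   # sol.sol(lam) ~ beta(lam)
```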
arXiv Detail & Related papers (2021-07-07T16:00:13Z)
- Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and function approximation (FA), results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
- Smoothed functional-based gradient algorithms for off-policy reinforcement learning: A non-asymptotic viewpoint [8.087699764574788]
We propose two policy gradient algorithms for solving the problem of control in an off-policy reinforcement learning context.
Both algorithms incorporate a smoothed functional (SF) based gradient estimation scheme.
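A minimal sketch of a one-sided smoothed-functional gradient estimate with Gaussian perturbations, which is the general flavor of SF schemes; the exact perturbation distribution and estimator form used in the paper may differ:

```python
# One-sided smoothed-functional gradient estimate (illustrative sketch).
import numpy as np

def sf_gradient(theta, value_est, delta=0.05, n_samples=32, rng=None):
    """value_est: callable returning a (noisy) value estimate at a point."""
    rng = rng or np.random.default_rng()
    base = value_est(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        u = rng.standard_normal(theta.shape)  # Gaussian smoothing direction
        # Finite-difference along the random direction, scaled by it.
        grad += (value_est(theta + delta * u) - base) / delta * u
    return grad / n_samples
```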
arXiv Detail & Related papers (2021-01-06T17:06:42Z)
- Sample Complexity Bounds for Two Timescale Value-based Reinforcement Learning Algorithms [65.09383385484007]
Two timescale stochastic approximation (SA) has been widely used in value-based reinforcement learning algorithms.
We study the non-asymptotic convergence rate of two timescale linear and nonlinear TDC and Greedy-GQ algorithms.
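For reference, the classical linear TDC update has exactly the two-timescale structure being analyzed: the main weights move with a small step size and the auxiliary correction weights with a larger one. A sketch (names and step sizes illustrative):

```python
# One linear TDC update (illustrative sketch of the two-timescale structure).
import numpy as np

def tdc_update(theta, w, phi, phi_next, reward, gamma=0.99,
               alpha=1e-3, beta=1e-2):
    delta = reward + gamma * theta @ phi_next - theta @ phi  # TD error
    # Slow timescale: main value-function weights.
    theta = theta + alpha * (delta * phi - gamma * (phi @ w) * phi_next)
    # Fast timescale: auxiliary weights tracking the expected TD error.
    w = w + beta * (delta - phi @ w) * phi
    return theta, w
```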
arXiv Detail & Related papers (2020-11-10T11:36:30Z)
- Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in a form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
- A Two-Timescale Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic [142.1492359556374]
Bilevel optimization is a class of problems which exhibit a two-level structure.
We propose a two-timescale approximation (TTSA) algorithm for tackling such a bilevel problem.
We show that a two-timescale natural actor-critic policy optimization algorithm can be viewed as a special case of our TTSA framework.
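A bare-bones sketch of the two-timescale pattern: the lower-level variable is updated with a larger step size so it tracks the inner solution, while the upper-level variable moves slowly along an approximate hypergradient; both callables are assumptions for illustration, not the paper's API:

```python
# Two-timescale SA loop for a bilevel problem (illustrative sketch).
def ttsa(x, y, grad_y_inner, hypergrad, n_iter=1000, alpha=1e-3, beta=1e-2):
    """grad_y_inner: lower-level gradient in y; hypergrad: approximate
    gradient of the upper-level objective in x."""
    for _ in range(n_iter):
        y = y - beta * grad_y_inner(x, y)   # fast timescale (lower level)
        x = x - alpha * hypergrad(x, y)     # slow timescale (upper level)
    return x, y
```

In the actor-critic analogy the paper draws, the critic plays the role of the fast lower-level variable and the actor the slow upper-level one.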
arXiv Detail & Related papers (2020-07-10T05:20:02Z)
- Private Stochastic Convex Optimization: Optimal Rates in Linear Time [74.47681868973598]
We study the problem of minimizing the population loss given i.i.d. samples from a distribution over convex loss functions.
A recent work of Bassily et al. has established the optimal bound on the excess population loss achievable given $n$ samples.
We describe two new techniques for deriving convex optimization algorithms, both achieving the optimal bound on excess loss and using $O(\min\{n, n^2/d\})$ gradient computations.
arXiv Detail & Related papers (2020-05-10T19:52:03Z)
- Second-order Conditional Gradient Sliding [79.66739383117232]
We present the Second-Order Conditional Gradient Sliding (SOCGS) algorithm.
The SOCGS algorithm converges quadratically in primal gap after a finite number of linearly convergent iterations.
It is useful when the feasible region can only be accessed efficiently through a linear optimization oracle.
arXiv Detail & Related papers (2020-02-20T17:52:18Z)