Theoretical Guarantees of Fictitious Discount Algorithms for Episodic
Reinforcement Learning and Global Convergence of Policy Gradient Methods
- URL: http://arxiv.org/abs/2109.06362v1
- Date: Mon, 13 Sep 2021 23:36:38 GMT
- Title: Theoretical Guarantees of Fictitious Discount Algorithms for Episodic
Reinforcement Learning and Global Convergence of Policy Gradient Methods
- Authors: Xin Guo, Anran Hu, Junzi Zhang
- Abstract summary: A common approach is to introduce a fictitious discount factor and use stationary policies for approximations.
This paper takes the first step towards analyzing these algorithms.
Non-asymptotic convergence guarantees are established for both algorithms.
- Score: 6.7546872379126155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When designing algorithms for finite-time-horizon episodic reinforcement
learning problems, a common approach is to introduce a fictitious discount
factor and use stationary policies for approximations. Empirically, it has been
shown that the fictitious discount factor helps reduce variance, and stationary
policies serve to save the per-iteration computational cost. Theoretically,
however, there is no existing work on convergence analysis for algorithms with
this fictitious discount recipe. This paper takes the first step towards
analyzing these algorithms. It focuses on two vanilla policy gradient (VPG)
variants: the first being a widely used variant with discounted advantage
estimations (DAE), the second with an additional fictitious discount factor in
the score functions of the policy gradient estimators. Non-asymptotic
convergence guarantees are established for both algorithms, and the additional
discount factor is shown to reduce the bias introduced in DAE and thus improve
the algorithm convergence asymptotically. A key ingredient of our analysis is
to connect three settings of Markov decision processes (MDPs): the
finite-time-horizon, the average reward and the discounted settings. To our
best knowledge, this is the first theoretical guarantee on fictitious discount
algorithms for the episodic reinforcement learning of finite-time-horizon MDPs,
which also leads to the (first) global convergence of policy gradient methods
for finite-time-horizon episodic reinforcement learning.
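To make the two analyzed estimators concrete, below is a minimal, hypothetical sketch (not the authors' code) of the fictitious discount recipe for vanilla policy gradient on a random tabular finite-horizon MDP with a stationary softmax policy. The names `gamma`, `extra_score_discount`, and the toy MDP are illustrative assumptions: the DAE variant weights the score at step t by the gamma-discounted reward-to-go, and the second variant additionally multiplies that score by gamma^t.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 5, 3, 10                            # states, actions, horizon (toy sizes)
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a] = distribution over next states
R = rng.random((S, A))                        # reward table R[s, a]

def softmax_policy(theta, s):
    # stationary softmax policy pi(. | s) parameterized by theta[s, :]
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def rollout(theta):
    s, traj = 0, []
    for _ in range(H):
        p = softmax_policy(theta, s)
        a = rng.choice(A, p=p)
        traj.append((s, a, R[s, a]))
        s = rng.choice(S, p=P[s, a])
    return traj

def vpg_gradient(theta, gamma, extra_score_discount=False, n_episodes=50):
    grad = np.zeros_like(theta)
    for _ in range(n_episodes):
        traj = rollout(theta)
        rewards = np.array([r for (_, _, r) in traj])
        for t, (s, a, _) in enumerate(traj):
            # fictitiously discounted reward-to-go from step t (DAE-style weight)
            g_t = sum(gamma ** k * rewards[t + k] for k in range(H - t))
            p = softmax_policy(theta, s)
            score = -p
            score[a] += 1.0                   # grad of log pi(a|s) w.r.t. theta[s, :]
            # variant 2 adds the extra gamma^t factor in front of the score
            weight = (gamma ** t) * g_t if extra_score_discount else g_t
            grad[s] += weight * score
    return grad / n_episodes

theta = np.zeros((S, A))
for _ in range(200):                          # plain gradient ascent on the surrogate
    theta += 0.1 * vpg_gradient(theta, gamma=0.95, extra_score_discount=True)
```

Switching `extra_score_discount` between False and True toggles between the DAE estimator and the doubly discounted estimator whose bias the paper shows to be smaller.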
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - On the Global Convergence of Policy Gradient in Average Reward Markov
Decision Processes [50.68789924454235]
We present the first finite time global convergence analysis of policy gradient in the context of average reward Markov decision processes (MDPs)
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations.
arXiv Detail & Related papers (2024-03-11T15:25:03Z) - Regret Analysis of Policy Gradient Algorithm for Infinite Horizon
Average Reward Markov Decision Processes [38.879933964474326]
We consider an infinite horizon average reward Markov Decision Process (MDP)
We propose a policy gradient-based algorithm and show its global convergence property.
We prove that the proposed algorithm has $\tilde{\mathcal{O}}(T^{3/4})$ regret.
arXiv Detail & Related papers (2023-09-05T03:22:46Z) - Momentum-Based Policy Gradient with Second-Order Information [40.51117836892182]
We propose a variance-reduced policy-gradient method, called SHARP, which incorporates second-order information into gradient descent.
Unlike most previous work, our proposed algorithm does not require importance sampling, which can compromise the benefit of variance reduction.
Our extensive experimental evaluations show the effectiveness of the proposed algorithm on various control tasks and its advantage over the state of the art in practice.
arXiv Detail & Related papers (2022-05-17T11:56:50Z) - Faster Algorithm and Sharper Analysis for Constrained Markov Decision
Process [56.55075925645864]
The problem of the constrained Markov decision process (CMDP) is investigated, where an agent aims to maximize the expected accumulated discounted reward subject to multiple constraints.
A new utilities-dual convex approach is proposed with novel integration of three ingredients: regularized policy, dual regularizer, and Nesterov's gradient descent dual.
This is the first demonstration that nonconcave CMDP problems can attain the $\mathcal{O}(1/\epsilon)$ complexity lower bound for optimization subject to convex constraints.
arXiv Detail & Related papers (2021-10-20T02:57:21Z) - A Two-Time-Scale Stochastic Optimization Framework with Applications in Control and Reinforcement Learning [13.908826484332282]
We study a new two-time-scale gradient method for solving optimization problems.
Our first contribution is to characterize the finite-time complexity of the proposed two-time-scale gradient algorithm.
We apply our framework to gradient-based policy evaluation algorithms in reinforcement learning.
arXiv Detail & Related papers (2021-09-29T23:15:23Z) - Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with an improved iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution.
We also design a novel family of adaptive methods with the same $O(1/\epsilon^4)$ complexity, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z) - Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, together with off-policy learning and function approximation (FA), results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z) - Smoothed functional-based gradient algorithms for off-policy reinforcement learning: A non-asymptotic viewpoint [8.087699764574788]
We propose two policy gradient algorithms for solving the problem of control in an off-policy reinforcement learning context.
Both algorithms incorporate a smoothed functional (SF) based gradient estimation scheme.
arXiv Detail & Related papers (2021-01-06T17:06:42Z) - Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)