Performance Bounds for Policy-Based Average Reward Reinforcement
Learning Algorithms
- URL: http://arxiv.org/abs/2302.01450v3
- Date: Tue, 27 Jun 2023 20:32:53 GMT
- Title: Performance Bounds for Policy-Based Average Reward Reinforcement
Learning Algorithms
- Authors: Yashaswini Murthy, Mehrdad Moharrami and R. Srikant
- Abstract summary: Many policy-based reinforcement learning (RL) algorithms can be viewed as instantiations of approximate policy iteration (PI).
In applications where the average reward objective is the meaningful performance metric, discounted reward formulations are often used with the discount factor being close to $1,$ which is equivalent to making the expected horizon very large.
However, the corresponding error bounds scale with the square of the horizon, so they become vacuous for the average reward objective. In this paper, we solve this open problem by obtaining the first finite-time error bounds for average-reward MDPs, and show that the error goes to zero in the limit as the policy evaluation and policy improvement errors go to zero.
- Score: 11.013390624382259
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many policy-based reinforcement learning (RL) algorithms can be viewed as
instantiations of approximate policy iteration (PI), i.e., where policy
improvement and policy evaluation are both performed approximately. In
applications where the average reward objective is the meaningful performance
metric, discounted reward formulations are often used with the discount factor
being close to $1,$ which is equivalent to making the expected horizon very
large. However, the corresponding theoretical bounds for error performance
scale with the square of the horizon. Thus, even after dividing the total
reward by the length of the horizon, the corresponding performance bounds for
average reward problems go to infinity. Therefore, an open problem has been to
obtain meaningful performance bounds for approximate PI and RL algorithms for
the average-reward setting. In this paper, we solve this open problem by
obtaining the first finite-time error bounds for average-reward MDPs, and show
that the asymptotic error goes to zero in the limit as policy evaluation and
policy improvement errors go to zero.
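To make the approximate-PI setting concrete, the following is a minimal tabular sketch under a unichain assumption: exact average-reward policy evaluation (solving for the gain $\rho_\pi$ and bias $h_\pi$), with synthetic noise added to the bias to mimic the policy-evaluation error, followed by greedy policy improvement. This is an illustrative sketch rather than the algorithm analyzed in the paper; the toy MDP, the noise model, and all function names are assumptions made here for illustration.

```python
import numpy as np

def evaluate_policy(P, r, pi):
    """Exact average-reward policy evaluation (unichain assumption).

    Solves  h(s) + rho = r_pi(s) + sum_s' P_pi(s, s') h(s')
    with the normalization h(0) = 0, for the gain rho and bias h.
    """
    S = P.shape[0]
    P_pi = P[np.arange(S), pi]              # (S, S) transitions under pi
    r_pi = r[np.arange(S), pi]              # (S,)  one-step rewards under pi
    A = np.zeros((S, S))                    # unknowns: h(1..S-1) and rho
    A[:, :S - 1] = (np.eye(S) - P_pi)[:, 1:]
    A[:, S - 1] = 1.0
    x = np.linalg.solve(A, r_pi)
    h = np.concatenate(([0.0], x[:S - 1]))
    return x[S - 1], h                      # gain, bias

def approximate_policy_iteration(P, r, iters=50, eval_noise=0.01, seed=0):
    """Approximate PI: noisy evaluation followed by greedy improvement.

    eval_noise plays the role of the policy-evaluation error; the greedy
    step is exact here, so the policy-improvement error is zero.
    """
    rng = np.random.default_rng(seed)
    S = r.shape[0]
    pi = np.zeros(S, dtype=int)
    gains = []
    for _ in range(iters):
        rho, h = evaluate_policy(P, r, pi)
        gains.append(rho)
        h_hat = h + eval_noise * rng.standard_normal(S)  # approximate bias
        q = r + P @ h_hat                                # (S, A) bias-adjusted values
        pi = np.argmax(q, axis=1)                        # greedy improvement
    return pi, gains

# Toy 3-state, 2-action unichain MDP (randomly generated, for illustration only)
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))   # P[s, a, :] is a distribution over next states
r = rng.uniform(size=(3, 2))
pi, gains = approximate_policy_iteration(P, r)
print("final policy:", pi, "gain of last iterates:", np.round(gains[-3:], 4))
```

The trajectory of gains produced by such a loop is the kind of quantity the paper's finite-time bounds control: the gap between the gain of the generated policies and the optimal gain, as a function of the evaluation and improvement errors.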
Related papers
- On the Global Convergence of Policy Gradient in Average Reward Markov
Decision Processes [50.68789924454235]
We present the first finite-time global convergence analysis of policy gradient in the context of average reward Markov decision processes (MDPs).
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right),$ which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations.
arXiv Detail & Related papers (2024-03-11T15:25:03Z)
- Learning General Parameterized Policies for Infinite Horizon Average Reward Constrained MDPs via Primal-Dual Policy Gradient Algorithm [34.593772931446125]
We propose a primal-dual-based policy gradient algorithm that adeptly manages the constraints while ensuring a low regret guarantee toward achieving a globally optimal policy.
Our proposed algorithm achieves $\tilde{\mathcal{O}}(T^{4/5})$ objective regret and $\tilde{\mathcal{O}}(T^{4/5})$ constraint violation bounds.
arXiv Detail & Related papers (2024-02-03T05:35:58Z)
- Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this is the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z)
- Off-Policy Average Reward Actor-Critic with Deterministic Policy Search [3.551625533648956]
We present both on-policy and off-policy deterministic policy gradient theorems for the average reward performance criterion.
We also present an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) algorithm.
We evaluate the average reward performance of the proposed ARO-DDPG and observe better empirical performance than state-of-the-art on-policy average reward actor-critic algorithms on MuJoCo-based environments. A minimal sketch of this average-reward actor-critic update structure is given after this list.
arXiv Detail & Related papers (2023-05-20T17:13:06Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Conservative Natural Policy Gradient Primal-Dual Algorithm [42.83837408373223]
We consider the problem of constrained Markov decision processes (CMDPs) in continuous state-action spaces.
We propose a novel Conservative Natural Policy Gradient Primal-Dual Algorithm (C-NPG-PD) to achieve zero constraint violation.
arXiv Detail & Related papers (2022-06-12T22:31:43Z)
- Optimal Estimation of Off-Policy Policy Gradient via Double Fitted Iteration [39.250754806600135]
Policy gradient (PG) estimation becomes a challenge when we are not allowed to sample with the target policy.
Conventional methods for off-policy PG estimation often suffer from significant bias or exponentially large variance.
In this paper, we propose the double Fitted PG estimation (FPG) algorithm.
arXiv Detail & Related papers (2022-01-31T20:23:52Z)
- Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first analysis of SRL algorithms with convergence guarantees to globally optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
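As a companion to the off-policy average-reward actor-critic entry above, here is a minimal sketch of an average-reward deterministic actor-critic update with linear function approximation: a running gain estimate, an average-reward TD error for the critic, and a deterministic policy gradient step for the actor. It illustrates the general update structure only and is not the ARO-DDPG algorithm from that paper; the toy environment, feature map, and step sizes are all assumptions made here.

```python
import numpy as np

def features(s, a):
    """Critic features: Q_w(s, a) = w . [s, a*s] (linear in a, so grad_a Q is easy)."""
    return np.concatenate([s, a * s])

def grad_a_q(w, s):
    """d/da of Q_w(s, a) for the features above (independent of a)."""
    d = s.shape[0]
    return np.dot(w[d:], s)

def avg_reward_dpg_step(theta, w, rho, s, env_step,
                        a_lr=1e-3, c_lr=1e-2, r_lr=1e-2, noise=0.1,
                        rng=np.random):
    """One update of a minimal average-reward deterministic actor-critic.

    theta: actor weights (deterministic policy mu(s) = theta . s, scalar action)
    w:     critic weights; rho: running average-reward (gain) estimate.
    Illustrative sketch only, not the ARO-DDPG algorithm.
    """
    a = float(np.clip(np.dot(theta, s), -2.0, 2.0)) + noise * rng.standard_normal()
    s_next, r = env_step(s, a)
    a_next = float(np.clip(np.dot(theta, s_next), -2.0, 2.0))
    # Average-reward TD error: r - rho + Q(s', mu(s')) - Q(s, a)
    delta = r - rho + np.dot(w, features(s_next, a_next)) - np.dot(w, features(s, a))
    rho = rho + r_lr * delta                          # gain estimate
    w = w + c_lr * delta * features(s, a)             # critic: semi-gradient TD
    theta = theta + a_lr * grad_a_q(w, s) * s         # actor: grad_theta mu * grad_a Q
    return theta, w, rho, s_next

# Hypothetical toy linear environment, for illustration only
def env_step(s, a, rng=np.random):
    s_next = 0.9 * s + np.array([0.1, 0.0, 0.05]) * a + 0.01 * rng.standard_normal(3)
    reward = -float(s @ s) - 0.01 * a * a
    return np.clip(s_next, -5.0, 5.0), reward

rng = np.random.default_rng(0)
theta, w, rho = np.zeros(3), np.zeros(6), 0.0
s = rng.standard_normal(3)
for _ in range(5000):
    theta, w, rho, s = avg_reward_dpg_step(theta, w, rho, s, env_step)
print("estimated gain rho:", round(rho, 4))
```

The same update structure appears, with different estimators and guarantees, across the primal-dual and actor-critic papers listed above; only the critic target and the policy-update direction change.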
This list is automatically generated from the titles and abstracts of the papers on this site.