Performance Bounds for Policy-Based Average Reward Reinforcement
Learning Algorithms
- URL: http://arxiv.org/abs/2302.01450v3
- Date: Tue, 27 Jun 2023 20:32:53 GMT
- Title: Performance Bounds for Policy-Based Average Reward Reinforcement
Learning Algorithms
- Authors: Yashaswini Murthy, Mehrdad Moharrami and R. Srikant
- Abstract summary: Many policy-based reinforcement learning (RL) algorithms can be viewed as instantiations of approximate policy iteration (PI).
In applications where the average reward objective is the meaningful performance metric, discounted reward formulations are often used with the discount factor being close to $1,$ which is equivalent to making the expected horizon very large.
However, the corresponding error bounds scale with the square of the horizon, so they become vacuous for the average reward objective. In this paper, we solve this open problem by obtaining the first finite-time error bounds for average-reward MDPs, and show that the error goes to zero in the limit as the policy evaluation and policy improvement errors go to zero.
- Score: 11.013390624382259
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many policy-based reinforcement learning (RL) algorithms can be viewed as
instantiations of approximate policy iteration (PI), i.e., where policy
improvement and policy evaluation are both performed approximately. In
applications where the average reward objective is the meaningful performance
metric, discounted reward formulations are often used with the discount factor
being close to $1,$ which is equivalent to making the expected horizon very
large. However, the corresponding theoretical bounds for error performance
scale with the square of the horizon. Thus, even after dividing the total
reward by the length of the horizon, the corresponding performance bounds for
average reward problems go to infinity. Therefore, an open problem has been to
obtain meaningful performance bounds for approximate PI and RL algorithms for
the average-reward setting. In this paper, we solve this open problem by
obtaining the first finite-time error bounds for average-reward MDPs, and show
that the asymptotic error goes to zero in the limit as policy evaluation and
policy improvement errors go to zero.
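To make the approximate-PI setting concrete, the following is a minimal tabular sketch under a unichain assumption: exact average-reward policy evaluation (solving for the gain $\rho_\pi$ and bias $h_\pi$), with synthetic noise added to the bias to mimic the policy-evaluation error, followed by greedy policy improvement. This is an illustrative sketch rather than the algorithm analyzed in the paper; the toy MDP, the noise model, and all function names are assumptions made here for illustration.

```python
import numpy as np

def evaluate_policy(P, r, pi):
    """Exact average-reward policy evaluation (unichain assumption).

    Solves  h(s) + rho = r_pi(s) + sum_s' P_pi(s, s') h(s')
    with the normalization h(0) = 0, for the gain rho and bias h.
    """
    S = P.shape[0]
    P_pi = P[np.arange(S), pi]              # (S, S) transitions under pi
    r_pi = r[np.arange(S), pi]              # (S,)  one-step rewards under pi
    A = np.zeros((S, S))                    # unknowns: h(1..S-1) and rho
    A[:, :S - 1] = (np.eye(S) - P_pi)[:, 1:]
    A[:, S - 1] = 1.0
    x = np.linalg.solve(A, r_pi)
    h = np.concatenate(([0.0], x[:S - 1]))
    return x[S - 1], h                      # gain, bias

def approximate_policy_iteration(P, r, iters=50, eval_noise=0.01, seed=0):
    """Approximate PI: noisy evaluation followed by greedy improvement.

    eval_noise plays the role of the policy-evaluation error; the greedy
    step is exact here, so the policy-improvement error is zero.
    """
    rng = np.random.default_rng(seed)
    S = r.shape[0]
    pi = np.zeros(S, dtype=int)
    gains = []
    for _ in range(iters):
        rho, h = evaluate_policy(P, r, pi)
        gains.append(rho)
        h_hat = h + eval_noise * rng.standard_normal(S)  # approximate bias
        q = r + P @ h_hat                                # (S, A) bias-adjusted values
        pi = np.argmax(q, axis=1)                        # greedy improvement
    return pi, gains

# Toy 3-state, 2-action unichain MDP (randomly generated, for illustration only)
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))   # P[s, a, :] is a distribution over next states
r = rng.uniform(size=(3, 2))
pi, gains = approximate_policy_iteration(P, r)
print("final policy:", pi, "gain of last iterates:", np.round(gains[-3:], 4))
```

The trajectory of gains produced by such a loop is the kind of quantity the paper's finite-time bounds control: the gap between the gain of the generated policies and the optimal gain, as a function of the evaluation and improvement errors.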
Related papers
- On the Global Convergence of Policy Gradient in Average Reward Markov
Decision Processes [50.68789924454235]
We present the first finite-time global convergence analysis of policy gradient in the context of average reward Markov decision processes (MDPs).
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right),$ which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations.
arXiv Detail & Related papers (2024-03-11T15:25:03Z)
- Learning General Parameterized Policies for Infinite Horizon Average Reward Constrained MDPs via Primal-Dual Policy Gradient Algorithm [34.593772931446125]
We propose a primal-dual-based policy gradient algorithm that adeptly manages the constraints while ensuring a low regret guarantee toward achieving a globally optimal policy.
Our proposed algorithm achieves $\tilde{\mathcal{O}}(T^{4/5})$ objective regret and $\tilde{\mathcal{O}}(T^{4/5})$ constraint violation bounds.
arXiv Detail & Related papers (2024-02-03T05:35:58Z)
- Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this is the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z)
- Off-Policy Average Reward Actor-Critic with Deterministic Policy Search [3.551625533648956]
We present both on-policy and off-policy deterministic policy gradient theorems for the average reward performance criterion.
We also present an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) algorithm.
We evaluate the average reward performance of the proposed ARO-DDPG and observe better empirical performance than state-of-the-art on-policy average reward actor-critic algorithms on MuJoCo-based environments. A minimal sketch of this average-reward actor-critic update structure is given after this list.
arXiv Detail & Related papers (2023-05-20T17:13:06Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Conservative Natural Policy Gradient Primal-Dual Algorithm [42.83837408373223]
We consider the problem of constrained Markov decision processes (CMDPs) in continuous state-action spaces.
We propose a novel Conservative Natural Policy Gradient Primal-Dual Algorithm (C-NPG-PD) to achieve zero constraint violation.
arXiv Detail & Related papers (2022-06-12T22:31:43Z)
- Optimal Estimation of Off-Policy Policy Gradient via Double Fitted Iteration [39.250754806600135]
Policy gradient (PG) estimation becomes a challenge when we are not allowed to sample with the target policy.
Conventional methods for off-policy PG estimation often suffer from significant bias or exponentially large variance.
In this paper, we propose the double Fitted PG estimation (FPG) algorithm.
arXiv Detail & Related papers (2022-01-31T20:23:52Z)
- Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first analysis of SRL algorithms with convergence guarantees to globally optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
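As a companion to the off-policy average-reward actor-critic entry above, here is a minimal sketch of an average-reward deterministic actor-critic update with linear function approximation: a running gain estimate, an average-reward TD error for the critic, and a deterministic policy gradient step for the actor. It illustrates the general update structure only and is not the ARO-DDPG algorithm from that paper; the toy environment, feature map, and step sizes are all assumptions made here.

```python
import numpy as np

def features(s, a):
    """Critic features: Q_w(s, a) = w . [s, a*s] (linear in a, so grad_a Q is easy)."""
    return np.concatenate([s, a * s])

def grad_a_q(w, s):
    """d/da of Q_w(s, a) for the features above (independent of a)."""
    d = s.shape[0]
    return np.dot(w[d:], s)

def avg_reward_dpg_step(theta, w, rho, s, env_step,
                        a_lr=1e-3, c_lr=1e-2, r_lr=1e-2, noise=0.1,
                        rng=np.random):
    """One update of a minimal average-reward deterministic actor-critic.

    theta: actor weights (deterministic policy mu(s) = theta . s, scalar action)
    w:     critic weights; rho: running average-reward (gain) estimate.
    Illustrative sketch only, not the ARO-DDPG algorithm.
    """
    a = float(np.clip(np.dot(theta, s), -2.0, 2.0)) + noise * rng.standard_normal()
    s_next, r = env_step(s, a)
    a_next = float(np.clip(np.dot(theta, s_next), -2.0, 2.0))
    # Average-reward TD error: r - rho + Q(s', mu(s')) - Q(s, a)
    delta = r - rho + np.dot(w, features(s_next, a_next)) - np.dot(w, features(s, a))
    rho = rho + r_lr * delta                          # gain estimate
    w = w + c_lr * delta * features(s, a)             # critic: semi-gradient TD
    theta = theta + a_lr * grad_a_q(w, s) * s         # actor: grad_theta mu * grad_a Q
    return theta, w, rho, s_next

# Hypothetical toy linear environment, for illustration only
def env_step(s, a, rng=np.random):
    s_next = 0.9 * s + np.array([0.1, 0.0, 0.05]) * a + 0.01 * rng.standard_normal(3)
    reward = -float(s @ s) - 0.01 * a * a
    return np.clip(s_next, -5.0, 5.0), reward

rng = np.random.default_rng(0)
theta, w, rho = np.zeros(3), np.zeros(6), 0.0
s = rng.standard_normal(3)
for _ in range(5000):
    theta, w, rho, s = avg_reward_dpg_step(theta, w, rho, s, env_step)
print("estimated gain rho:", round(rho, 4))
```

The same update structure appears, with different estimators and guarantees, across the primal-dual and actor-critic papers listed above; only the critic target and the policy-update direction change.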
This list is automatically generated from the titles and abstracts of the papers on this site.