Off-Policy Average Reward Actor-Critic with Deterministic Policy Search
- URL: http://arxiv.org/abs/2305.12239v2
- Date: Wed, 19 Jul 2023 05:32:04 GMT
- Title: Off-Policy Average Reward Actor-Critic with Deterministic Policy Search
- Authors: Naman Saxena, Subhojyoti Khastigir, Shishir Kolathaya, Shalabh
Bhatnagar
- Abstract summary: We present both on-policy and off-policy deterministic policy gradient theorems for the average reward performance criterion.
We also present an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) algorithm.
We compare the average reward performance of our proposed ARO-DDPG and observe better empirical performance compared to state-of-the-art on-policy average reward actor-critic algorithms over MuJoCo-based environments.
- Score: 3.551625533648956
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The average reward criterion is relatively less studied as most existing
works in the Reinforcement Learning literature consider the discounted reward
criterion. There are few recent works that present on-policy average reward
actor-critic algorithms, but average reward off-policy actor-critic is
relatively less explored. In this work, we present both on-policy and
off-policy deterministic policy gradient theorems for the average reward
performance criterion. Using these theorems, we also present an Average Reward
Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) Algorithm. We first
show asymptotic convergence analysis using the ODE-based method. Subsequently,
we provide a finite time analysis of the resulting stochastic approximation
scheme with linear function approximator and obtain an $\epsilon$-optimal
stationary policy with a sample complexity of $\Omega(\epsilon^{-2.5})$. We
compare the average reward performance of our proposed ARO-DDPG algorithm and
observe better empirical performance compared to state-of-the-art on-policy
average reward actor-critic algorithms over MuJoCo-based environments.
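For intuition about the off-policy deterministic update described in the abstract, below is a minimal, hypothetical PyTorch sketch of an ARO-DDPG-style step: the critic is regressed toward the undiscounted average-reward target r - rho + Q(s', mu(s')), the deterministic actor ascends Q(s, mu(s)), and a running estimate rho of the average reward is tracked from the TD error. All network sizes, step sizes, and names here are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Deterministic policy mu(s) with actions in [-1, 1]."""
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 256), nn.ReLU(),
                                 nn.Linear(256, a_dim), nn.Tanh())
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Differential action-value estimate Q(s, a) (no discounting)."""
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def aro_ddpg_style_step(actor, critic, actor_targ, critic_targ, rho, batch,
                        actor_opt, critic_opt, rho_lr=1e-3, tau=5e-3):
    """One hypothetical update on a replay batch (s, a, r, s'); returns new rho."""
    s, a, r, s2 = batch                        # r has shape (batch_size, 1)
    with torch.no_grad():
        # Average-reward TD target: r - rho + Q_targ(s', mu_targ(s'))
        target = r - rho + critic_targ(s2, actor_targ(s2))
        delta = target - critic(s, a)          # TD error, used to track rho
    critic_loss = F.mse_loss(critic(s, a), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Deterministic policy gradient: ascend Q(s, mu(s)) w.r.t. actor parameters
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    rho = rho + rho_lr * delta.mean().item()   # running average-reward estimate

    # Polyak averaging of the target networks
    with torch.no_grad():
        for p, pt in zip(list(actor.parameters()) + list(critic.parameters()),
                         list(actor_targ.parameters()) + list(critic_targ.parameters())):
            pt.mul_(1.0 - tau).add_(tau * p)
    return rho
```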
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - On the Global Convergence of Policy Gradient in Average Reward Markov
Decision Processes [50.68789924454235]
We present the first finite time global convergence analysis of policy gradient in the context of average reward Markov decision processes (MDPs).
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations.
arXiv Detail & Related papers (2024-03-11T15:25:03Z) - Performance Bounds for Policy-Based Average Reward Reinforcement
Learning Algorithms [11.013390624382259]
Many policy-based reinforcement learning (RL) algorithms can be viewed as instantiations of approximate policy iteration (PI).
In applications where the average reward objective is the meaningful performance metric, discounted reward formulations are often used with the discount factor being close to $1,$ which is equivalent to making the expected horizon very large.
In this paper, we solve this open problem by obtaining the first finite-time error bounds for average-reward MDPs, and show that the error goes to zero in the limit as policy evaluation and policy improvement errors go to zero.
arXiv Detail & Related papers (2023-02-02T22:37:47Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Offline RL Without Off-Policy Evaluation [49.11859771578969]
We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well.
This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark.
arXiv Detail & Related papers (2021-06-16T16:04:26Z) - On-Policy Deep Reinforcement Learning for the Average-Reward Criterion [9.343119070691735]
We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL).
In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJoCo environments.
arXiv Detail & Related papers (2021-06-14T12:12:09Z) - On the Convergence and Sample Efficiency of Variance-Reduced Policy
Gradient Method [38.34416337932712]
Policy gradient gives rise to a rich class of reinforcement learning (RL) methods, for example REINFORCE.
Yet the best known sample complexity result for such methods to find an $\epsilon$-optimal policy is $\mathcal{O}(\epsilon^{-3})$, which is suboptimal.
We study the fundamental convergence properties and sample efficiency of first-order policy optimization method.
arXiv Detail & Related papers (2021-02-17T07:06:19Z) - Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting; a simplified average-reward TD sketch appears after this list.
arXiv Detail & Related papers (2021-01-08T00:43:04Z) - Batch Policy Learning in Average Reward Markov Decision Processes [3.9023554886892438]
Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward.
We develop an optimization algorithm to compute the optimal policy in a parameterized policy class.
The performance of the estimated policy is measured by the difference between the optimal average reward in the policy class and the average reward of the estimated policy.
arXiv Detail & Related papers (2020-07-23T03:28:14Z) - Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis [102.29671176698373]
We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$ error under a generative model.
We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms.
arXiv Detail & Related papers (2020-03-16T17:15:28Z)
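As referenced in the average-reward off-policy policy evaluation entry above, the following is a minimal sketch of plain on-policy differential (average-reward) TD(0) with a linear value function, the simplest instance of the evaluation problem that Gradient-TD-style average-reward algorithms address. The function names, step sizes, and integer-state assumption are illustrative and not taken from any of the papers listed.

```python
import numpy as np

def differential_td0(env_step, features, n_features, n_steps,
                     alpha=0.05, beta=0.01, s0=0):
    """Linear differential (average-reward) TD(0) policy evaluation.

    env_step(s) should return (r, s_next) sampled under the evaluated policy;
    features(s) should return a length-n_features NumPy vector.
    """
    w = np.zeros(n_features)   # value weights: v(s) ~ w @ features(s)
    rho = 0.0                  # running estimate of the average reward
    s = s0
    for _ in range(n_steps):
        r, s_next = env_step(s)
        phi, phi_next = features(s), features(s_next)
        delta = r - rho + w @ phi_next - w @ phi   # differential TD error
        rho += beta * delta                        # update average-reward estimate
        w += alpha * delta * phi                   # semi-gradient weight update
        s = s_next
    return w, rho
```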
This list is automatically generated from the titles and abstracts of the papers on this site.