On-Policy Deep Reinforcement Learning for the Average-Reward Criterion
- URL: http://arxiv.org/abs/2106.07329v1
- Date: Mon, 14 Jun 2021 12:12:09 GMT
- Title: On-Policy Deep Reinforcement Learning for the Average-Reward Criterion
- Authors: Yiming Zhang, Keith W. Ross
- Abstract summary: We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL).
In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJoCo environments.
- Score: 9.343119070691735
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We develop theory and algorithms for average-reward on-policy Reinforcement
Learning (RL). We first consider bounding the difference of the long-term
average reward for two policies. We show that previous work based on the
discounted return (Schulman et al., 2015; Achiam et al., 2017) results in a
non-meaningful bound in the average-reward setting. By addressing the
average-reward criterion directly, we then derive a novel bound which depends
on the average divergence between the two policies and Kemeny's constant. Based
on this bound, we develop an iterative procedure which produces a sequence of
monotonically improved policies for the average reward criterion. This
iterative procedure can then be combined with classic DRL (Deep Reinforcement
Learning) methods, resulting in practical DRL algorithms that target the
long-run average reward criterion. In particular, we demonstrate that
Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the
average-reward criterion, significantly outperforms TRPO in the most
challenging MuJoCo environments.
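As a rough illustration of how the average-reward criterion changes the on-policy machinery, the sketch below computes GAE-style advantages with no discount factor, instead subtracting an estimate of the long-run average reward from each reward. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the lambda parameter, and the use of the batch mean as the average-reward estimate rho_hat are illustrative.

```python
# Minimal sketch (not the authors' code): average-reward advantage estimation
# of the kind an ATRPO-style update could consume. rho_hat, lam, and the batch
# layout are illustrative assumptions.
import numpy as np

def average_reward_advantages(rewards, values, rho_hat, lam=0.95):
    """GAE-style advantages without discounting.

    Each TD error subtracts an estimate of the long-run average reward:
    delta_t = r_t - rho_hat + V(s_{t+1}) - V(s_t), where V approximates the
    differential (bias) value function of the average-reward setting.
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] - rho_hat + next_v - values[t]
        running = delta + lam * running  # lambda-weighted sum of TD errors, no gamma
        adv[t] = running
    return adv

# Toy usage: a short trajectory with a crude batch-mean estimate of rho.
rewards = np.array([1.0, 0.5, 1.2, 0.8])
values = np.array([2.0, 1.8, 2.1, 1.9])
print(average_reward_advantages(rewards, values, rho_hat=rewards.mean()))
```

The resulting advantages would then feed a TRPO-style constrained update; consistent with the bound above, the constraint is naturally placed on the average divergence between successive policies.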
Related papers
- WARP: On the Benefits of Weight Averaged Rewarded Policies [66.95013068137115]
We introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP).
WARP merges policies in the weight space at three distinct stages.
Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
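For intuition about merging in the weight space, the snippet below linearly interpolates the parameters of two policies with identical architectures. It is a generic illustration, not WARP itself, and does not reproduce the paper's three-stage procedure; the helper name merge_weights and the toy networks are hypothetical.

```python
# Generic weight-space merging illustration (not WARP's actual procedure).
import torch.nn as nn

def merge_weights(policy_a: nn.Module, policy_b: nn.Module, alpha: float = 0.5) -> dict:
    """Return a state dict that interpolates the parameters of two policies."""
    sd_a, sd_b = policy_a.state_dict(), policy_b.state_dict()
    return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

# Toy usage with two small linear "policies" sharing an architecture.
net_a, net_b = nn.Linear(4, 2), nn.Linear(4, 2)
merged = nn.Linear(4, 2)
merged.load_state_dict(merge_weights(net_a, net_b, alpha=0.5))
```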
arXiv Detail & Related papers (2024-06-24T16:24:34Z)
- On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes [50.68789924454235]
We present the first finite time global convergence analysis of policy gradient in the context of average reward Markov decision processes (MDPs).
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations.
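The translation from an $O\left(\frac{1}{T}\right)$ convergence rate to $O\left(\log(T)\right)$ regret is the standard harmonic-sum argument; a sketch with a generic constant $C$ (not taken from the paper):

$$
\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \big(\rho^{*} - \rho(\pi_t)\big) \;\le\; \sum_{t=1}^{T} \frac{C}{t} \;\le\; C\,(1 + \log T) \;=\; O\big(\log(T)\big).
$$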
arXiv Detail & Related papers (2024-03-11T15:25:03Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks from the D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Off-Policy Average Reward Actor-Critic with Deterministic Policy Search [3.551625533648956]
We present both on-policy and off-policy deterministic policy gradient theorems for the average reward performance criterion.
We also present an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) algorithm.
We compare the average-reward performance of the proposed ARO-DDPG against state-of-the-art on-policy average-reward actor-critic algorithms and observe better empirical performance on MuJoCo-based environments.
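For context, the on-policy deterministic policy gradient under the average-reward criterion is typically stated in the schematic form below, where $d_{\mu_\theta}$ is the stationary state distribution and $Q_{\mu_\theta}$ is the differential action-value function; the precise ergodicity conditions and the off-policy counterpart are given in the paper and are not reproduced here:

$$
\nabla_\theta\, \rho(\mu_\theta) \;=\; \mathbb{E}_{s \sim d_{\mu_\theta}}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q_{\mu_\theta}(s,a)\big|_{a=\mu_\theta(s)} \right].
$$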
arXiv Detail & Related papers (2023-05-20T17:13:06Z)
- Performance Bounds for Policy-Based Average Reward Reinforcement Learning Algorithms [11.013390624382259]
Many policy-based reinforcement learning (RL) algorithms can be viewed as instantiations of approximate policy iteration (PI).
In applications where the average reward objective is the meaningful performance metric, discounted reward formulations are often used with the discount factor being close to $1$, which is equivalent to making the expected horizon very large.
In this paper, we solve this open problem by obtaining the first finite-time error bounds for average-reward MDPs, and show that the error goes to zero in the limit as policy evaluation and policy improvement errors go to zero.
arXiv Detail & Related papers (2023-02-02T22:37:47Z)
- Combing Policy Evaluation and Policy Improvement in a Unified f-Divergence Framework [33.90259939664709]
We study the f-divergence between the learning policy and the sampling policy and derive a novel DRL framework, termed f-Divergence Reinforcement Learning (FRL).
The FRL framework achieves two advantages: (1) policy evaluation and policy improvement processes are derived simultaneously via the f-divergence; (2) the overestimation issue of the value function is alleviated.
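For reference, the f-divergence between the learning policy $\pi$ and the sampling policy $\beta$ at a state $s$ is commonly defined as below, for a convex $f$ with $f(1)=0$; choosing $f(x)=x\log x$ recovers the KL divergence. The specific choices of $f$ used by FRL are detailed in the paper.

$$
D_f\big(\pi(\cdot\mid s)\,\|\,\beta(\cdot\mid s)\big) \;=\; \mathbb{E}_{a \sim \beta(\cdot\mid s)}\!\left[ f\!\left(\frac{\pi(a\mid s)}{\beta(a\mid s)}\right) \right].
$$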
arXiv Detail & Related papers (2021-09-24T10:20:46Z)
- Offline RL Without Off-Policy Evaluation [49.11859771578969]
We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well.
This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark.
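Schematically, a one-step method of this kind first fits the behavior policy $\hat{\beta}$ and its on-policy value $\hat{Q}^{\beta}$ from the dataset $\mathcal{D}$, then performs a single regularized improvement step of roughly the following form; the KL regularizer and coefficient $\alpha$ here are illustrative and not necessarily the paper's exact choice:

$$
\pi_1 \;=\; \arg\max_{\pi}\; \mathbb{E}_{s \sim \mathcal{D}}\!\left[ \mathbb{E}_{a \sim \pi(\cdot\mid s)}\big[\hat{Q}^{\beta}(s,a)\big] \;-\; \alpha\, D_{\mathrm{KL}}\big(\pi(\cdot\mid s)\,\|\,\hat{\beta}(\cdot\mid s)\big) \right].
$$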
arXiv Detail & Related papers (2021-06-16T16:04:26Z)
- Average-Reward Reinforcement Learning with Trust Region Methods [6.7838662053567615]
We develop a unified trust region theory covering both the discounted and the average-reward criteria.
Under the average-reward criterion, a novel performance bound within the trust region is derived using Perturbation Analysis (PA) theory.
We propose a practical algorithm named Average Policy Optimization (APO) which improves the value estimation with a novel technique named Average Value Constraint.
arXiv Detail & Related papers (2021-06-07T09:19:42Z)
- Variance Penalized On-Policy and Off-Policy Actor-Critic [60.06593931848165]
We propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both the mean and the variance of the return.
Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.
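The variance-penalized criterion has the generic mean-variance form below, for return $G$ under policy $\pi_\theta$ and penalty coefficient $\lambda \ge 0$; the specific estimators the paper uses for the variance term are not reproduced here:

$$
J(\theta) \;=\; \mathbb{E}_{\pi_\theta}[G] \;-\; \lambda\, \mathrm{Var}_{\pi_\theta}[G].
$$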
arXiv Detail & Related papers (2021-02-03T10:06:16Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average-reward setting with a variance risk criterion.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
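As a rough reconstruction of why a Fenchel dual variable appears: with average reward $\rho(\theta)$, the long-run reward variance contains the nonlinear term $\rho(\theta)^2$, which Fenchel duality linearizes, turning the variance-constrained problem into a saddle-point problem over the policy parameters, a Lagrange multiplier, and the dual variable. The paper's exact formulation may differ.

$$
\max_{\theta}\ \rho(\theta) \;\; \text{s.t.} \;\; \Lambda(\theta) \le c, \qquad \Lambda(\theta) \;=\; \mathbb{E}_{(s,a) \sim d_{\pi_\theta}}\!\big[r(s,a)^2\big] - \rho(\theta)^2, \qquad \rho(\theta)^2 \;=\; \max_{y}\ \big(2\,y\,\rho(\theta) - y^2\big).
$$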
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.