Average-Reward Reinforcement Learning with Trust Region Methods
- URL: http://arxiv.org/abs/2106.03442v1
- Date: Mon, 7 Jun 2021 09:19:42 GMT
- Title: Average-Reward Reinforcement Learning with Trust Region Methods
- Authors: Xiaoteng Ma, Xiaohang Tang, Li Xia, Jun Yang, Qianchuan Zhao
- Abstract summary: We develop a unified trust region theory with discounted and average criteria.
Under the average criterion, a novel performance bound within the trust region is derived using Perturbation Analysis (PA) theory.
We propose a practical algorithm named Average Policy Optimization (APO), which improves value estimation with a novel technique called the Average Value Constraint.
- Score: 6.7838662053567615
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Most reinforcement learning algorithms optimize the discounted criterion,
which helps accelerate convergence and reduce the variance of estimates. While
the discounted criterion is appropriate for certain tasks such as
finance-related problems, many engineering problems treat future rewards
equally and prefer a long-run average criterion. In this paper, we study the
reinforcement learning problem with the long-run average criterion. Firstly,
we develop a unified trust region theory covering both the discounted and
average criteria. Under the average criterion, a novel performance bound
within the trust region is derived using Perturbation Analysis (PA) theory.
Secondly, we propose a practical algorithm named Average Policy Optimization
(APO), which improves value estimation with a novel technique called the
Average Value Constraint. To the best of our knowledge, our work is the first
to study the trust region approach under the average criterion, and it
complements the reinforcement learning framework beyond the discounted
criterion. Finally, experiments are conducted in the continuous control
environment MuJoCo. In most tasks, APO performs better than discounted PPO,
which demonstrates the effectiveness of our approach.
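To make the contrast between the two criteria concrete, here is a minimal Python sketch comparing a discounted return with a long-run average reward on a single synthetic reward sequence; the function names and the synthetic trajectory are illustrative assumptions and are not part of the paper or its APO algorithm.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    # Discounted criterion: the reward at step t is weighted by gamma**t,
    # so rewards far in the future contribute almost nothing.
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

def long_run_average_reward(rewards):
    # Average criterion: every step of the trajectory is weighted equally.
    return float(np.mean(rewards))

# Synthetic single long trajectory of rewards (a stand-in for rollouts
# collected in a continuous control environment).
rng = np.random.default_rng(0)
rewards = rng.normal(loc=1.0, scale=0.5, size=10_000)

print("discounted return  :", discounted_return(rewards, gamma=0.99))
print("long-run avg reward:", long_run_average_reward(rewards))
```

The sketch only illustrates the objectives themselves: under the discounted criterion the contribution of step t shrinks geometrically, whereas the average criterion treats future rewards equally, which is the setting the paper targets.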
Related papers
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z)
- Off-Policy Average Reward Actor-Critic with Deterministic Policy Search [3.551625533648956]
We present both on-policy and off-policy deterministic policy gradient theorems for the average reward performance criterion.
We also present an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) algorithm.
We evaluate the average-reward performance of the proposed ARO-DDPG and observe better empirical performance than state-of-the-art on-policy average-reward actor-critic algorithms on MuJoCo-based environments.
arXiv Detail & Related papers (2023-05-20T17:13:06Z)
- Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation [54.72195809248172]
We present a new estimator built on a novel concept: retrospectively reshuffling participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
- Examining average and discounted reward optimality criteria in reinforcement learning [4.873362301533825]
Two major optimality criteria are average and discounted rewards, where the latter is typically considered an approximation to the former.
While the discounted reward is more popular, it is problematic to apply in environments that have no natural notion of discounting.
Our contributions include a thorough examination of the relationship between average and discounted rewards, as well as a discussion of their pros and cons in RL; the standard limiting relation between the two criteria is sketched after this list.
arXiv Detail & Related papers (2021-07-03T05:28:56Z)
- On-Policy Deep Reinforcement Learning for the Average-Reward Criterion [9.343119070691735]
We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL).
In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJoCo environments.
arXiv Detail & Related papers (2021-06-14T12:12:09Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average-reward setting with a variance risk criterion.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or more logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- CoinDICE: Off-Policy Confidence Interval Estimation [107.86876722777535]
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning.
We show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
arXiv Detail & Related papers (2020-10-22T12:39:11Z)
- Average Reward Adjusted Discounted Reinforcement Learning: Near-Blackwell-Optimal Policies for Real-World Applications [0.0]
Reinforcement learning aims at finding the best stationary policy for a given Markov Decision Process.
This paper provides deep theoretical insights into the widely applied standard discounted reinforcement learning framework.
We establish a novel near-Blackwell-optimal reinforcement learning algorithm.
arXiv Detail & Related papers (2020-04-02T08:05:18Z)
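As background for the entry above on examining average and discounted reward optimality criteria, the standard limiting relation between the two criteria can be sketched as follows; this is textbook material for ergodic (unichain) MDPs under a fixed policy, not a result claimed by any paper in this list.

```latex
% Long-run average reward and discounted value of a fixed policy \pi
\rho(\pi) = \lim_{T \to \infty} \frac{1}{T}\,
            \mathbb{E}_\pi\!\left[\sum_{t=0}^{T-1} r_t\right],
\qquad
V_\gamma^{\pi}(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s\right],
\qquad
\lim_{\gamma \to 1}\, (1-\gamma)\, V_\gamma^{\pi}(s) = \rho(\pi).
```

In this sense the discounted criterion approximates the average criterion only as the discount factor approaches one, which is the regime the average-reward methods above avoid by optimizing \rho(\pi) directly.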
This list is automatically generated from the titles and abstracts of the papers on this site.