Examining average and discounted reward optimality criteria in
reinforcement learning
- URL: http://arxiv.org/abs/2107.01348v1
- Date: Sat, 3 Jul 2021 05:28:56 GMT
- Title: Examining average and discounted reward optimality criteria in
reinforcement learning
- Authors: Vektor Dewanto, Marcus Gallagher
- Abstract summary: Two major optimality criteria are average and discounted rewards, where the latter is typically considered an approximation to the former.
While the discounted reward is more popular, it is problematic to apply in environments that have no natural notion of discounting.
Our contributions include a thorough examination of the relationship between average and discounted rewards, as well as a discussion of their pros and cons in RL.
- Score: 4.873362301533825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In reinforcement learning (RL), the goal is to obtain an optimal policy, for
which the optimality criterion is fundamentally important. Two major optimality
criteria are average and discounted rewards, where the latter is typically
considered an approximation to the former. While the discounted reward is
more popular, it is problematic to apply in environments that have no natural
notion of discounting. This motivates us to revisit a) the progression of
optimality criteria in dynamic programming, b) justification for and
complication of an artificial discount factor, and c) benefits of directly
maximizing the average reward. Our contributions include a thorough examination
of the relationship between average and discounted rewards, as well as a
discussion of their pros and cons in RL. We emphasize that average-reward RL
methods possess the ingredient and mechanism for developing the general
discounting-free optimality criterion (Veinott, 1969) in RL.
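As a brief illustration of the relationship examined in the paper (a sketch in standard notation for a stationary policy \pi in a finite MDP; the notation is ours, not taken verbatim from the paper), the average reward (gain) and the discounted value are

  \rho^\pi(s) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\Big[ \sum_{t=0}^{T-1} r_t \,\Big|\, s_0 = s \Big],
  \qquad
  V_\gamma^\pi(s) = \mathbb{E}_\pi\Big[ \sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s \Big],

and the Laurent expansion

  V_\gamma^\pi(s) = \frac{\rho^\pi(s)}{1-\gamma} + h^\pi(s) + e_\gamma^\pi(s), \quad e_\gamma^\pi(s) \to 0 \text{ as } \gamma \to 1,

with bias term h^\pi, makes the approximation explicit: (1-\gamma)\, V_\gamma^\pi(s) \to \rho^\pi(s) as \gamma \to 1, so maximizing the discounted value with \gamma close to 1 approximately maximizes the gain.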
Related papers
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - COPR: Continual Human Preference Learning via Optimal Policy
Regularization [56.1193256819677]
Reinforcement Learning from Human Feedback (RLHF) is commonly utilized to improve the alignment of Large Language Models (LLMs) with human preferences.
We propose the Continual Optimal Policy Regularization (COPR) method, which draws inspiration from the optimal policy theory.
arXiv Detail & Related papers (2024-02-22T02:20:08Z) - Reinforcement Learning from Diverse Human Preferences [68.4294547285359]
This paper develops a method for crowd-sourcing preference labels and learning from diverse human preferences.
The proposed method is tested on a variety of tasks in DMcontrol and Meta-world.
It has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback.
arXiv Detail & Related papers (2023-01-27T15:18:54Z) - B-Pref: Benchmarking Preference-Based Reinforcement Learning [84.41494283081326]
We introduce B-Pref, a benchmark specially designed for preference-based RL.
A key challenge with such a benchmark is providing the ability to evaluate candidate algorithms quickly.
B-Pref alleviates this by simulating teachers with a wide array of irrationalities.
arXiv Detail & Related papers (2021-11-04T17:32:06Z) - On-Policy Deep Reinforcement Learning for the Average-Reward Criterion [9.343119070691735]
We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL).
In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJoCo environments.
arXiv Detail & Related papers (2021-06-14T12:12:09Z) - Average-Reward Reinforcement Learning with Trust Region Methods [6.7838662053567615]
We develop a unified trust region theory with discounted and average criteria.
With the average criterion, a novel performance bound within the trust region is derived using Perturbation Analysis (PA) theory.
We propose a practical algorithm named Average Policy Optimization (APO) which improves the value estimation with a novel technique named Average Value Constraint.
arXiv Detail & Related papers (2021-06-07T09:19:42Z) - A nearly Blackwell-optimal policy gradient method [4.873362301533825]
We develop a policy gradient method that optimizes the gain first, then the bias.
We propose an algorithm that solves the corresponding bi-level optimization using a logarithmic barrier (see the illustrative sketch after this list).
arXiv Detail & Related papers (2021-05-28T06:37:02Z) - Learning Fair Policies in Multiobjective (Deep) Reinforcement Learning
with Average and Discounted Rewards [15.082715993594121]
We investigate the problem of learning a policy that treats its users equitably.
In this paper, we formulate this novel RL problem, in which an objective function encoding a notion of fairness is optimized.
We describe how several classic deep RL algorithms can be adapted to our fair optimization problem.
arXiv Detail & Related papers (2020-08-18T07:17:53Z) - Temporal-Logic-Based Reward Shaping for Continuing Learning Tasks [57.17673320237597]
In continuing tasks, average-reward reinforcement learning may be a more appropriate problem formulation than the more common discounted reward formulation.
This paper presents the first reward shaping framework for average-reward learning.
It proves that, under standard assumptions, the optimal policy under the original reward function can be recovered.
arXiv Detail & Related papers (2020-07-03T05:06:57Z) - Preference-based Reinforcement Learning with Finite-Time Guarantees [76.88632321436472]
Preference-based Reinforcement Learning (PbRL) replaces reward values in traditional reinforcement learning with preference feedback to better elicit human opinion on the target objective.
Despite promising results in applications, the theoretical understanding of PbRL is still in its infancy.
We present the first finite-time analysis for general PbRL problems.
arXiv Detail & Related papers (2020-06-16T03:52:41Z)
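To make the gain-then-bias idea in the nearly Blackwell-optimal entry above more concrete (a minimal sketch in standard notation; the exact formulation in that paper may differ), the bi-level problem is to maximize the bias over the set of (near-)gain-optimal policies,

  \max_{\theta}\; b(\theta) \quad \text{subject to} \quad g(\theta) \ge g^* - \epsilon,

where g(\theta) is the gain of policy \pi_\theta, b(\theta) its bias, and g^* the optimal gain. A logarithmic-barrier relaxation turns this into the single unconstrained objective

  \max_{\theta}\; b(\theta) + \lambda \log\big( g(\theta) - (g^* - \epsilon) \big),

which can be ascended with policy-gradient estimates of both terms.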
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.