Examining average and discounted reward optimality criteria in
reinforcement learning
- URL: http://arxiv.org/abs/2107.01348v1
- Date: Sat, 3 Jul 2021 05:28:56 GMT
- Title: Examining average and discounted reward optimality criteria in
reinforcement learning
- Authors: Vektor Dewanto, Marcus Gallagher
- Abstract summary: Two major optimality criteria are average and discounted rewards, where the latter is typically considered an approximation to the former.
While the discounted reward is more popular, it is problematic to apply in environments that have no natural notion of discounting.
Our contributions include a thorough examination of the relationship between average and discounted rewards, as well as a discussion of their pros and cons in RL.
- Score: 4.873362301533825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In reinforcement learning (RL), the goal is to obtain an optimal policy, for
which the optimality criterion is fundamentally important. Two major optimality
criteria are average and discounted rewards, where the latter is typically
considered an approximation to the former. While the discounted reward is
more popular, it is problematic to apply in environments that have no natural
notion of discounting. This motivates us to revisit a) the progression of
optimality criteria in dynamic programming, b) justification for and
complication of an artificial discount factor, and c) benefits of directly
maximizing the average reward. Our contributions include a thorough examination
of the relationship between average and discounted rewards, as well as a
discussion of their pros and cons in RL. We emphasize that average-reward RL
methods possess the ingredient and mechanism for developing the general
discounting-free optimality criterion (Veinott, 1969) in RL.
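As a brief illustration of the relationship examined in the paper (a sketch in standard notation for a stationary policy \pi in a finite MDP; the notation is ours, not taken verbatim from the paper), the average reward (gain) and the discounted value are

  \rho^\pi(s) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\Big[ \sum_{t=0}^{T-1} r_t \,\Big|\, s_0 = s \Big],
  \qquad
  V_\gamma^\pi(s) = \mathbb{E}_\pi\Big[ \sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s \Big],

and the Laurent expansion

  V_\gamma^\pi(s) = \frac{\rho^\pi(s)}{1-\gamma} + h^\pi(s) + e_\gamma^\pi(s), \quad e_\gamma^\pi(s) \to 0 \text{ as } \gamma \to 1,

with bias term h^\pi, makes the approximation explicit: (1-\gamma)\, V_\gamma^\pi(s) \to \rho^\pi(s) as \gamma \to 1, so maximizing the discounted value with \gamma close to 1 approximately maximizes the gain.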
Related papers
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - COPR: Continual Human Preference Learning via Optimal Policy
Regularization [56.1193256819677]
Reinforcement Learning from Human Feedback (RLHF) is commonly utilized to improve the alignment of Large Language Models (LLMs) with human preferences.
We propose the Continual Optimal Policy Regularization (COPR) method, which draws inspiration from the optimal policy theory.
arXiv Detail & Related papers (2024-02-22T02:20:08Z) - Reinforcement Learning from Diverse Human Preferences [68.4294547285359]
This paper develops a method for crowd-sourcing preference labels and learning from diverse human preferences.
The proposed method is tested on a variety of tasks in DMcontrol and Meta-world.
It has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback.
arXiv Detail & Related papers (2023-01-27T15:18:54Z) - B-Pref: Benchmarking Preference-Based Reinforcement Learning [84.41494283081326]
We introduce B-Pref, a benchmark specially designed for preference-based RL.
A key challenge with such a benchmark is providing the ability to evaluate candidate algorithms quickly.
B-Pref alleviates this by simulating teachers with a wide array of irrationalities.
arXiv Detail & Related papers (2021-11-04T17:32:06Z) - On-Policy Deep Reinforcement Learning for the Average-Reward Criterion [9.343119070691735]
We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL).
In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJoCo environments.
arXiv Detail & Related papers (2021-06-14T12:12:09Z) - Average-Reward Reinforcement Learning with Trust Region Methods [6.7838662053567615]
We develop a unified trust region theory with discounted and average criteria.
With the average criterion, a novel performance bound within the trust region is derived using Perturbation Analysis (PA) theory.
We propose a practical algorithm named Average Policy Optimization (APO) which improves the value estimation with a novel technique named Average Value Constraint.
arXiv Detail & Related papers (2021-06-07T09:19:42Z) - A nearly Blackwell-optimal policy gradient method [4.873362301533825]
We develop a policy gradient method that optimizes the gain first, then the bias.
We propose an algorithm that solves the corresponding bi-level optimization using a logarithmic barrier (see the illustrative sketch after this list).
arXiv Detail & Related papers (2021-05-28T06:37:02Z) - Learning Fair Policies in Multiobjective (Deep) Reinforcement Learning
with Average and Discounted Rewards [15.082715993594121]
We investigate the problem of learning a policy that treats its users equitably.
In this paper, we formulate this novel RL problem, in which an objective function encoding a notion of fairness is optimized.
We describe how several classic deep RL algorithms can be adapted to our fair optimization problem.
arXiv Detail & Related papers (2020-08-18T07:17:53Z) - Temporal-Logic-Based Reward Shaping for Continuing Learning Tasks [57.17673320237597]
In continuing tasks, average-reward reinforcement learning may be a more appropriate problem formulation than the more common discounted reward formulation.
This paper presents the first reward shaping framework for average-reward learning.
It proves that, under standard assumptions, the optimal policy under the original reward function can be recovered.
arXiv Detail & Related papers (2020-07-03T05:06:57Z) - Preference-based Reinforcement Learning with Finite-Time Guarantees [76.88632321436472]
Preference-based Reinforcement Learning (PbRL) replaces reward values in traditional reinforcement learning with preference feedback to better elicit human opinion on the target objective.
Despite promising results in applications, the theoretical understanding of PbRL is still in its infancy.
We present the first finite-time analysis for general PbRL problems.
arXiv Detail & Related papers (2020-06-16T03:52:41Z)
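To make the gain-then-bias idea in the nearly Blackwell-optimal entry above more concrete (a minimal sketch in standard notation; the exact formulation in that paper may differ), the bi-level problem is to maximize the bias over the set of (near-)gain-optimal policies,

  \max_{\theta}\; b(\theta) \quad \text{subject to} \quad g(\theta) \ge g^* - \epsilon,

where g(\theta) is the gain of policy \pi_\theta, b(\theta) its bias, and g^* the optimal gain. A logarithmic-barrier relaxation turns this into the single unconstrained objective

  \max_{\theta}\; b(\theta) + \lambda \log\big( g(\theta) - (g^* - \epsilon) \big),

which can be ascended with policy-gradient estimates of both terms.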
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.