Related papers: Analyzing and Bridging the Gap between Maximizing Total Reward and Discounted Reward in Deep Reinforcement Learning

URL: http://arxiv.org/abs/2407.13279v2
Date: Tue, 18 Mar 2025 07:27:15 GMT
Title: Analyzing and Bridging the Gap between Maximizing Total Reward and Discounted Reward in Deep Reinforcement Learning
Authors: Shuyu Yin, Fei Wen, Peilin Liu, Tao Luo
Abstract summary: The optimal objective is a fundamental aspect of reinforcement learning (RL). While total return is ideal, discounted return is the practical objective due to its stability. We propose two alternative approaches to align the objectives.
Score: 17.245293915129942
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The optimal objective is a fundamental aspect of reinforcement learning (RL), as it determines how policies are evaluated and optimized. While total return maximization is the ideal objective in RL, discounted return maximization is the practical objective due to its stability. This can lead to a misalignment of objectives. To better understand the problem, we theoretically analyze the performance gap between the policy that maximizes the total return and the policy that maximizes the discounted return. Our analysis reveals that increasing the discount factor can be ineffective at eliminating this gap when the environment contains cyclic states, a frequent scenario. To address this issue, we propose two alternative approaches to align the objectives. The first approach achieves alignment by modifying the terminal state value, treating it as a tunable hyper-parameter whose suitable range is defined through theoretical analysis. The second approach focuses on calibrating the reward data in trajectories, enabling alignment in practical Deep RL applications using off-policy algorithms. This method enhances robustness to the discount factor and improves performance when the trajectory length is large. Our proposed methods demonstrate that adjusting reward data can achieve alignment, providing an insight that can be leveraged to design new optimization objectives to fundamentally enhance the performance of RL algorithms.
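The misalignment described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's method or analysis; the two reward sequences, the function names, and the discount values are all hypothetical, chosen only to show that total return and discounted return can rank the same two trajectories differently. It also does not reproduce the cyclic-state result: in this acyclic toy case a larger discount factor does restore the ordering.

```python
def total_return(rewards):
    """Undiscounted sum of rewards along a trajectory."""
    return sum(rewards)


def discounted_return(rewards, gamma):
    """Sum of rewards weighted by gamma**t, with t starting at 0."""
    return sum(r * gamma**t for t, r in enumerate(rewards))


# Hypothetical trajectories: one pays a small reward immediately,
# the other pays a larger reward only at the final step.
early = [1.0, 0.0, 0.0, 0.0, 0.0]
late = [0.0, 0.0, 0.0, 0.0, 1.5]


def prefers_late(gamma=None):
    """True if the chosen objective ranks `late` above `early`.

    gamma=None means the total (undiscounted) return is used.
    """
    if gamma is None:
        return total_return(late) > total_return(early)
    return discounted_return(late, gamma) > discounted_return(early, gamma)


if __name__ == "__main__":
    print("total return prefers late:", prefers_late())
    print("gamma=0.90 prefers late:", prefers_late(0.90))
    print("gamma=0.99 prefers late:", prefers_late(0.99))
```

With these assumed numbers, the total return favors the late-reward trajectory, while a discount factor of 0.9 favors the early-reward one, since 1.5 × 0.9⁴ is slightly below 1.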

Related papers

Adversarial Constrained Policy Optimization: Improving Constrained Reinforcement Learning by Adapting Budgets [6.5472155063246085]
Constrained reinforcement learning has achieved promising progress in safety-critical fields where both rewards and constraints are considered. We propose Adversarial Constrained Policy Optimization (ACPO), which enables simultaneous optimization of reward and the adaptation of cost budgets during training.
arXiv Detail & Related papers (2024-10-28T07:04:32Z)
Reinforcement Learning with LTL and $ω$-Regular Objectives via Optimality-Preserving Translation to Average Rewards [43.816375964005026]
Linear temporal logic (LTL) and, more generally, $ω$-regular objectives are alternatives to the traditional discount sum and average reward objectives in reinforcement learning. We show that each RL problem for $ω$-regular objectives can be reduced to a limit-average reward problem in an optimality-preserving fashion.
arXiv Detail & Related papers (2024-10-16T02:42:37Z)
Meta-Reinforcement Learning with Universal Policy Adaptation: Provable Near-Optimality under All-task Optimum Comparator [9.900800253949512]
We develop a bilevel optimization framework for meta-RL (BO-MRL) to learn the meta-prior for task-specific policy adaptation. We empirically validate the correctness of the derived upper bounds and demonstrate the superior effectiveness of the proposed algorithm over benchmarks.
arXiv Detail & Related papers (2024-10-13T05:17:58Z)
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data. We show that DPO, derived from the optimal solution of the problem, leads in practice to a compromised mean-seeking approximation of that optimal solution. We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z)
Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems. The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy. We show that a simple alternative approach based on the "implicit exploration" estimator of citet2015 yields performance guarantees that are superior in nearly all possible terms to all previous results.
arXiv Detail & Related papers (2023-09-27T16:42:10Z)
PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback [106.63518036538163]
We present a novel unified bilevel optimization-based framework, PARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning. Our framework addressed these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable. Our empirical results substantiate that the proposed PARL can address the alignment concerns in RL by showing significant improvements.
arXiv Detail & Related papers (2023-08-03T18:03:44Z)
Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback. We consider the general reward setting where the reward can be defined over the whole trajectory. We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z)
Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy. Many algorithms for IRL have an inherently nested structure. We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
RL-PGO: Reinforcement Learning-based Planar Pose-Graph Optimization [1.4884785898657995]
This paper presents a state-of-the-art Deep Reinforcement Learning (DRL) based environment and proposed agent for 2D pose-graph optimization. We demonstrate that the pose-graph optimization problem can be modeled as a partially observable Decision Process and evaluate performance on real-world and synthetic datasets.
arXiv Detail & Related papers (2022-02-26T20:10:14Z)
Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used. Second, to explain these findings we introduce the concept of committal rate for policy optimization. Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z)
State Augmented Constrained Reinforcement Learning: Overcoming the Limitations of Learning with Rewards [88.30521204048551]
A common formulation of constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds. We show a simple example in which the desired optimal policy cannot be induced by any weighted linear combination of rewards. This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods.
arXiv Detail & Related papers (2021-02-23T21:07:35Z)
Regularized Policies are Reward Robust [33.05828095421357]
We study the effects of regularization of policies in Reinforcement Learning (RL). We find that the optimal policy found by a regularized objective is precisely an optimal policy of a reinforcement learning problem under a worst-case adversarial reward. Our results thus give insights into the effects of regularization of policies and deepen our understanding of exploration through robust rewards at large.
arXiv Detail & Related papers (2021-01-18T11:38:47Z)
Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria. We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints. This is the first-time analysis of SRL algorithms with global optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
Batch Policy Learning in Average Reward Markov Decision Processes [3.9023554886892438]
Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward. We develop an optimization algorithm to compute the optimal policy in a parameterized policy class. The performance of the estimated policy is measured by the difference between the optimal average reward in the policy class and the average reward of the estimated policy.
arXiv Detail & Related papers (2020-07-23T03:28:14Z)
Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL). We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another. Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.