Reward Tweaking: Maximizing the Total Reward While Planning for Short Horizons
- URL: http://arxiv.org/abs/2002.03327v2
- Date: Tue, 23 Jun 2020 12:45:09 GMT
- Title: Reward Tweaking: Maximizing the Total Reward While Planning for Short Horizons
- Authors: Chen Tessler and Shie Mannor
- Abstract summary: Reward tweaking learns a surrogate reward function that induces optimal behavior on the original finite-horizon total reward task.
We show that reward tweaking guides the agent towards better long-horizon returns although it plans for short horizons.
- Score: 66.43848057122311
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In reinforcement learning, the discount factor $\gamma$ controls the agent's
effective planning horizon. Traditionally, this parameter was considered part
of the MDP; however, as deep reinforcement learning algorithms tend to become
unstable when the effective planning horizon is long, recent works refer to
$\gamma$ as a hyper-parameter -- thus changing the underlying MDP and
potentially leading the agent towards sub-optimal behavior on the original
task. In this work, we introduce \emph{reward tweaking}. Reward tweaking learns
a surrogate reward function $\tilde r$ for the discounted setting that induces
optimal behavior on the original finite-horizon total reward task.
Theoretically, we show that there exists a surrogate reward that leads to
optimality in the original task and discuss the robustness of our approach.
Additionally, we perform experiments in high-dimensional continuous control
tasks and show that reward tweaking guides the agent towards better
long-horizon returns although it plans for short horizons.
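To make the role of $\gamma$ concrete, here is a minimal toy sketch (a hypothetical chain MDP, not the authors' implementation or one of their benchmarks) showing how a heavily discounted planner can prefer a myopic action even though the undiscounted finite-horizon objective favors the delayed reward. Reward tweaking's premise is that a learned surrogate reward $\tilde r$ would let the short-horizon planner recover the total-reward-optimal behavior; the sketch only illustrates the gap that motivates this.

```python
# Toy illustration (assumed example, not from the paper): a small discount
# factor gamma can change the optimal policy of the original finite-horizon
# total-reward task.
import numpy as np

# Chain MDP: states 0..N-1.  Action 0 = "collect a small reward now and stay",
#            action 1 = "move right toward a large delayed reward".
N, H = 6, 10                      # number of states, finite horizon (assumed)
R_SMALL, R_BIG = 0.1, 1.0         # immediate vs. delayed reward (assumed)

def step(s, a):
    """Deterministic transition: return (next_state, reward)."""
    if a == 0:                    # stay and take the small reward
        return s, R_SMALL
    if s == N - 1:                # end of the chain: big reward
        return s, R_BIG
    return s + 1, 0.0             # move right, no reward yet

def greedy_policy(gamma, horizon=None):
    """Value iteration; horizon=None means (approximate) infinite-horizon discounted."""
    V = np.zeros(N)
    sweeps = horizon if horizon is not None else 1000
    for _ in range(sweeps):
        Q = np.zeros((N, 2))
        for s in range(N):
            for a in (0, 1):
                s2, r = step(s, a)
                Q[s, a] = r + gamma * V[s2]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# Undiscounted finite-horizon (the "original" objective): moving right is optimal everywhere.
print("total-reward policy:", greedy_policy(gamma=1.0, horizon=H))
# Heavily discounted (short effective planning horizon): the myopic action wins
# in the early states, so the agent is sub-optimal on the original task.
print("small-gamma policy :", greedy_policy(gamma=0.5))
# Reward tweaking would learn a surrogate reward r~ so that planning with the
# small gamma still induces the total-reward-optimal policy (see the abstract above).
```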
Related papers
- Optimistically Optimistic Exploration for Provably Efficient Infinite-Horizon Reinforcement and Imitation Learning [13.429541377715296]
We propose the first computationally efficient algorithm achieving near-optimal regret guarantees in infinite-horizon discounted linear Markov decision processes.
We show that, combined with a regularized approximate dynamic-programming scheme, the resulting algorithm achieves a regret of order $\tilde{\mathcal{O}}(\sqrt{d^3 (1 - \gamma)^{-7/2} T})$, where $T$ is the total number of sample transitions, $\gamma \in (0,1)$ is the discount factor, and $d$ is the feature dimensionality.
arXiv Detail & Related papers (2025-02-19T17:32:35Z)
- REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world.
Recent methods aim to mitigate misalignment by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Behavior Alignment via Reward Function Optimization [23.92721220310242]
We introduce a new framework that integrates auxiliary rewards reflecting a designer's domain knowledge with the environment's primary rewards.
We evaluate our method's efficacy on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges.
arXiv Detail & Related papers (2023-10-29T13:45:07Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
- Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce \emph{anti-concentrated confidence bounds} for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
arXiv Detail & Related papers (2021-10-21T15:25:15Z)
- Hindsight Reward Tweaking via Conditional Deep Reinforcement Learning [37.61951923445689]
We propose a novel paradigm for deep reinforcement learning to model the influences of reward functions within a near-optimal space.
We demonstrate the feasibility of this approach and study one of its potential applications, boosting policy performance, on multiple MuJoCo tasks.
arXiv Detail & Related papers (2021-09-06T10:06:48Z)
- Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new \emph{upper confidence primal-dual} algorithm, which only requires trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
- Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.