Reward prediction for representation learning and reward shaping
- URL: http://arxiv.org/abs/2105.03172v1
- Date: Fri, 7 May 2021 11:29:32 GMT
- Title: Reward prediction for representation learning and reward shaping
- Authors: Hlynur Davíð Hlynsson, Laurenz Wiskott
- Abstract summary: We propose learning a state representation in a self-supervised manner for reward prediction.
We augment the training of out-of-the-box RL agents by shaping the reward using our reward predictor during policy learning.
- Score: 0.8883733362171032
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the fundamental challenges in reinforcement learning (RL) is data efficiency: modern algorithms require a very large number of training samples, especially compared to humans, to solve environments with high-dimensional observations. The problem is even more severe when the reward signal is sparse. In this work, we propose learning a state representation in a self-supervised manner for reward prediction. The reward predictor learns to estimate either a raw or a smoothed version of the true reward signal in environments with a single, terminating goal state. We augment the training of out-of-the-box RL agents by shaping the reward with our reward predictor during policy learning. Using our representation to preprocess high-dimensional observations, together with using the predictor for reward shaping, is shown to significantly improve Actor Critic using Kronecker-Factored Trust Region (ACKTR) and Proximal Policy Optimization (PPO) in single-goal environments with visual inputs.
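The abstract describes two uses of a self-supervised reward predictor: its encoder doubles as a preprocessor for high-dimensional observations, and its predictions shape the environment reward during policy learning. The sketch below illustrates that setup in PyTorch; it is not the authors' code, and the network architecture, the additive shaping bonus with coefficient beta, and the Gym-style 5-tuple step API are assumptions made for illustration.

```python
# Hedged sketch, not the authors' code: a reward predictor trained by
# regression on (observation, reward) pairs, whose encoder preprocesses
# high-dimensional observations and whose output shapes the env reward.
import torch
import torch.nn as nn

class EncoderRewardPredictor(nn.Module):
    """Self-supervised state representation plus reward head."""
    def __init__(self, in_channels: int, repr_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(            # assumed architecture
            nn.Conv2d(in_channels, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(repr_dim), nn.ReLU(),
        )
        self.reward_head = nn.Linear(repr_dim, 1)

    def forward(self, obs: torch.Tensor):
        z = self.encoder(obs)                    # representation for the RL agent
        return z, self.reward_head(z).squeeze(-1)

def train_step(model, optimizer, obs_batch, reward_targets):
    """Regress the raw or temporally smoothed reward from observations."""
    _, pred = model(obs_batch)
    loss = nn.functional.mse_loss(pred, reward_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

class RewardShapingWrapper:
    """Adds a bonus from the frozen predictor to the environment reward
    (a simple additive bonus; the paper's exact shaping term may differ)."""
    def __init__(self, env, model, beta: float = 0.1):
        self.env, self.model, self.beta = env, model, beta

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        with torch.no_grad():
            x = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
            _, pred_reward = self.model(x)
        return obs, reward + self.beta * pred_reward.item(), terminated, truncated, info
```

The shaped environment and the frozen encoder output would then be handed to an out-of-the-box agent such as ACKTR or PPO, as stated in the abstract.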
Related papers
- PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models [13.313186665410486]
Reward finetuning has emerged as a promising approach to aligning foundation models with downstream objectives.
Existing reward finetuning methods are limited by their instability on large-scale prompt datasets.
We propose Proximal Reward Difference Prediction (PRDP) to enable stable black-box reward finetuning for diffusion models.
arXiv Detail & Related papers (2024-02-13T18:58:16Z)
- Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use these attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
arXiv Detail & Related papers (2024-02-01T17:10:35Z)
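The Dense Reward for Free entry above redistributes a sequence-level reward over the whole completion using the reward model's attention weights. A minimal sketch of that redistribution step, assuming the per-token attention weights have already been extracted from the reward model (the extraction and the normalization are illustrative assumptions, not the paper's exact procedure):

```python
# Hedged sketch, not the paper's implementation: turn a single sequence-level
# reward into per-token rewards proportional to attention weights.
import torch

def redistribute_reward(sequence_reward: float,
                        attention_weights: torch.Tensor) -> torch.Tensor:
    """attention_weights: (T,) nonnegative weights over completion tokens,
    e.g. attention paid to each token when the reward model scores the sequence."""
    w = attention_weights.clamp(min=0)
    w = w / w.sum().clamp(min=1e-8)   # normalize to a distribution
    return sequence_reward * w        # dense per-token rewards, summing to the original

# Example: a completion of 5 tokens with scalar reward 1.0
dense = redistribute_reward(1.0, torch.tensor([0.1, 0.4, 0.1, 0.3, 0.1]))
```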
- DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing [60.21269454707625]
DreamSmooth learns to predict a temporally-smoothed reward, instead of the exact reward at the given timestep.
We show that DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks.
arXiv Detail & Related papers (2023-11-02T17:57:38Z)
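The DreamSmooth entry above replaces the exact per-step reward with a temporally smoothed prediction target. A minimal sketch of constructing such a target with a Gaussian kernel over an episode's reward sequence (the kernel choice and width are assumptions for illustration, not the paper's exact smoothing):

```python
# Hedged sketch: build a temporally smoothed reward target for one episode.
# The smoothing kernel and its width are illustrative assumptions.
import torch

def smoothed_reward_targets(rewards: torch.Tensor, sigma: float = 3.0) -> torch.Tensor:
    """rewards: (T,) per-step rewards of one episode; returns (T,) smoothed targets."""
    T = rewards.shape[0]
    t = torch.arange(T, dtype=torch.float32)
    # Gaussian weights between every pair of timesteps, normalized per row
    dist = (t.unsqueeze(0) - t.unsqueeze(1)) ** 2
    kernel = torch.exp(-dist / (2 * sigma ** 2))
    kernel = kernel / kernel.sum(dim=1, keepdim=True)
    return kernel @ rewards

# Example: a sparse reward arriving only at the terminal step
targets = smoothed_reward_targets(torch.tensor([0., 0., 0., 0., 1.]))
```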
- Sample Complexity of Preference-Based Nonparametric Off-Policy Evaluation with Deep Networks [58.469818546042696]
We study the sample efficiency of OPE with human preference and establish a statistical guarantee for it.
By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2023-10-16T16:27:06Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL) -- inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large number of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor.
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
arXiv Detail & Related papers (2022-03-18T16:50:38Z)
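The SURF entry above relies on confidence-based pseudo-labelling of unlabeled segment pairs for reward learning. A minimal sketch of that filtering step, assuming a preference predictor that outputs the probability that the first segment of each pair is preferred (the confidence threshold is an illustrative assumption):

```python
# Hedged sketch: pseudo-label unlabeled segment pairs when the preference
# predictor is confident enough; the 0.95 threshold is an assumption.
import torch

def pseudo_label_pairs(pref_probs: torch.Tensor, threshold: float = 0.95):
    """pref_probs: (N,) predicted probability that segment A of each pair is preferred.
    Returns (indices, labels) for pairs confident enough to use as training data."""
    confident = (pref_probs > threshold) | (pref_probs < 1 - threshold)
    indices = confident.nonzero(as_tuple=True)[0]
    labels = (pref_probs[indices] > 0.5).long()   # 1 if A preferred, else 0
    return indices, labels

# Example: only the first and last pairs pass the confidence filter
idx, y = pseudo_label_pairs(torch.tensor([0.99, 0.60, 0.50, 0.02]))
```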
- Learning One Representation to Optimize All Rewards [19.636676744015197]
We introduce the forward-backward (FB) representation of the dynamics of a reward-free Markov decision process.
It provides explicit near-optimal policies for any reward specified a posteriori.
This is a step towards learning controllable agents in arbitrary black-box environments.
arXiv Detail & Related papers (2021-03-14T15:00:08Z)
- Inverse Reinforcement Learning via Matching of Optimality Profiles [2.561053769852449]
We propose an algorithm that learns a reward function from demonstrations of suboptimal or heterogeneous performance.
We show that our method is capable of learning reward functions such that policies trained to optimize them outperform the demonstrations used for fitting the reward functions.
arXiv Detail & Related papers (2020-11-18T13:23:43Z)
- Latent Representation Prediction Networks [0.0]
We find this principle of learning representations unsatisfying.
We propose a new way of jointly learning this representation along with the prediction function.
Our approach is shown to be more sample-efficient than standard reinforcement learning methods.
arXiv Detail & Related papers (2020-09-20T14:26:03Z)
- Maximizing Information Gain in Partially Observable Environments via Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
arXiv Detail & Related papers (2020-05-11T08:13:49Z)