Semi-supervised reward learning for offline reinforcement learning
- URL: http://arxiv.org/abs/2012.06899v1
- Date: Sat, 12 Dec 2020 20:06:15 GMT
- Title: Semi-supervised reward learning for offline reinforcement learning
- Authors: Ksenia Konyushkova, Konrad Zolna, Yusuf Aytar, Alexander Novikov,
Scott Reed, Serkan Cabi, Nando de Freitas
- Abstract summary: Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious.
We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data.
In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
- Score: 71.6909757718301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In offline reinforcement learning (RL), agents are trained using a
logged dataset. This appears to be the most natural route to tackle real-life
applications, because in domains such as healthcare and robotics, interactions
with the environment are either expensive or unethical. Training agents usually
requires reward functions, but unfortunately, rewards are seldom available in
practice and their engineering is challenging and laborious. To overcome this,
we investigate reward learning under the constraint of minimizing human reward
annotations. We consider two types of supervision: timestep annotations and
demonstrations. We propose semi-supervised learning algorithms that learn from
limited annotations and incorporate unlabelled data. In our experiments with a
simulated robotic arm, we greatly improve upon behavioural cloning and closely
approach the performance achieved with ground truth rewards. We further
investigate the relationship between the quality of the reward model and the
final policies. We notice, for example, that the reward models do not need to
be perfect to result in useful policies.
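To make the approach concrete, below is a minimal sketch of one semi-supervised
recipe consistent with the abstract: fit a reward model on the few annotated
timesteps, pseudo-label the unlabelled transitions with its confident
predictions, and refit on the combined set. This is an illustration under our
own assumptions (the network, the binary 0/1 annotations, and the confidence
threshold are not taken from the paper), not the authors' implementation.

# Minimal semi-supervised reward-learning sketch (illustrative only).
# Assumes binary timestep annotations (reward 0/1) on a small labelled set
# and a large pool of unlabelled observations, as described in the abstract.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Small MLP mapping an observation to a predicted reward logit."""
    def __init__(self, obs_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)  # logits; apply sigmoid for reward

def train_reward_model(model, obs, labels, epochs=50, lr=1e-3):
    """Supervised phase: fit the reward model on annotated timesteps."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(obs), labels)
        loss.backward()
        opt.step()
    return model

def pseudo_label(model, unlabelled_obs, threshold=0.95):
    """Self-training phase: keep only confident predictions as pseudo-labels."""
    with torch.no_grad():
        p = torch.sigmoid(model(unlabelled_obs))
    confident = (p > threshold) | (p < 1.0 - threshold)
    return unlabelled_obs[confident], (p[confident] > 0.5).float()

# Toy usage with random data standing in for a logged robot dataset.
obs_dim = 16
labelled_obs = torch.randn(64, obs_dim)       # few annotated timesteps
labels = (labelled_obs[:, 0] > 0).float()     # stand-in annotations
unlabelled_obs = torch.randn(4096, obs_dim)   # large unlabelled pool

model = RewardNet(obs_dim)
model = train_reward_model(model, labelled_obs, labels)
extra_obs, extra_labels = pseudo_label(model, unlabelled_obs)
model = train_reward_model(
    model,
    torch.cat([labelled_obs, extra_obs]),
    torch.cat([labels, extra_labels]),
)
# The refitted model can then relabel the whole logged dataset so that a
# standard offline RL algorithm can be run on top of the imputed rewards.

As the abstract notes, the learned reward model does not have to be perfect: it
only needs to relabel the logged dataset well enough for the downstream offline
RL step to recover a useful policy.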
Related papers
- Offline Reinforcement Learning with Imputed Rewards [8.856568375969848]
We propose a Reward Model that can estimate the reward signal from a very limited sample of environment transitions annotated with rewards.
Our results show that, using only 1% of reward-labeled transitions from the original datasets, our learned reward model is able to impute rewards for the remaining 99% of the transitions.
arXiv Detail & Related papers (2024-07-15T15:53:13Z)
- Accelerating Exploration with Unlabeled Prior Data [66.43995032226466]
We study how prior data without reward labels may be used to guide and accelerate exploration for an agent solving a new sparse reward task.
We propose a simple approach that learns a reward model from online experience, labels the unlabeled prior data with optimistic rewards, and then uses it concurrently alongside the online data for downstream policy and critic optimization.
arXiv Detail & Related papers (2023-11-09T00:05:17Z)
- Reward Shaping for Happier Autonomous Cyber Security Agents [0.276240219662896]
One of the most promising directions uses deep reinforcement learning to train autonomous agents in computer network defense tasks.
This work studies the impact of the reward signal that is provided to the agents when training for this task.
arXiv Detail & Related papers (2023-10-20T15:04:42Z)
- Handling Sparse Rewards in Reinforcement Learning Using Model Predictive Control [9.118706387430883]
Reinforcement learning (RL) has recently achieved great success in various domains.
Yet, the design of the reward function requires detailed domain expertise and tedious fine-tuning to ensure that agents are able to learn the desired behaviour.
We propose to use model predictive control (MPC) as an experience source for training RL agents in sparse reward environments.
arXiv Detail & Related papers (2022-10-04T11:06:38Z)
- Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward that measures novelty based on the learned reward (see the sketch after this list).
Our experiments show that an exploration bonus derived from uncertainty in the learned reward improves both the feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
- PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning.
We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
arXiv Detail & Related papers (2021-06-09T14:10:50Z)
- PsiPhi-Learning: Reinforcement Learning with Demonstrations using Successor Features and Inverse Temporal Difference Learning [102.36450942613091]
We propose an inverse reinforcement learning algorithm, called inverse temporal difference learning (ITD).
We show how to seamlessly integrate ITD with learning from online environment interactions, arriving at a novel algorithm for reinforcement learning with demonstrations, called $\Psi\Phi$-learning.
arXiv Detail & Related papers (2021-02-24T21:12:09Z)
- Meta-Reinforcement Learning for Robotic Industrial Insertion Tasks [70.56451186797436]
We study how to use meta-reinforcement learning to solve the bulk of the problem in simulation.
We demonstrate our approach by training an agent to successfully perform challenging real-world insertion tasks.
arXiv Detail & Related papers (2020-04-29T18:00:22Z)
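For the reward-uncertainty exploration entry above, the sketch below
illustrates the general idea of a disagreement-based bonus: an ensemble of
reward heads is queried on each observation, and the standard deviation of
their predictions is added to the learned reward as an intrinsic bonus. The
ensemble size, network shapes, and bonus scale are assumptions made for
illustration, not details taken from that paper.

# Illustrative sketch of a disagreement-based exploration bonus, in the
# spirit of the "Reward Uncertainty for Exploration" entry above.
import torch
import torch.nn as nn

class RewardEnsemble(nn.Module):
    """An ensemble of small reward heads; their disagreement measures novelty."""
    def __init__(self, obs_dim: int, n_heads: int = 5, hidden: int = 64):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
            )
            for _ in range(n_heads)
        ])

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Stack per-head predictions into shape (n_heads, batch).
        return torch.stack([h(obs).squeeze(-1) for h in self.heads])

def intrinsic_bonus(ensemble: RewardEnsemble, obs: torch.Tensor,
                    scale: float = 0.1) -> torch.Tensor:
    """Exploration bonus = scaled standard deviation across reward heads."""
    with torch.no_grad():
        preds = ensemble(obs)
    return scale * preds.std(dim=0)

# Toy usage: states the ensemble disagrees on receive a larger bonus, which
# is added to the learned reward during policy optimisation.
ensemble = RewardEnsemble(obs_dim=16)
bonus = intrinsic_bonus(ensemble, torch.randn(8, 16))
print(bonus.shape)  # torch.Size([8])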
This list is automatically generated from the titles and abstracts of the papers on this site.