Batch Reinforcement Learning from Crowds
- URL: http://arxiv.org/abs/2111.04279v1
- Date: Mon, 8 Nov 2021 05:46:33 GMT
- Title: Batch Reinforcement Learning from Crowds
- Authors: Guoxi Zhang, Hisashi Kashima
- Abstract summary: A shortcoming of batch reinforcement learning is its requirement for rewards in the data.
Existing settings that forgo rewards, such as behavioral cloning, rely on optimal demonstrations collected from humans.
This paper addresses the lack of rewards in a batch reinforcement learning setting by learning a reward function from preferences.
- Score: 24.717084423091865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A shortcoming of batch reinforcement learning is its requirement for rewards in the data, which makes it inapplicable to tasks without reward functions. Existing settings that forgo rewards, such as behavioral cloning, rely on optimal demonstrations collected from humans. Unfortunately, ensuring optimality requires extensive expertise, which hinders the acquisition of large-scale data for complex tasks. This paper addresses the lack of rewards in the batch reinforcement learning setting by learning a reward function from preferences. Generating preferences only requires a basic understanding of a task, and because it is a mental process, it is faster than performing demonstrations. Preferences can therefore be collected at scale from non-expert humans using crowdsourcing. This paper tackles a critical challenge that arises when collecting data from non-expert humans: the noise in preferences. A novel probabilistic model is proposed for modelling the reliability of labels, which utilizes labels collaboratively. Moreover, the proposed model smooths the estimation with a learned reward function. Evaluation on Atari datasets demonstrates the effectiveness of the proposed model, followed by an ablation study that analyzes the relative importance of the proposed ideas.
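Below is a minimal sketch of the general recipe the abstract describes: fitting a reward function to pairwise preferences while weighting each crowd annotator by an estimated reliability. It assumes a Bradley-Terry-style preference likelihood and a simple mixture noise model; `RewardNet`, `reliability`, and the segment tensors are illustrative names, not the paper's exact model or notation.

```python
# Minimal sketch (not the paper's exact model): learn a reward function from
# noisy crowdsourced preferences, weighting each annotator by an estimated
# reliability. Assumes a Bradley-Terry-style preference likelihood.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps a state feature vector to a scalar reward."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, steps, obs_dim) -> summed reward per segment
        return self.net(segment).squeeze(-1).sum(dim=-1)

def preference_loss(reward_net, seg_a, seg_b, labels, annotator_ids, reliability):
    """Negative log-likelihood of preferences, down-weighting unreliable annotators.

    labels[i] = 1 if the annotator preferred segment A, else 0.
    reliability is a learnable per-annotator logit; sigmoid(reliability) in (0, 1)
    acts as the probability that the annotator labels consistently with the reward.
    """
    r_a, r_b = reward_net(seg_a), reward_net(seg_b)
    p_a = torch.sigmoid(r_a - r_b)                  # Bradley-Terry preference prob.
    w = torch.sigmoid(reliability[annotator_ids])   # per-annotator reliability
    # Mixture of a reward-consistent labeller and uniform noise.
    p_label = w * torch.where(labels.bool(), p_a, 1 - p_a) + (1 - w) * 0.5
    return -torch.log(p_label + 1e-8).mean()

# Usage sketch: jointly optimize reward parameters and reliabilities.
obs_dim, n_annotators, batch, steps = 16, 50, 32, 25
reward_net = RewardNet(obs_dim)
reliability = nn.Parameter(torch.zeros(n_annotators))
opt = torch.optim.Adam(list(reward_net.parameters()) + [reliability], lr=3e-4)

seg_a = torch.randn(batch, steps, obs_dim)
seg_b = torch.randn(batch, steps, obs_dim)
labels = torch.randint(0, 2, (batch,))
annotator_ids = torch.randint(0, n_annotators, (batch,))

loss = preference_loss(reward_net, seg_a, seg_b, labels, annotator_ids, reliability)
opt.zero_grad(); loss.backward(); opt.step()
```

The design choice illustrated is that reliabilities are learned jointly with the reward, so labels from different annotators inform one another through the shared reward function.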
Related papers
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
- XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners [71.8257151788923]
We propose a novel Explainable Active Learning framework (XAL) for low-resource text classification.
XAL encourages classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations.
Experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines.
arXiv Detail & Related papers (2023-10-09T08:07:04Z)
- Models of human preference for learning reward functions [80.39289349661364]
We learn the reward function from human-generated preferences between pairs of trajectory segments.
We find the common assumption that preferences are determined by each segment's partial return to be flawed and propose modeling human preferences as informed by each segment's regret.
Our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned.
arXiv Detail & Related papers (2022-06-05T17:58:02Z)
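As a hedged illustration of the contrast drawn in the entry above, with notation assumed here rather than taken from the paper: both models score trajectory segments and pass the scores through a Bradley-Terry-style choice rule; they differ only in the segment score.

```latex
% Bradley-Terry-style preference probability with a generic segment score f(\sigma):
P\big(\sigma^1 \succ \sigma^2\big) \;=\;
  \frac{\exp f(\sigma^1)}{\exp f(\sigma^1) + \exp f(\sigma^2)}

% Partial-return score (the assumption argued to be flawed):
f_{\text{return}}(\sigma) \;=\; \sum_{t} r(s_t, a_t)

% Regret-based score proposed instead (segments closer to optimal behaviour score higher):
f_{\text{regret}}(\sigma) \;=\; -\operatorname{regret}(\sigma \mid r)
```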
- Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method designed specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward that measures novelty based on the learned reward.
Our experiments show that an exploration bonus derived from uncertainty in the learned reward improves both the feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
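A hedged sketch of the exploration idea in the entry above, assuming (as is common in this line of work, though not stated in the summary) that reward uncertainty is estimated via disagreement across an ensemble of learned reward models; class and parameter names are illustrative.

```python
# Minimal sketch: intrinsic exploration bonus from uncertainty in a learned reward.
# Uncertainty is approximated by the standard deviation across an ensemble of
# reward models; this is an assumed instantiation, not the paper's exact recipe.
import torch
import torch.nn as nn

def make_reward_model(obs_dim: int, act_dim: int, hidden: int = 64) -> nn.Module:
    return nn.Sequential(
        nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

class EnsembleRewardBonus:
    """Combines the mean learned reward with an uncertainty-based bonus."""
    def __init__(self, obs_dim: int, act_dim: int, n_models: int = 3, beta: float = 0.1):
        self.models = [make_reward_model(obs_dim, act_dim) for _ in range(n_models)]
        self.beta = beta  # weight of the exploration bonus

    def reward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        x = torch.cat([obs, act], dim=-1)
        preds = torch.stack([m(x).squeeze(-1) for m in self.models])  # (n_models, batch)
        extrinsic = preds.mean(dim=0)   # consensus reward estimate
        intrinsic = preds.std(dim=0)    # disagreement as an uncertainty bonus
        return extrinsic + self.beta * intrinsic

# Usage: this combined reward would drive policy optimization during exploration.
bonus = EnsembleRewardBonus(obs_dim=8, act_dim=2)
obs, act = torch.randn(32, 8), torch.randn(32, 2)
r = bonus.reward(obs, act)  # shape: (32,)
```

In practice the bonus weight would typically be annealed as the learned reward becomes more reliable.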
- SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor.
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
arXiv Detail & Related papers (2022-03-18T16:50:38Z)
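A hedged sketch of the confidence-based pseudo-labelling step described in the entry above; the threshold, the toy reward model, and all names are assumptions, and the data-augmentation component is omitted.

```python
# Minimal sketch: assign pseudo-labels to unlabeled segment pairs only when the
# current preference predictor is sufficiently confident. The 0.95 threshold and
# all names are illustrative assumptions, not taken from the paper.
import torch

def pseudo_label_pairs(reward_net, seg_a, seg_b, threshold: float = 0.95):
    """Returns (kept_indices, pseudo_labels) for confident unlabeled pairs.

    reward_net maps a (batch, steps, obs_dim) segment tensor to a (batch,)
    tensor of summed rewards; confidence is the Bradley-Terry preference
    probability implied by the current reward model.
    """
    with torch.no_grad():
        p_a = torch.sigmoid(reward_net(seg_a) - reward_net(seg_b))  # P(A preferred)
    confidence = torch.maximum(p_a, 1 - p_a)
    keep = confidence >= threshold
    pseudo_labels = (p_a >= 0.5).long()   # 1 = prefer A, 0 = prefer B
    return keep.nonzero(as_tuple=True)[0], pseudo_labels[keep]

# Usage sketch with a toy linear reward model over segments; confident
# pseudo-labeled pairs would be added to the preference dataset for training.
obs_dim, steps, batch = 16, 25, 64
w = torch.randn(obs_dim)
toy_reward = lambda seg: (seg @ w).sum(dim=-1)   # (batch, steps, obs_dim) -> (batch,)
seg_a, seg_b = torch.randn(batch, steps, obs_dim), torch.randn(batch, steps, obs_dim)
idx, labels = pseudo_label_pairs(toy_reward, seg_a, seg_b)
```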
- Uncertainty Estimation for Language Reward Models [5.33024001730262]
Language models can learn a range of capabilities from unsupervised training on text corpora.
It is often easier for humans to choose between options than to provide labeled data, and prior work has achieved state-of-the-art performance by training a reward model from such preference comparisons.
We seek to address these problems via uncertainty estimation, which can improve sample efficiency and robustness using active learning and risk-averse reinforcement learning.
arXiv Detail & Related papers (2022-03-14T20:13:21Z)
- RewardsOfSum: Exploring Reinforcement Learning Rewards for Summarisation [7.0471949371778795]
We propose two reward functions for the task of abstractive summarisation.
The first function, referred to as RwB-Hinge, dynamically selects the samples for the gradient update.
The second function, nicknamed RISK, leverages a small pool of strong candidates to inform the reward.
arXiv Detail & Related papers (2021-06-08T03:30:50Z)
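As a rough illustration of the sample-selection idea behind RwB-Hinge as summarised above: a REINFORCE-with-baseline update in which a hinge gate zeroes the contribution of samples that do not improve on a baseline decode. The gating rule, the generic reward (e.g. ROUGE), and all names are assumptions rather than the paper's exact formulation.

```python
# Hedged sketch: hinge-gated REINFORCE-with-baseline loss for summarisation RL.
# The selection rule (keep only samples whose reward beats the baseline decode)
# and the generic reward are illustrative assumptions.
import torch

def hinge_gated_reinforce_loss(log_probs, sampled_rewards, baseline_rewards):
    """log_probs: (batch,) summed token log-probabilities of sampled summaries.
    sampled_rewards / baseline_rewards: (batch,) rewards (e.g. ROUGE) of the
    sampled summaries and of a baseline (e.g. greedy) decode for the same inputs.
    """
    advantage = sampled_rewards - baseline_rewards
    gated = torch.clamp(advantage, min=0.0)   # hinge: drop non-improving samples
    return -(gated.detach() * log_probs).mean()

# Usage sketch with dummy values.
log_probs = torch.randn(8, requires_grad=True)
loss = hinge_gated_reinforce_loss(log_probs, torch.rand(8), torch.rand(8))
loss.backward()
```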
- Active Preference-Based Gaussian Process Regression for Reward Learning [42.697198807877925]
One common approach is to learn reward functions from collected expert demonstrations.
We present a preference-based learning approach in which, as an alternative, human feedback is given only in the form of comparisons between trajectories.
Our approach enables us to tackle both inflexibility and data-inefficiency problems within a preference-based learning framework.
arXiv Detail & Related papers (2020-05-06T03:29:27Z)
- Improving Multi-Turn Response Selection Models with Complementary Last-Utterance Selection by Instance Weighting [84.9716460244444]
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
We conduct extensive experiments on two public datasets and obtain significant improvements on both.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.