Reinforcement Learning with Feedback from Multiple Humans with Diverse
Skills
- URL: http://arxiv.org/abs/2111.08596v1
- Date: Tue, 16 Nov 2021 16:19:19 GMT
- Title: Reinforcement Learning with Feedback from Multiple Humans with Diverse
Skills
- Authors: Taku Yamagata, Ryan McConville and Raul Santos-Rodriguez (Department
of Engineering Mathematics, University of Bristol)
- Abstract summary: A promising approach to improve the robustness and exploration in Reinforcement Learning is collecting human feedback.
It is, however, often too expensive to obtain enough feedback of good quality.
We aim to rely on a group of multiple experts with different skill levels to generate enough feedback.
- Score: 1.433758865948252
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A promising approach to improving robustness and exploration in
Reinforcement Learning is to collect human feedback and thereby incorporate
prior knowledge of the target environment. It is, however, often too expensive
to obtain enough feedback of good quality. To mitigate the issue, we aim to
rely on a group of multiple experts (and non-experts) with different skill
levels to generate enough feedback. Such feedback can therefore be inconsistent
and infrequent. In this paper, we build upon prior work -- Advise, a Bayesian
approach attempting to maximise the information gained from human feedback --
extending the algorithm to accept feedback from this larger group of humans,
the trainers, while also estimating each trainer's reliability. We show how
aggregating feedback from multiple trainers improves the total feedback's
accuracy and makes the collection process easier in two ways. Firstly, this
approach addresses the case of some of the trainers being adversarial.
Secondly, having access to information about each trainer's reliability
provides a second layer of robustness and offers valuable information for
people managing the whole system to improve the overall trust in the system. It
offers an actionable tool for improving the feedback collection process or
modifying the reward function design if needed. We empirically show that our
approach can accurately learn the reliability of each trainer and use it to
maximise the information gained from the multiple trainers' feedback,
even if some of the sources are adversarial.
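As a concrete illustration of the approach described above, here is a minimal sketch, not the paper's exact algorithm: it assumes Advise's standard policy-shaping term C^Δ / (C^Δ + (1 - C)^Δ), keeps a per-trainer feedback delta Δ for each state-action pair, and estimates each trainer's reliability C with a Beta posterior whose update treats agreement with the agent's current greedy action as a proxy label for feedback correctness (an assumption of this sketch). The class name MultiTrainerAdvise and all method names are hypothetical.
```python
from collections import defaultdict

import numpy as np


class MultiTrainerAdvise:
    """Illustrative Advise-style policy shaping with multiple trainers.

    For each trainer t, delta[t][(s, a)] stores (#positive - #negative)
    feedback for state-action pair (s, a), and a Beta(alpha, beta)
    posterior tracks the trainer's estimated reliability C_t, i.e. the
    probability that the trainer's feedback is consistent with the
    optimal action.
    """

    def __init__(self, n_actions, n_trainers, prior=(1.0, 1.0)):
        self.n_actions = n_actions
        self.delta = [defaultdict(float) for _ in range(n_trainers)]
        self.alpha = np.full(n_trainers, prior[0])
        self.beta = np.full(n_trainers, prior[1])

    def record_feedback(self, trainer, state, action, positive):
        """Store one binary feedback signal from a trainer."""
        self.delta[trainer][(state, action)] += 1.0 if positive else -1.0

    def update_reliability(self, trainer, positive, action_is_greedy):
        """Update the trainer's reliability estimate.

        Feedback is treated as 'correct' when it agrees with the agent's
        current greedy action (positive feedback on the greedy action, or
        negative feedback on a non-greedy action). Using the greedy action
        as a proxy label is an assumption of this sketch.
        """
        if positive == action_is_greedy:
            self.alpha[trainer] += 1.0
        else:
            self.beta[trainer] += 1.0

    def reliability(self, trainer):
        """Posterior mean of the trainer's reliability C_t."""
        return self.alpha[trainer] / (self.alpha[trainer] + self.beta[trainer])

    def feedback_policy(self, state):
        """P(a is optimal | feedback), combined across all trainers."""
        probs = np.ones(self.n_actions)
        for t, deltas in enumerate(self.delta):
            c = float(np.clip(self.reliability(t), 1e-3, 1.0 - 1e-3))
            for a in range(self.n_actions):
                d = deltas[(state, a)]
                # Advise term: C^d / (C^d + (1 - C)^d); d = 0 gives 0.5.
                probs[a] *= c ** d / (c ** d + (1.0 - c) ** d)
        return probs / probs.sum()

    def shaped_policy(self, state, rl_policy):
        """Fuse the agent's own policy with the aggregated feedback."""
        fused = np.asarray(rl_policy) * self.feedback_policy(state)
        return fused / fused.sum()
```
With this formulation, a trainer whose estimated reliability drifts towards 0.5 contributes nearly uniform terms and stops influencing the shaped policy, while one whose estimate falls well below 0.5 effectively has its feedback read as inverted evidence, which is one way a reliability estimate can guard against adversarial input.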
Related papers
- CANDERE-COACH: Reinforcement Learning from Noisy Feedback [12.232688822099325]
The CANDERE-COACH algorithm is capable of learning from noisy feedback given by a nonoptimal teacher.
We propose a noise-filtering mechanism to de-noise online feedback data, thereby enabling the RL agent to successfully learn with up to 40% of the teacher feedback being incorrect.
arXiv Detail & Related papers (2024-09-23T20:14:12Z) - ExpertAF: Expert Actionable Feedback from Video [81.46431188306397]
We introduce a novel method to generate actionable feedback from video of a person doing a physical activity.
Our method takes a video demonstration and its accompanying 3D body pose and generates expert commentary.
Our method is able to reason across multi-modal input combinations to output full-spectrum, actionable coaching.
arXiv Detail & Related papers (2024-08-01T16:13:07Z) - Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs [57.16442740983528]
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback.
The role of user feedback in annotators' assessment of turns in a conversation has been little studied.
We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of a turn being evaluated.
arXiv Detail & Related papers (2024-04-19T16:45:50Z) - Improving the Validity of Automatically Generated Feedback via
Reinforcement Learning [50.067342343957876]
We propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL).
Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO).
arXiv Detail & Related papers (2024-03-02T20:25:50Z) - Continually Improving Extractive QA via Human Feedback [59.49549491725224]
We study continually improving an extractive question answering (QA) system via human user feedback.
We conduct experiments involving thousands of user interactions under diverse setups to broaden the understanding of learning from feedback over time.
arXiv Detail & Related papers (2023-05-21T14:35:32Z) - Continual Learning for Instruction Following from Realtime Feedback [23.078048024461264]
We propose and deploy an approach to continually train an instruction-following agent from feedback provided by users during collaborative interactions.
During interaction, human users instruct an agent using natural language, and provide realtime binary feedback as they observe the agent following their instructions.
We design a contextual bandit learning approach, converting user feedback to immediate reward.
We evaluate through thousands of human-agent interactions, demonstrating 15.4% absolute improvement in instruction execution accuracy over time.
arXiv Detail & Related papers (2022-12-19T18:39:43Z) - Multi-trainer Interactive Reinforcement Learning System [7.3072544716528345]
We propose a more effective interactive reinforcement learning system by introducing multiple trainers.
In particular, our trainer feedback aggregation experiments show that our aggregation method has the best accuracy.
Finally, we conduct a grid-world experiment to show that the policy trained by MTIRL with the review model is closer to the optimal policy than the one trained without a review model.
arXiv Detail & Related papers (2022-10-14T18:32:59Z) - PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via
Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning.
We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
arXiv Detail & Related papers (2021-06-09T14:10:50Z) - Partial Bandit and Semi-Bandit: Making the Most Out of Scarce Users'
Feedback [62.997667081978825]
We present a novel approach for considering user feedback and evaluate it using three distinct strategies.
Despite the limited amount of feedback returned by users (as low as 20% of the total), our approach obtains results similar to those of state-of-the-art approaches.
arXiv Detail & Related papers (2020-09-16T07:32:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.