Convergence of a Human-in-the-Loop Policy-Gradient Algorithm With
Eligibility Trace Under Reward, Policy, and Advantage Feedback
- URL: http://arxiv.org/abs/2109.07054v1
- Date: Wed, 15 Sep 2021 02:29:18 GMT
- Title: Convergence of a Human-in-the-Loop Policy-Gradient Algorithm With
Eligibility Trace Under Reward, Policy, and Advantage Feedback
- Authors: Ishaan Shah, David Halpern, Kavosh Asadi and Michael L. Littman
- Abstract summary: This work analyzes the COnvergent Actor-Critic by Humans (COACH) algorithm under three different types of feedback: policy feedback, reward feedback, and advantage feedback.
We find that COACH can behave sub-optimally for these three feedback types.
We propose a variant of COACH, episodic COACH (E-COACH), which we prove converges for all three types.
- Score: 20.089829229666908
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fluid human-agent communication is essential for the future of
human-in-the-loop reinforcement learning. An agent must respond appropriately
to feedback from its human trainer even before they have significant experience
working together. Therefore, it is important that learning agents respond well
to various feedback schemes human trainers are likely to provide. This work
analyzes the COnvergent Actor-Critic by Humans (COACH) algorithm under three
different types of feedback: policy feedback, reward feedback, and advantage
feedback. For these three feedback types, we find that COACH can behave
sub-optimally. We propose a variant of COACH, episodic COACH (E-COACH), which
we prove converges for all three types. We compare our COACH variant with two
other reinforcement-learning algorithms: Q-learning and TAMER.
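As a concrete illustration of the update family the abstract analyzes, here is a minimal sketch of a COACH-style policy-gradient step with an eligibility trace for a tabular softmax policy. The state and action counts, step size, trace decay, and function names are illustrative assumptions, not the authors' implementation of COACH or E-COACH.

```python
import numpy as np

N_STATES, N_ACTIONS = 10, 4
ALPHA, LAMBDA = 0.05, 0.9   # step size and trace decay (assumed values)

theta = np.zeros((N_STATES, N_ACTIONS))  # softmax policy parameters
trace = np.zeros_like(theta)             # eligibility trace e_t

def action_probs(state):
    """pi(. | state) under a tabular softmax policy."""
    prefs = theta[state] - theta[state].max()  # stabilized logits
    e = np.exp(prefs)
    return e / e.sum()

def coach_step(state, action, feedback):
    """One COACH-style update: decay the trace, accumulate
    grad log pi(action | state), then move theta along the trace
    scaled by the scalar human feedback (read as an advantage)."""
    global theta, trace
    probs = action_probs(state)
    grad_log_pi = np.zeros_like(theta)
    grad_log_pi[state] = -probs              # d log-softmax / d theta
    grad_log_pi[state, action] += 1.0
    trace = LAMBDA * trace + grad_log_pi
    theta = theta + ALPHA * feedback * trace
```

Per the abstract, E-COACH is an episodic variant of this scheme whose convergence is proved for policy, reward, and advantage feedback; the exact episodic modification (for instance, how the trace is handled across episode boundaries) is specified in the paper itself.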
Related papers
- Robustifying a Policy in Multi-Agent RL with Diverse Cooperative Behaviors and Adversarial Style Sampling for Assistive Tasks [51.00472376469131]
We propose a framework that learns a robust caregiver's policy by training it for diverse care-receiver responses.
We demonstrate that policies trained with a popular deep RL method are vulnerable to changes in policies of other agents.
arXiv Detail & Related papers (2024-03-01T08:15:18Z)
- Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism [91.52263068880484]
We study offline Reinforcement Learning with Human Feedback (RLHF).
We aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices; a minimal sketch of this choice-based reward-learning step appears after this list.
RLHF is challenging for multiple reasons: a large state space but limited human feedback, the bounded rationality of human decisions, and off-policy distribution shift.
arXiv Detail & Related papers (2023-05-29T01:18:39Z)
- AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback [90.22885814577134]
Large language models (LLMs) have seen widespread adoption due to their strong instruction-following abilities.
We develop a simulator that enables research and development for learning from feedback at a low cost.
We train and evaluate eleven models on 10k pairs of real human feedback and show that rankings of models trained in AlpacaFarm match rankings of models trained on human data.
arXiv Detail & Related papers (2023-05-22T17:55:50Z)
- Multi-trainer Interactive Reinforcement Learning System [7.3072544716528345]
We propose a more effective interactive reinforcement learning system by introducing multiple trainers.
In particular, our trainer feedback aggregation experiments show that our aggregation method has the best accuracy.
Finally, we conduct a grid-world experiment to show that the policy trained by multi-trainer interactive RL (MTIRL) with the review model is closer to the optimal policy than the one trained without a review model.
arXiv Detail & Related papers (2022-10-14T18:32:59Z)
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback [8.409764908043396]
We apply preference modeling and reinforcement learning from human feedback to finetune language models to act as helpful assistants.
We find this alignment training improves performance on almost all NLP evaluations.
We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data.
arXiv Detail & Related papers (2022-04-12T15:02:38Z)
- Reinforcement Learning with Feedback from Multiple Humans with Diverse Skills [1.433758865948252]
A promising approach to improve the robustness and exploration in Reinforcement Learning is collecting human feedback.
It is, however, often too expensive to obtain enough feedback of good quality.
We aim to rely on a group of multiple experts with different skill levels to generate enough feedback.
arXiv Detail & Related papers (2021-11-16T16:19:19Z)
- PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning.
We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
arXiv Detail & Related papers (2021-06-09T14:10:50Z)
- Robust Deep Reinforcement Learning through Adversarial Loss [74.20501663956604]
Recent studies have shown that deep reinforcement learning agents are vulnerable to small adversarial perturbations on the agent's inputs.
We propose RADIAL-RL, a principled framework to train reinforcement learning agents with improved robustness against adversarial attacks.
arXiv Detail & Related papers (2020-08-05T07:49:42Z)
- Accelerating Reinforcement Learning Agent with EEG-based Implicit Human Feedback [10.138798960466222]
Reinforcement Learning (RL) agents with human feedback can dramatically improve various aspects of learning.
Previous methods require a human observer to give inputs explicitly, burdening the human in the loop of the RL agent's learning process.
We investigate capturing the human's intrinsic reactions as implicit (and natural) feedback through EEG, in the form of error-related potentials (ErrP).
arXiv Detail & Related papers (2020-06-30T03:13:37Z)
- Facial Feedback for Reinforcement Learning: A Case Study and Offline Analysis Using the TAMER Framework [51.237191651923666]
We investigate the potential of agent learning from trainers' facial expressions via interpreting them as evaluative feedback.
With a designed CNN-RNN model, our analysis shows that telling trainers to use facial expressions, together with competition, can improve the accuracy of estimating positive and negative feedback.
Our results with a simulation experiment show that learning solely from predicted feedback based on facial expressions is possible.
arXiv Detail & Related papers (2020-01-23T17:50:57Z)
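Two entries above, the dynamic-choices RLHF paper and the helpful-and-harmless assistant paper, rest on fitting a reward model to human choices between alternatives. Below is a hedged sketch of a standard Bradley-Terry-style fitting step under a linear reward model; the feature dimension, learning rate, and toy preference data are assumptions, and the pessimism penalty of the first paper is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8             # feature dimension (assumed)
w = np.zeros(D)   # linear reward-model parameters

def reward(phi):
    """Scalar reward of a trajectory summarized by features phi."""
    return phi @ w

def bt_step(phi_chosen, phi_rejected, lr=0.1):
    """One gradient-ascent step on the Bradley-Terry log-likelihood
    log sigma(r(chosen) - r(rejected))."""
    global w
    margin = reward(phi_chosen) - reward(phi_rejected)
    p = 1.0 / (1.0 + np.exp(-margin))  # P(human prefers the chosen one)
    w += lr * (1.0 - p) * (phi_chosen - phi_rejected)

# Toy usage: pretend humans prefer trajectories with a larger first feature.
for _ in range(200):
    a, b = rng.normal(size=D), rng.normal(size=D)
    chosen, rejected = (a, b) if a[0] > b[0] else (b, a)
    bt_step(chosen, rejected)
```

After fitting, the learned reward would drive policy optimization; offline settings such as the pessimism paper additionally penalize uncertainty, which this sketch does not model.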