Convergence of a Human-in-the-Loop Policy-Gradient Algorithm With
Eligibility Trace Under Reward, Policy, and Advantage Feedback
- URL: http://arxiv.org/abs/2109.07054v1
- Date: Wed, 15 Sep 2021 02:29:18 GMT
- Title: Convergence of a Human-in-the-Loop Policy-Gradient Algorithm With
Eligibility Trace Under Reward, Policy, and Advantage Feedback
- Authors: Ishaan Shah, David Halpern, Kavosh Asadi and Michael L. Littman
- Abstract summary: This work analyzes the COnvergent Actor-Critic by Humans (COACH) algorithm under three different types of feedback: policy feedback, reward feedback, and advantage feedback.
We find that COACH can behave sub-optimally for these three feedback types.
We propose a variant of COACH, episodic COACH (E-COACH), which we prove converges for all three types.
- Score: 20.089829229666908
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fluid human-agent communication is essential for the future of
human-in-the-loop reinforcement learning. An agent must respond appropriately
to feedback from its human trainer even before they have significant experience
working together. Therefore, it is important that learning agents respond well
to various feedback schemes human trainers are likely to provide. This work
analyzes the COnvergent Actor-Critic by Humans (COACH) algorithm under three
different types of feedback: policy feedback, reward feedback, and advantage
feedback. For these three feedback types, we find that COACH can behave
sub-optimally. We propose a variant of COACH, episodic COACH (E-COACH), which
we prove converges for all three types. We compare our COACH variant with two
other reinforcement-learning algorithms: Q-learning and TAMER.
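To make the analyzed update concrete, the sketch below (Python) shows a COACH-style policy-gradient step with an eligibility trace, where the scalar human feedback takes the place of the critic's advantage estimate. It is a minimal illustration under stated assumptions (a tabular softmax policy and illustrative names such as CoachStyleActor), not the paper's exact E-COACH procedure; whether the feedback passed in is a reward, a policy signal, or an advantage is left to the simulated trainer.

```python
# Minimal sketch of a COACH-style actor update with an eligibility trace.
# The human feedback signal plays the role a critic's advantage estimate
# plays in a standard actor-critic. This is an illustration, not the
# paper's E-COACH pseudocode; the tabular softmax policy and all names
# here are assumptions made for the example.
import numpy as np

class CoachStyleActor:
    def __init__(self, n_states, n_actions, alpha=0.1, lam=0.9):
        self.theta = np.zeros((n_states, n_actions))  # policy parameters
        self.trace = np.zeros_like(self.theta)        # eligibility trace e
        self.alpha = alpha                            # step size
        self.lam = lam                                # trace decay lambda

    def policy(self, s):
        prefs = self.theta[s]
        exps = np.exp(prefs - prefs.max())            # numerically stable softmax
        return exps / exps.sum()

    def act(self, s, rng):
        return rng.choice(self.theta.shape[1], p=self.policy(s))

    def update(self, s, a, feedback):
        # Gradient of log softmax policy w.r.t. theta[s]: one_hot(a) - pi(.|s)
        grad_log_pi = -self.policy(s)
        grad_log_pi[a] += 1.0
        # Decay the trace, then add the new score-function term.
        self.trace *= self.lam
        self.trace[s] += grad_log_pi
        # Scale the whole trace by the (human) feedback signal.
        self.theta += self.alpha * feedback * self.trace

    def end_episode(self):
        self.trace[:] = 0.0                           # reset between episodes
```

Resetting the trace at episode boundaries is meant to echo the episodic flavor of E-COACH described in the abstract; any additional corrections E-COACH applies, such as discounting, are not reproduced here.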
Related papers
- Reinforcement Learning from Multi-level and Episodic Human Feedback [1.9686770963118378]
We propose an algorithm to efficiently learn both the reward function and the optimal policy from multi-level human feedback.
We show that the proposed algorithm achieves sublinear regret and demonstrate its empirical effectiveness through extensive simulations.
arXiv Detail & Related papers (2025-04-20T20:09:19Z) - On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI systems (those that autonomously plan and act) are becoming widespread, yet their success rate on complex tasks remains low.
Inference-time alignment relies on three components: sampling, evaluation, and feedback.
We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly incorporates feedback extracted from different forms of critiques.
arXiv Detail & Related papers (2025-04-02T17:40:47Z) - Mapping out the Space of Human Feedback for Reinforcement Learning: A Conceptual Framework [13.949126295663328]
We bridge the gap between machine learning and human-computer interaction efforts by developing a shared understanding of human feedback in interactive learning scenarios.
We introduce a taxonomy of feedback types for reward-based learning from human feedback based on nine key dimensions.
We identify seven quality metrics of human feedback influencing both the human ability to express feedback and the agent's ability to learn from the feedback.
arXiv Detail & Related papers (2024-11-18T17:40:42Z) - Dual Active Learning for Reinforcement Learning from Human Feedback [13.732678966515781]
Reinforcement learning from human feedback (RLHF) is widely applied to align large language models with human preferences.
Human feedback is costly and time-consuming, making it essential to collect high-quality conversation data for human teachers to label.
In this paper, we use offline reinforcement learning (RL) to formulate the alignment problem.
arXiv Detail & Related papers (2024-10-03T14:09:58Z) - CANDERE-COACH: Reinforcement Learning from Noisy Feedback [12.232688822099325]
The CANDERE-COACH algorithm is capable of learning from noisy feedback provided by a non-optimal teacher.
We propose a noise-filtering mechanism to de-noise online feedback data, thereby enabling the RL agent to successfully learn with up to 40% of the teacher feedback being incorrect.
arXiv Detail & Related papers (2024-09-23T20:14:12Z) - Robustifying a Policy in Multi-Agent RL with Diverse Cooperative Behaviors and Adversarial Style Sampling for Assistive Tasks [51.00472376469131]
We propose a framework that learns a robust caregiver policy by training it against diverse care-receiver responses.
We demonstrate that policies trained with a popular deep RL method are vulnerable to changes in the policies of other agents.
arXiv Detail & Related papers (2024-03-01T08:15:18Z) - Reinforcement Learning with Human Feedback: Learning Dynamic Choices via
Pessimism [91.52263068880484]
We study offline Reinforcement Learning with Human Feedback (RLHF).
We aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices.
RLHF is challenging for multiple reasons: large state space but limited human feedback, the bounded rationality of human decisions, and the off-policy distribution shift.
arXiv Detail & Related papers (2023-05-29T01:18:39Z) - AlpacaFarm: A Simulation Framework for Methods that Learn from Human
Feedback [90.22885814577134]
Large language models (LLMs) have seen widespread adoption due to their strong instruction-following abilities.
We develop a simulator that enables research and development for learning from feedback at a low cost.
We train and evaluate eleven models on 10k pairs of real human feedback and show that rankings of models trained in AlpacaFarm match rankings of models trained on human data.
arXiv Detail & Related papers (2023-05-22T17:55:50Z) - PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via
Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning.
We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
arXiv Detail & Related papers (2021-06-09T14:10:50Z) - Robust Deep Reinforcement Learning through Adversarial Loss [74.20501663956604]
Recent studies have shown that deep reinforcement learning agents are vulnerable to small adversarial perturbations on the agent's inputs.
We propose RADIAL-RL, a principled framework to train reinforcement learning agents with improved robustness against adversarial attacks.
arXiv Detail & Related papers (2020-08-05T07:49:42Z) - Accelerating Reinforcement Learning Agent with EEG-based Implicit Human
Feedback [10.138798960466222]
Reinforcement Learning (RL) agents with human feedback can dramatically improve various aspects of learning.
Previous methods require a human observer to give inputs explicitly, burdening the human in the loop of the RL agent's learning process.
We investigate capturing humans' intrinsic reactions as implicit (and natural) feedback through EEG, in the form of error-related potentials (ErrPs).
arXiv Detail & Related papers (2020-06-30T03:13:37Z) - Facial Feedback for Reinforcement Learning: A Case Study and Offline
Analysis Using the TAMER Framework [51.237191651923666]
We investigate the potential of agents learning from trainers' facial expressions by interpreting them as evaluative feedback.
With a designed CNN-RNN model, our analysis shows that telling trainers to use facial expressions, together with competition, can improve the accuracy of estimating positive and negative feedback.
Our results with a simulation experiment show that learning solely from predicted feedback based on facial expressions is possible.
arXiv Detail & Related papers (2020-01-23T17:50:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.