CANDERE-COACH: Reinforcement Learning from Noisy Feedback
- URL: http://arxiv.org/abs/2409.15521v1
- Date: Mon, 23 Sep 2024 20:14:12 GMT
- Title: CANDERE-COACH: Reinforcement Learning from Noisy Feedback
- Authors: Yuxuan Li, Srijita Das, Matthew E. Taylor,
- Abstract summary: The CANDERE-COACH algorithm is capable of learning from noisy feedback by a nonoptimal teacher.
We propose a noise-filtering mechanism to de-noise online feedback data, thereby enabling the RL agent to successfully learn with up to 40% of the teacher feedback being incorrect.
- Score: 12.232688822099325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent times, Reinforcement learning (RL) has been widely applied to many challenging tasks. However, in order to perform well, it requires access to a good reward function which is often sparse or manually engineered with scope for error. Introducing human prior knowledge is often seen as a possible solution to the above-mentioned problem, such as imitation learning, learning from preference, and inverse reinforcement learning. Learning from feedback is another framework that enables an RL agent to learn from binary evaluative signals describing the teacher's (positive or negative) evaluation of the agent's action. However, these methods often make the assumption that evaluative teacher feedback is perfect, which is a restrictive assumption. In practice, such feedback can be noisy due to limited teacher expertise or other exacerbating factors like cognitive load, availability, distraction, etc. In this work, we propose the CANDERE-COACH algorithm, which is capable of learning from noisy feedback by a nonoptimal teacher. We propose a noise-filtering mechanism to de-noise online feedback data, thereby enabling the RL agent to successfully learn with up to 40% of the teacher feedback being incorrect. Experiments on three common domains demonstrate the effectiveness of the proposed approach.
Related papers
- Navigating Noisy Feedback: Enhancing Reinforcement Learning with Error-Prone Language Models [8.025808955214957]
This paper studies the advantages and limitations of reinforcement learning from large language model feedback.
We propose a simple yet effective method for soliciting and applying feedback as a potential-based shaping function.
arXiv Detail & Related papers (2024-10-22T19:52:08Z) - Improving the Validity of Automatically Generated Feedback via
Reinforcement Learning [50.067342343957876]
We propose a framework for feedback generation that optimize both correctness and alignment using reinforcement learning (RL)
Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO)
arXiv Detail & Related papers (2024-03-02T20:25:50Z) - RLIF: Interactive Imitation Learning as Reinforcement Learning [56.997263135104504]
We show how off-policy reinforcement learning can enable improved performance under assumptions that are similar but potentially even more practical than those of interactive imitation learning.
Our proposed method uses reinforcement learning with user intervention signals themselves as rewards.
This relaxes the assumption that intervening experts in interactive imitation learning should be near-optimal and enables the algorithm to learn behaviors that improve over the potential suboptimal human expert.
arXiv Detail & Related papers (2023-11-21T21:05:21Z) - Active Reward Learning from Multiple Teachers [17.10187575303075]
Reward learning algorithms utilize human feedback to infer a reward function, which is then used to train an AI system.
This human feedback is often a preference comparison, in which the human teacher compares several samples of AI behavior and chooses which they believe best accomplishes the objective.
While reward learning typically assumes that all feedback comes from a single teacher, in practice these systems often query multiple teachers to gather sufficient training data.
arXiv Detail & Related papers (2023-03-02T01:26:53Z) - Reward Uncertainty for Exploration in Preference-based Reinforcement
Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward by measuring the novelty based on learned reward.
Our experiments show that exploration bonus from uncertainty in learned reward improves both feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z) - Methodical Advice Collection and Reuse in Deep Reinforcement Learning [12.840744403432547]
This work considers how to better leverage uncertainties about when a student should ask for advice and if the student can model the teacher to ask for less advice.
Our empirical results show that using dual uncertainties to drive advice collection and reuse may improve learning performance across several Atari games.
arXiv Detail & Related papers (2022-04-14T22:24:55Z) - Learning Robust Recommender from Noisy Implicit Feedback [140.7090392887355]
We propose a new training strategy named Adaptive Denoising Training (ADT)
ADT adaptively prunes the noisy interactions by two paradigms (i.e., Truncated Loss and Reweighted Loss)
We consider extra feedback (e.g., rating) as auxiliary signal and propose three strategies to incorporate extra feedback into ADT.
arXiv Detail & Related papers (2021-12-02T12:12:02Z) - PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via
Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning.
We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
arXiv Detail & Related papers (2021-06-09T14:10:50Z) - Learning Online from Corrective Feedback: A Meta-Algorithm for Robotics [24.863665993509997]
A key challenge in Imitation Learning (IL) is that optimal state actions demonstrations are difficult for the teacher to provide.
As an alternative to state action demonstrations, the teacher can provide corrective feedback such as their preferences or rewards.
We show that our approach can learn quickly from a variety of noisy feedback.
arXiv Detail & Related papers (2021-04-02T12:42:12Z) - DisCor: Corrective Feedback in Reinforcement Learning via Distribution
Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to this optimal distribution and uses it to re-weight the transitions used for training.
arXiv Detail & Related papers (2020-03-16T16:18:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.