Reinforcement Learning from Multi-level and Episodic Human Feedback
- URL: http://arxiv.org/abs/2504.14732v3
- Date: Fri, 25 Apr 2025 21:54:56 GMT
- Title: Reinforcement Learning from Multi-level and Episodic Human Feedback
- Authors: Muhammad Qasim Elahi, Somtochukwu Oguchienti, Maheed H. Ahmed, Mahsa Ghasemi
- Abstract summary: We propose an algorithm to efficiently learn both the reward function and the optimal policy from multi-level human feedback. We show that the proposed algorithm achieves sublinear regret and demonstrate its empirical effectiveness through extensive simulations.
- Score: 1.9686770963118378
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Designing an effective reward function has long been a challenge in reinforcement learning, particularly for complex tasks in unstructured environments. To address this, various learning paradigms have emerged that leverage different forms of human input to specify or refine the reward function. Reinforcement learning from human feedback is a prominent approach that utilizes human comparative feedback, expressed as a preference for one behavior over another, to tackle this problem. In contrast to comparative feedback, we explore multi-level human feedback, which is provided in the form of a score at the end of each episode. This type of feedback is coarse, but it offers more informative signals about the underlying reward function than binary feedback. Additionally, it can handle non-Markovian rewards, as it is based on the evaluation of an entire episode. We propose an algorithm to efficiently learn both the reward function and the optimal policy from this form of feedback. Moreover, we show that the proposed algorithm achieves sublinear regret and demonstrate its empirical effectiveness through extensive simulations.
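As a rough illustration of the setting described in the abstract, the sketch below alternates between collecting episodes, regressing coarse episode-level scores onto state-action visit counts to estimate a reward function, and re-planning with the estimated rewards. The tabular MDP, the count-based score model, and the least-squares fit are illustrative assumptions for this sketch, not the paper's algorithm.

```python
import numpy as np

# Hedged sketch, not the paper's algorithm: estimate a reward function from
# coarse episode-level scores, then plan with the estimate.
rng = np.random.default_rng(0)
n_states, n_actions, horizon, levels = 5, 2, 8, 5

# Assumed tabular dynamics and a hidden per-(state, action) reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
true_r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

def rollout(policy, eps=0.2):
    """Run one episode with epsilon-greedy exploration; return visit counts."""
    counts = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(horizon):
        a = policy[s] if rng.random() > eps else rng.integers(n_actions)
        counts[s, a] += 1
        s = rng.choice(n_states, p=P[s, a])
    return counts

def score_episode(counts):
    """Multi-level feedback: a single coarse score for the whole episode."""
    episodic_return = float((counts * true_r).sum())
    return int(np.clip(round(levels * episodic_return / horizon), 0, levels))

def plan(r_hat):
    """Finite-horizon value iteration under the estimated reward."""
    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    for _ in range(horizon):
        Q = r_hat + P @ V                     # (S, A) action values
        policy, V = Q.argmax(axis=1), Q.max(axis=1)
    return policy

# Alternate: collect an episode, refit the reward model, re-plan.
features, scores = [], []
policy = rng.integers(n_actions, size=n_states)
for _ in range(300):
    counts = rollout(policy)
    features.append(counts.ravel())
    scores.append(score_episode(counts))
    theta, *_ = np.linalg.lstsq(np.array(features), np.array(scores), rcond=None)
    r_hat = theta.reshape(n_states, n_actions)
    policy = plan(r_hat)

print("estimated reward (rescaled):\n", np.round(r_hat / r_hat.max(), 2))
print("hidden reward (rescaled):\n", np.round(true_r / true_r.max(), 2))
```

The paper additionally handles non-Markovian episode scores and analyzes regret; the sketch above ignores both and only conveys the feedback-to-reward regression loop.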
Related papers
- Reward Learning from Multiple Feedback Types [7.910064218813772]
We show that diverse types of feedback can be utilized and lead to strong reward modeling performance. This work is the first strong indicator of the potential of multi-type feedback for RLHF.
arXiv Detail & Related papers (2025-02-28T13:29:54Z) - Navigating Noisy Feedback: Enhancing Reinforcement Learning with Error-Prone Language Models [8.025808955214957]
This paper studies the advantages and limitations of reinforcement learning from large language model feedback.
We propose a simple yet effective method for soliciting and applying feedback as a potential-based shaping function (a minimal illustrative sketch of such a shaping term appears after this list).
arXiv Detail & Related papers (2024-10-22T19:52:08Z) - Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation [67.88747330066049]
Fine-grained feedback captures nuanced distinctions in image quality and prompt-alignment.
We show that the superiority of fine-grained feedback over coarse-grained feedback is not automatic.
We identify key challenges in eliciting and utilizing fine-grained feedback.
arXiv Detail & Related papers (2024-06-24T17:19:34Z) - RLIF: Interactive Imitation Learning as Reinforcement Learning [56.997263135104504]
We show how off-policy reinforcement learning can enable improved performance under assumptions that are similar to, but potentially even more practical than, those of interactive imitation learning.
Our proposed method uses reinforcement learning with user intervention signals themselves as rewards.
This relaxes the assumption that intervening experts in interactive imitation learning should be near-optimal and enables the algorithm to learn behaviors that improve over a potentially suboptimal human expert.
arXiv Detail & Related papers (2023-11-21T21:05:21Z) - Iterative Reward Shaping using Human Feedback for Correcting Reward Misspecification [15.453123084827089]
ITERS is an iterative reward shaping approach using human feedback for mitigating the effects of a misspecified reward function.
We evaluate ITERS in three environments and show that it can successfully correct misspecified reward functions.
arXiv Detail & Related papers (2023-08-30T11:45:40Z) - Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems [82.92678837778358]
Preference-based methods have demonstrated substantial success in empirical applications such as InstructGPT.
We show how human bias and uncertainty in feedback modeling can affect the theoretical guarantees of these approaches.
arXiv Detail & Related papers (2023-07-24T17:50:24Z) - Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework that acquires exploratory trajectories enabling accurate learning of the hidden reward function.
arXiv Detail & Related papers (2023-05-29T15:00:09Z) - PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning.
We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
arXiv Detail & Related papers (2021-06-09T14:10:50Z) - Generative Inverse Deep Reinforcement Learning for Online Recommendation [62.09946317831129]
We propose a novel inverse reinforcement learning approach, namely InvRec, for online recommendation.
InvRec automatically extracts the reward function from users' behaviors for online recommendation.
arXiv Detail & Related papers (2020-11-04T12:12:25Z) - Learning Behaviors with Uncertain Human Feedback [26.046639156418223]
We introduce a novel feedback model that considers the uncertainty of human feedback.
Experimental results in both synthetic scenarios and two real-world scenarios with human participants demonstrate the superior performance of our proposed approach.
arXiv Detail & Related papers (2020-06-07T16:51:48Z) - Reinforcement Learning with Feedback Graphs [69.1524391595912]
We study episodic reinforcement learning in decision processes when the agent receives additional feedback per step.
We formalize this setting using a feedback graph over state-action pairs and show that model-based algorithms can leverage the additional feedback for more sample-efficient learning.
arXiv Detail & Related papers (2020-05-07T22:35:37Z)
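For context on the potential-based shaping mechanism named in the "Navigating Noisy Feedback" entry above, here is a minimal, self-contained sketch. The goal-distance potential standing in for (possibly error-prone) language-model advice is an assumption of this sketch, not that paper's implementation.

```python
import math

# Illustrative sketch of potential-based reward shaping (Ng et al., 1999).
# The potential below is a hypothetical distance-to-goal heuristic standing
# in for advisor feedback; it is not taken from the cited paper.

GOAL = (4, 4)
GAMMA = 0.99

def potential(state):
    """Phi(s): higher for states the (possibly imperfect) advisor prefers."""
    x, y = state
    return -math.hypot(GOAL[0] - x, GOAL[1] - y)

def shaped_reward(reward, state, next_state, done):
    """r'(s, a, s') = r + gamma * Phi(s') - Phi(s)."""
    next_phi = 0.0 if done else potential(next_state)  # terminal potential is 0
    return reward + GAMMA * next_phi - potential(state)

# Example: a transition one step closer to the goal earns a small bonus.
print(shaped_reward(0.0, (1, 1), (2, 1), done=False))
```

Over a full episode the shaping term telescopes to -Phi(s_0), so every policy's return shifts by the same constant and the original optimal policy is preserved even when the advice encoded in Phi is noisy.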