Learning Behaviors with Uncertain Human Feedback
- URL: http://arxiv.org/abs/2006.04201v1
- Date: Sun, 7 Jun 2020 16:51:48 GMT
- Title: Learning Behaviors with Uncertain Human Feedback
- Authors: Xu He, Haipeng Chen and Bo An
- Abstract summary: We introduce a novel feedback model that considers the uncertainty of human feedback.
Experimental results in both synthetic scenarios and two real-world scenarios with human participants demonstrate the superior performance of our proposed approach.
- Score: 26.046639156418223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human feedback is widely used to train agents in many domains. However,
previous works rarely consider the uncertainty in human feedback, especially
in cases where the optimal actions are not obvious to the trainers. For
example, the reward of a sub-optimal action can be stochastic and sometimes
exceed that of the optimal action, which is common in games and real-world
tasks. Trainers are then likely to give positive feedback to sub-optimal
actions, negative feedback to optimal actions, or no feedback at all in
confusing situations. Existing works, which use the Expectation Maximization
(EM) algorithm and treat the feedback model as hidden parameters, do not
account for uncertainty in the learning environment or the human feedback. To
address this challenge, we introduce a novel feedback model that captures the
uncertainty of human feedback. However, this makes the expectation step of the
EM algorithm intractable, so we propose a novel approximate EM algorithm in
which the expectation step is approximated with gradient descent.
Experimental results in both synthetic scenarios and two real-world scenarios
with human participants demonstrate the superior performance of our proposed
approach.
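
The abstract describes the method only at a high level. As a rough illustration of the overall pattern (not the paper's actual feedback model, notation, or update rules), the sketch below fits a simplified Bernoulli feedback model with a single latent trainer-consistency parameter c: the intractable expectation step is replaced by gradient ascent to a point estimate of c, and the maximization step then updates the agent's action-quality estimates, also by gradient ascent. The sigmoid feedback model, the weak prior on c, and the learning rates are all assumptions made for this toy example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# --- Toy interaction log: K actions, noisy {+1, -1} trainer feedback. ---
K, T = 4, 500
true_q = np.array([0.1, 0.4, 0.9, 0.3])   # hypothetical action qualities (unknown to the agent)
true_c = 3.0                              # hypothetical trainer consistency (latent)
actions = rng.integers(0, K, size=T)
p_pos = sigmoid(true_c * (true_q[actions] - true_q.mean()))
feedback = np.where(rng.random(T) < p_pos, 1.0, -1.0)

def residual(q, c):
    """d/d(margin) of log sigmoid(f * margin): per-sample likelihood gradient."""
    margin = c * (q[actions] - q.mean())
    return feedback * sigmoid(-feedback * margin)

def grad_q(q, c):
    """Gradient of the mean log-likelihood w.r.t. the action-quality estimates q."""
    r = residual(q, c)
    g = np.zeros_like(q)
    np.add.at(g, actions, c * r)          # chosen-action part of d(margin)/dq
    g -= c * r.sum() / K                  # -c/K contribution from the mean(q) term
    return g / T

def grad_c(q, c, prior_scale=2.0):
    """Gradient of the mean log-posterior w.r.t. the latent consistency c."""
    r = residual(q, c)
    d = q[actions] - q.mean()
    return (r * d).sum() / T - c / (prior_scale ** 2 * T)   # weak Gaussian prior on c

# Approximate EM loop: gradient steps stand in for the intractable E-step expectation.
q_hat, c_hat = np.zeros(K), 1.0
for _ in range(60):
    for _ in range(30):   # "E-step": gradient ascent to a point estimate of the latent c
        c_hat += 0.2 * grad_c(q_hat, c_hat)
    for _ in range(30):   # "M-step": gradient ascent on the agent's estimates q
        q_hat += 0.2 * grad_q(q_hat, c_hat)

print("estimated best action:", int(np.argmax(q_hat)), "| true best action:", int(np.argmax(true_q)))
```

Note that the paper's expectation step approximates an expectation over the hidden feedback-model parameters rather than collapsing to a single point estimate; the sketch only conveys where gradient descent slots into the EM loop.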
Related papers
- Reinforcement Learning from Multi-level and Episodic Human Feedback [1.9686770963118378]
We propose an algorithm to efficiently learn both the reward function and the optimal policy from multi-level human feedback.
We show that the proposed algorithm achieves sublinear regret and demonstrate its empirical effectiveness through extensive simulations.
arXiv Detail & Related papers (2025-04-20T20:09:19Z)
- Contextual bandits with entropy-based human feedback [8.94067320035758]
We introduce an entropy-based human feedback framework for contextual bandits.
Our approach achieves significant performance improvements while requiring minimal human feedback.
This work highlights the robustness and efficacy of incorporating human guidance into machine learning systems.
arXiv Detail & Related papers (2025-02-12T20:03:56Z)
- Preference Optimization as Probabilistic Inference [21.95277469346728]
We propose a method that can leverage unpaired preferred or dis-preferred examples, and works even when only one type of feedback is available.
This flexibility allows us to apply it in scenarios with varying forms of feedback and models, including training generative language models.
arXiv Detail & Related papers (2024-10-05T14:04:03Z)
- Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation [67.88747330066049]
Fine-grained feedback captures nuanced distinctions in image quality and prompt alignment.
We show, however, that its superiority over coarse-grained feedback is not automatic.
We identify key challenges in eliciting and utilizing fine-grained feedback.
arXiv Detail & Related papers (2024-06-24T17:19:34Z)
- Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions.
CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs (a generic sketch of a contrastive preference objective of this kind appears after this list).
arXiv Detail & Related papers (2023-10-20T16:37:56Z)
- Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems [82.92678837778358]
Preference-based methods have demonstrated substantial success in empirical applications such as InstructGPT.
We show how human bias and uncertainty in feedback modeling can affect the theoretical guarantees of these approaches.
arXiv Detail & Related papers (2023-07-24T17:50:24Z)
- Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism [91.52263068880484]
We study offline Reinforcement Learning with Human Feedback (RLHF).
We aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices.
RLHF is challenging for multiple reasons: large state space but limited human feedback, the bounded rationality of human decisions, and the off-policy distribution shift.
arXiv Detail & Related papers (2023-05-29T01:18:39Z)
- Marginal MAP Estimation for Inverse RL under Occlusion with Observer Noise [9.670578317106182]
We consider the problem of learning the behavioral preferences of an expert engaged in a task from noisy and partially-observable demonstrations.
Previous techniques for inverse reinforcement learning (IRL) take the approach of either omitting the missing portions or inferring them as part of expectation-maximization.
We present a new method that generalizes the well-known Bayesian maximum-a-posteriori (MAP) IRL method by marginalizing the occluded portions of the trajectory.
arXiv Detail & Related papers (2021-09-16T08:20:52Z)
- Learning from an Exploring Demonstrator: Optimal Reward Estimation for Bandits [36.37578212532926]
We introduce the "inverse bandit" problem of estimating the rewards of a multi-armed bandit instance.
Existing approaches to the related problem of inverse reinforcement learning assume the execution of an optimal policy.
We develop simple and efficient reward estimation procedures for demonstrations within a class of upper-confidence-based algorithms.
arXiv Detail & Related papers (2021-06-28T17:37:49Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Deep Reinforcement Learning with Dynamic Optimism [29.806071693039655]
We show that the optimal degree of optimism can vary both across tasks and over the course of learning.
Inspired by this insight, we introduce a novel deep actor-critic algorithm to switch between optimistic and pessimistic value learning online.
arXiv Detail & Related papers (2021-02-07T10:28:09Z)
- Reinforcement Learning with Trajectory Feedback [76.94405309609552]
In this work, we take a first step towards relaxing this assumption and require a weaker form of feedback, which we refer to as trajectory feedback.
Instead of observing the reward obtained after every action, we assume we only receive a score that represents the quality of the whole trajectory observed by the agent, namely, the sum of all rewards obtained over this trajectory.
We extend reinforcement learning algorithms to this setting, based on least-squares estimation of the unknown reward, for both the known and unknown transition model cases, and study the performance of these algorithms by analyzing their regret (a minimal least-squares sketch of this setting appears after this list).
arXiv Detail & Related papers (2020-08-13T17:49:18Z)
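
The last entry above describes trajectory-level feedback: only the sum of rewards along a trajectory is observed, and the unknown reward is recovered by least squares. Below is a minimal sketch of that estimation step, under the assumption (mine, not necessarily the paper's exact setup) that rewards are linear in state-action features.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: rewards are linear in state-action features, r(s, a) = w . phi(s, a),
# and only a noisy per-trajectory score (the sum of rewards) is observed.
d, n_traj, horizon = 6, 200, 15
true_w = rng.normal(size=d)

X, y = [], []
for _ in range(n_traj):
    phi = rng.normal(size=(horizon, d))        # features of the visited (s, a) pairs
    feat_sum = phi.sum(axis=0)                 # trajectory feedback depends only on this sum
    X.append(feat_sum)
    y.append(feat_sum @ true_w + rng.normal(scale=0.1))   # observed trajectory score

X, y = np.stack(X), np.array(y)

# Least-squares estimate of the unknown reward weights from trajectory-level scores only.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("max abs error in recovered reward weights:", float(np.abs(w_hat - true_w).max()))
```

With tabular states and actions, phi reduces to a one-hot indicator, so the same regression recovers a per-state-action reward table.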
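
The Contrastive Preference Learning entry above mentions only a "simple contrastive objective". The sketch below shows a generic Bradley-Terry style contrastive preference loss in which each segment is scored by a discounted sum of policy log-probabilities; this is in the spirit of CPL but is not claimed to be its exact objective, and alpha, gamma, and the toy numbers are assumptions.

```python
import numpy as np

def segment_score(logp_actions, alpha=0.1, gamma=0.99):
    """Score a segment by the discounted sum of log pi(a_t | s_t), scaled by alpha."""
    discounts = gamma ** np.arange(len(logp_actions))
    return alpha * np.sum(discounts * np.asarray(logp_actions))

def contrastive_preference_loss(logp_preferred, logp_rejected, alpha=0.1, gamma=0.99):
    """Bradley-Terry style loss: -log sigmoid(score(preferred) - score(rejected))."""
    diff = segment_score(logp_preferred, alpha, gamma) - segment_score(logp_rejected, alpha, gamma)
    return float(np.logaddexp(0.0, -diff))   # = -log sigmoid(diff)

# Toy usage: per-step log-probabilities the current policy assigns to each segment's actions.
loss = contrastive_preference_loss(
    logp_preferred=[-0.2, -0.3, -0.1],       # hypothetical values
    logp_rejected=[-1.5, -2.0, -1.0],
)
print(f"preference loss: {loss:.3f}")        # lower when the preferred segment scores higher
```

In practice these log-probabilities would come from the policy network and the loss would be minimized by backpropagation; the snippet only evaluates the objective.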
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.