Learning from an Exploring Demonstrator: Optimal Reward Estimation for Bandits
- URL: http://arxiv.org/abs/2106.14866v1
- Date: Mon, 28 Jun 2021 17:37:49 GMT
- Title: Learning from an Exploring Demonstrator: Optimal Reward Estimation for Bandits
- Authors: Wenshuo Guo, Kumar Krishna Agrawal, Aditya Grover, Vidya Muthukumar, Ashwin Pananjady
- Abstract summary: We introduce the "inverse bandit" problem of estimating the rewards of a multi-armed bandit instance.
Existing approaches to the related problem of inverse reinforcement learning assume the execution of an optimal policy.
We develop simple and efficient reward estimation procedures for demonstrations within a class of upper-confidence-based algorithms.
- Score: 36.37578212532926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the "inverse bandit" problem of estimating the rewards of a
multi-armed bandit instance from observing the learning process of a low-regret
demonstrator. Existing approaches to the related problem of inverse
reinforcement learning assume the execution of an optimal policy, and thereby
suffer from an identifiability issue. In contrast, our paradigm leverages the
demonstrator's behavior en route to optimality, and in particular, the
exploration phase, to obtain consistent reward estimates. We develop simple and
efficient reward estimation procedures for demonstrations within a class of
upper-confidence-based algorithms, showing that reward estimation gets
progressively easier as the regret of the algorithm increases. We match these
upper bounds with information-theoretic lower bounds that apply to any
demonstrator algorithm, thereby characterizing the optimal tradeoff between
exploration and reward estimation. Extensive empirical evaluations on both
synthetic data and simulated experimental design data from the natural sciences
corroborate our theoretical results.
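To make the observation model concrete, here is a minimal sketch of the inverse-bandit idea, not the paper's actual estimator: a UCB1 demonstrator plays a Bernoulli bandit, the observer records only the sequence of arm pulls, and then inverts the standard pull-count/gap relation N_a(T) ≈ 8 log(T) / Δ_a² to recover the suboptimality gaps, i.e. the rewards relative to the best arm. The choice of UCB1, the function names, and the constants below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def ucb1_demonstration(means, horizon):
    """Run a UCB1 demonstrator on a Bernoulli bandit; return only its arm pulls."""
    k = len(means)
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    pulls = []
    for t in range(1, horizon + 1):
        if t <= k:  # pull every arm once to initialise the indices
            arm = t - 1
        else:
            ucb = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        reward = rng.binomial(1, means[arm])  # seen by the demonstrator, not by the observer
        counts[arm] += 1
        sums[arm] += reward
        pulls.append(arm)
    return np.array(pulls)


def estimate_gaps_from_pulls(pulls, k, horizon):
    """Toy inverse-bandit estimator: invert the relation N_a(T) ~ 8 log(T) / gap_a^2."""
    counts = np.bincount(pulls, minlength=k)
    best = int(np.argmax(counts))  # the most-pulled arm is declared optimal
    gaps = np.sqrt(8.0 * np.log(horizon) / counts)  # crude; the constant is heuristic
    gaps[best] = 0.0  # rewards are recovered relative to the best arm
    return best, gaps


true_means = np.array([0.9, 0.6, 0.5, 0.3])
T = 100_000
pulls = ucb1_demonstration(true_means, T)
best, gap_hat = estimate_gaps_from_pulls(pulls, len(true_means), T)
print("estimated best arm:", best)
print("estimated gaps:    ", np.round(gap_hat, 3))
print("true gaps:         ", np.round(true_means.max() - true_means, 3))
```

This snippet only illustrates why exploration makes the rewards identifiable from pull counts; the paper develops estimators with matching information-theoretic lower bounds.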
Related papers
- Alpha and Prejudice: Improving α-sized Worst-case Fairness via Intrinsic Reweighting [34.954141077528334]
Worst-case fairness with off-the-shelf demographic groups achieves parity by maximizing the model utility of the worst-off group.
Recent advances have reframed this learning problem by introducing a lower bound on the minimal partition ratio.
arXiv Detail & Related papers (2024-11-05T13:04:05Z)
- Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques [65.55451717632317]
We study Multi-Agent Reinforcement Learning from Human Feedback (MARLHF), exploring both theoretical foundations and empirical validations.
We define the task as identifying a Nash equilibrium from a preference-only offline dataset in general-sum games.
Our findings underscore the multifaceted approach required for MARLHF, paving the way for effective preference-based multi-agent systems.
arXiv Detail & Related papers (2024-09-01T13:14:41Z)
- A Unified Linear Programming Framework for Offline Reward Learning from Human Demonstrations and Feedback [6.578074497549894]
Inverse Reinforcement Learning (IRL) and Reinforcement Learning from Human Feedback (RLHF) are pivotal methodologies in reward learning.
This paper introduces a novel linear programming (LP) framework tailored for offline reward learning.
arXiv Detail & Related papers (2024-05-20T23:59:26Z)
- Rewarding Episodic Visitation Discrepancy for Exploration in Reinforcement Learning [64.8463574294237]
We propose Rewarding Episodic Visitation Discrepancy (REVD) as an efficient and quantified exploration method.
REVD provides intrinsic rewards by evaluating the Rényi divergence-based visitation discrepancy between episodes.
It is tested on PyBullet Robotics Environments and Atari games.
arXiv Detail & Related papers (2022-09-19T08:42:46Z)
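The summary above does not specify how REVD computes the discrepancy; as a rough, hypothetical illustration only, one can take the Rényi divergence between the discretized state-visitation distributions of the current and the previous episode and use it as an episodic intrinsic reward. The binning, the order α, and the smoothing constant are assumptions, not details from that paper.

```python
import numpy as np


def renyi_divergence(p, q, alpha=0.5, eps=1e-8):
    """Renyi divergence D_alpha(P || Q) between two discrete distributions."""
    p = (p + eps) / (p + eps).sum()  # smooth to avoid division by zero
    q = (q + eps) / (q + eps).sum()
    return np.log(np.sum(p ** alpha * q ** (1.0 - alpha))) / (alpha - 1.0)


def episodic_visitation_bonus(prev_states, curr_states, n_bins, alpha=0.5):
    """Intrinsic reward for the current episode: how far its (discretized)
    state-visitation distribution drifted from the previous episode's."""
    p = np.bincount(curr_states, minlength=n_bins).astype(float)
    q = np.bincount(prev_states, minlength=n_bins).astype(float)
    return renyi_divergence(p / p.sum(), q / q.sum(), alpha)


# Toy usage with states already discretized into 10 bins.
rng = np.random.default_rng(1)
prev = rng.integers(0, 10, size=200)  # previous episode spread over all bins
curr = rng.integers(0, 5, size=200)   # current episode concentrated on a subset of bins
print(episodic_visitation_bonus(prev, curr, n_bins=10))
```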
- IL-flOw: Imitation Learning from Observation using Normalizing Flows [28.998176144874193]
We present an algorithm for Inverse Reinforcement Learning (IRL) from expert state observations only.
Our approach decouples reward modelling from policy learning, unlike state-of-the-art adversarial methods.
arXiv Detail & Related papers (2022-05-19T00:05:03Z)
- Flexible and Efficient Contextual Bandits with Heterogeneous Treatment Effect Oracle [12.906249996227904]
We design a statistically optimal and computationally efficient algorithm using heterogeneous treatment effect estimation oracles.
Our results provide the first universal reduction of contextual bandits to a general-purpose heterogeneous treatment effect estimation method.
We show that our approach is more robust to model misspecification than reward estimation methods based on squared error regression oracles.
arXiv Detail & Related papers (2022-03-30T20:43:43Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the difficult nature of the one-class problem.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach, which achieves convergence speed comparable to the pointwise counterpart.
arXiv Detail & Related papers (2021-05-11T03:38:16Z)
- Reinforcement Learning with Trajectory Feedback [76.94405309609552]
In this work, we take a first step towards relaxing this assumption and require a weaker form of feedback, which we refer to as trajectory feedback.
Instead of observing the reward obtained after every action, we assume we only receive a score that represents the quality of the whole trajectory observed by the agent, namely, the sum of all rewards obtained over this trajectory.
We extend reinforcement learning algorithms to this setting, based on least-squares estimation of the unknown reward, for both the known and unknown transition model cases, and study the performance of these algorithms by analyzing their regret.
arXiv Detail & Related papers (2020-08-13T17:49:18Z)
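As a minimal sketch of the least-squares step described above, under a tabular assumption of our own (not that paper's code): each trajectory is summarized by its state-action visitation counts, the only feedback is the noisy sum of the unknown per-state-action rewards along the trajectory, and the reward vector is recovered by ordinary least squares. The sizes and noise level below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n_sa = 6       # number of (state, action) pairs under a tabular assumption
n_traj = 500   # number of observed trajectories
true_r = rng.normal(size=n_sa)  # unknown per-(state, action) reward

# Each row holds one trajectory's visitation counts over the (state, action) pairs.
visits = rng.integers(0, 5, size=(n_traj, n_sa)).astype(float)

# Trajectory feedback: only the (noisy) total reward of each trajectory is observed.
scores = visits @ true_r + 0.1 * rng.normal(size=n_traj)

# Least-squares estimate of the per-(state, action) reward from trajectory feedback.
r_hat, *_ = np.linalg.lstsq(visits, scores, rcond=None)

print("true rewards:     ", np.round(true_r, 3))
print("estimated rewards:", np.round(r_hat, 3))
```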
- Learning the Truth From Only One Side of the Story [58.65439277460011]
We focus on generalized linear models and show that without adjusting for this sampling bias, the model may converge suboptimally or even fail to converge to the optimal solution.
We propose an adaptive approach that comes with theoretical guarantees and show that it outperforms several existing methods empirically.
arXiv Detail & Related papers (2020-06-08T18:20:28Z)