Learning from an Exploring Demonstrator: Optimal Reward Estimation for Bandits
- URL: http://arxiv.org/abs/2106.14866v1
- Date: Mon, 28 Jun 2021 17:37:49 GMT
- Title: Learning from an Exploring Demonstrator: Optimal Reward Estimation for Bandits
- Authors: Wenshuo Guo, Kumar Krishna Agrawal, Aditya Grover, Vidya Muthukumar, Ashwin Pananjady
- Abstract summary: We introduce the "inverse bandit" problem of estimating the rewards of a multi-armed bandit instance.
Existing approaches to the related problem of inverse reinforcement learning assume the execution of an optimal policy.
We develop simple and efficient reward estimation procedures for demonstrations within a class of upper-confidence-based algorithms.
- Score: 36.37578212532926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the "inverse bandit" problem of estimating the rewards of a
multi-armed bandit instance from observing the learning process of a low-regret
demonstrator. Existing approaches to the related problem of inverse
reinforcement learning assume the execution of an optimal policy, and thereby
suffer from an identifiability issue. In contrast, our paradigm leverages the
demonstrator's behavior en route to optimality, and in particular, the
exploration phase, to obtain consistent reward estimates. We develop simple and
efficient reward estimation procedures for demonstrations within a class of
upper-confidence-based algorithms, showing that reward estimation gets
progressively easier as the regret of the algorithm increases. We match these
upper bounds with information-theoretic lower bounds that apply to any
demonstrator algorithm, thereby characterizing the optimal tradeoff between
exploration and reward estimation. Extensive empirical evaluations on both
synthetic data and simulated experimental design data from the natural sciences
corroborate our theoretical results.
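To make the observation model concrete, here is a minimal sketch of the inverse-bandit idea, not the paper's actual estimator: a UCB1 demonstrator plays a Bernoulli bandit, the observer records only the sequence of arm pulls, and then inverts the standard pull-count/gap relation N_a(T) ≈ 8 log(T) / Δ_a² to recover the suboptimality gaps, i.e. the rewards relative to the best arm. The choice of UCB1, the function names, and the constants below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def ucb1_demonstration(means, horizon):
    """Run a UCB1 demonstrator on a Bernoulli bandit; return only its arm pulls."""
    k = len(means)
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    pulls = []
    for t in range(1, horizon + 1):
        if t <= k:  # pull every arm once to initialise the indices
            arm = t - 1
        else:
            ucb = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        reward = rng.binomial(1, means[arm])  # seen by the demonstrator, not by the observer
        counts[arm] += 1
        sums[arm] += reward
        pulls.append(arm)
    return np.array(pulls)


def estimate_gaps_from_pulls(pulls, k, horizon):
    """Toy inverse-bandit estimator: invert the relation N_a(T) ~ 8 log(T) / gap_a^2."""
    counts = np.bincount(pulls, minlength=k)
    best = int(np.argmax(counts))  # the most-pulled arm is declared optimal
    gaps = np.sqrt(8.0 * np.log(horizon) / counts)  # crude; the constant is heuristic
    gaps[best] = 0.0  # rewards are recovered relative to the best arm
    return best, gaps


true_means = np.array([0.9, 0.6, 0.5, 0.3])
T = 100_000
pulls = ucb1_demonstration(true_means, T)
best, gap_hat = estimate_gaps_from_pulls(pulls, len(true_means), T)
print("estimated best arm:", best)
print("estimated gaps:    ", np.round(gap_hat, 3))
print("true gaps:         ", np.round(true_means.max() - true_means, 3))
```

This snippet only illustrates why exploration makes the rewards identifiable from pull counts; the paper develops estimators with matching information-theoretic lower bounds.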
Related papers
- Alpha and Prejudice: Improving α-sized Worst-case Fairness via Intrinsic Reweighting [34.954141077528334]
Worst-case fairness with off-the-shelf demographic groups achieves parity by maximizing the model utility of the worst-off group.
Recent advances have reframed this learning problem by introducing a lower bound on the minimal partition ratio.
arXiv Detail & Related papers (2024-11-05T13:04:05Z)
- Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques [65.55451717632317]
We study Multi-Agent Reinforcement Learning from Human Feedback (MARLHF), exploring both theoretical foundations and empirical validations.
We define the task as identifying a Nash equilibrium from a preference-only offline dataset in general-sum games.
Our findings underscore the multifaceted approach required for MARLHF, paving the way for effective preference-based multi-agent systems.
arXiv Detail & Related papers (2024-09-01T13:14:41Z)
- A Unified Linear Programming Framework for Offline Reward Learning from Human Demonstrations and Feedback [6.578074497549894]
Inverse Reinforcement Learning (IRL) and Reinforcement Learning from Human Feedback (RLHF) are pivotal methodologies in reward learning.
This paper introduces a novel linear programming (LP) framework tailored for offline reward learning.
arXiv Detail & Related papers (2024-05-20T23:59:26Z)
- Rewarding Episodic Visitation Discrepancy for Exploration in Reinforcement Learning [64.8463574294237]
We propose Rewarding Episodic Visitation Discrepancy (REVD) as an efficient and quantified exploration method.
REVD provides intrinsic rewards by evaluating the Rényi divergence-based visitation discrepancy between episodes.
It is tested on PyBullet Robotics Environments and Atari games.
arXiv Detail & Related papers (2022-09-19T08:42:46Z)
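The summary above does not specify how REVD computes the discrepancy; as a rough, hypothetical illustration only, one can take the Rényi divergence between the discretized state-visitation distributions of the current and the previous episode and use it as an episodic intrinsic reward. The binning, the order α, and the smoothing constant are assumptions, not details from that paper.

```python
import numpy as np


def renyi_divergence(p, q, alpha=0.5, eps=1e-8):
    """Renyi divergence D_alpha(P || Q) between two discrete distributions."""
    p = (p + eps) / (p + eps).sum()  # smooth to avoid division by zero
    q = (q + eps) / (q + eps).sum()
    return np.log(np.sum(p ** alpha * q ** (1.0 - alpha))) / (alpha - 1.0)


def episodic_visitation_bonus(prev_states, curr_states, n_bins, alpha=0.5):
    """Intrinsic reward for the current episode: how far its (discretized)
    state-visitation distribution drifted from the previous episode's."""
    p = np.bincount(curr_states, minlength=n_bins).astype(float)
    q = np.bincount(prev_states, minlength=n_bins).astype(float)
    return renyi_divergence(p / p.sum(), q / q.sum(), alpha)


# Toy usage with states already discretized into 10 bins.
rng = np.random.default_rng(1)
prev = rng.integers(0, 10, size=200)  # previous episode spread over all bins
curr = rng.integers(0, 5, size=200)   # current episode concentrated on a subset of bins
print(episodic_visitation_bonus(prev, curr, n_bins=10))
```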
- IL-flOw: Imitation Learning from Observation using Normalizing Flows [28.998176144874193]
We present an algorithm for Inverse Reinforcement Learning (IRL) from expert state observations only.
Our approach decouples reward modelling from policy learning, unlike state-of-the-art adversarial methods.
arXiv Detail & Related papers (2022-05-19T00:05:03Z)
- Flexible and Efficient Contextual Bandits with Heterogeneous Treatment Effect Oracle [12.906249996227904]
We design a statistically optimal and computationally efficient algorithm using heterogeneous treatment effect estimation oracles.
Our results provide the first universal reduction of contextual bandits to a general-purpose heterogeneous treatment effect estimation method.
We show that our approach is more robust to model misspecification than reward estimation methods based on squared error regression oracles.
arXiv Detail & Related papers (2022-03-30T20:43:43Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the difficult nature of the one-class problem.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach, which achieves convergence speed comparable to the pointwise counterpart.
arXiv Detail & Related papers (2021-05-11T03:38:16Z)
- Reinforcement Learning with Trajectory Feedback [76.94405309609552]
In this work, we take a first step towards relaxing this assumption and require a weaker form of feedback, which we refer to as trajectory feedback.
Instead of observing the reward obtained after every action, we assume we only receive a score that represents the quality of the whole trajectory observed by the agent, namely, the sum of all rewards obtained over this trajectory.
We extend reinforcement learning algorithms to this setting, based on least-squares estimation of the unknown reward, for both the known and unknown transition model cases, and study the performance of these algorithms by analyzing their regret.
arXiv Detail & Related papers (2020-08-13T17:49:18Z)
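As a minimal sketch of the least-squares step described above, under a tabular assumption of our own (not that paper's code): each trajectory is summarized by its state-action visitation counts, the only feedback is the noisy sum of the unknown per-state-action rewards along the trajectory, and the reward vector is recovered by ordinary least squares. The sizes and noise level below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n_sa = 6       # number of (state, action) pairs under a tabular assumption
n_traj = 500   # number of observed trajectories
true_r = rng.normal(size=n_sa)  # unknown per-(state, action) reward

# Each row holds one trajectory's visitation counts over the (state, action) pairs.
visits = rng.integers(0, 5, size=(n_traj, n_sa)).astype(float)

# Trajectory feedback: only the (noisy) total reward of each trajectory is observed.
scores = visits @ true_r + 0.1 * rng.normal(size=n_traj)

# Least-squares estimate of the per-(state, action) reward from trajectory feedback.
r_hat, *_ = np.linalg.lstsq(visits, scores, rcond=None)

print("true rewards:     ", np.round(true_r, 3))
print("estimated rewards:", np.round(r_hat, 3))
```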
- Learning the Truth From Only One Side of the Story [58.65439277460011]
We focus on generalized linear models and show that without adjusting for this sampling bias, the model may converge suboptimally or even fail to converge to the optimal solution.
We propose an adaptive approach that comes with theoretical guarantees and show that it outperforms several existing methods empirically.
arXiv Detail & Related papers (2020-06-08T18:20:28Z)