Contextual Bandits and Imitation Learning via Preference-Based Active Queries
- URL: http://arxiv.org/abs/2307.12926v1
- Date: Mon, 24 Jul 2023 16:36:04 GMT
- Title: Contextual Bandits and Imitation Learning via Preference-Based Active Queries
- Authors: Ayush Sekhari, Karthik Sridharan, Wen Sun, Runzhe Wu
- Abstract summary: We consider the problem of contextual bandits and imitation learning, where the learner lacks direct knowledge of the executed action's reward.
Instead, the learner can actively query an expert at each round to compare two actions and receive noisy preference feedback.
The learner's objective is two-fold: to minimize the regret associated with the executed actions while simultaneously minimizing the number of comparison queries made to the expert.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of contextual bandits and imitation learning, where
the learner lacks direct knowledge of the executed action's reward. Instead,
the learner can actively query an expert at each round to compare two actions
and receive noisy preference feedback. The learner's objective is two-fold: to
minimize the regret associated with the executed actions while simultaneously
minimizing the number of comparison queries made to the expert. In this paper,
we assume that the learner has access to a function class that can represent
the expert's preference model under appropriate link functions, and provide an
algorithm that leverages an online regression oracle with respect to this
function class for choosing its actions and deciding when to query. For the
contextual bandit setting, our algorithm achieves a regret bound that combines
the best of both worlds, scaling as $O(\min\{\sqrt{T}, d/\Delta\})$, where $T$
represents the number of interactions, $d$ represents the eluder dimension of
the function class, and $\Delta$ represents the minimum preference of the
optimal action over any suboptimal action under all contexts. Our algorithm
does not require the knowledge of $\Delta$, and the obtained regret bound is
comparable to what can be achieved in the standard contextual bandits setting
where the learner observes reward signals at each round. Additionally, our
algorithm makes only $O(\min\{T, d^2/\Delta^2\})$ queries to the expert. We
then extend our algorithm to the imitation learning setting, where the learning
agent engages with an unknown environment in episodes of length $H$ each, and
provide similar guarantees for regret and query complexity. Interestingly, our
algorithm for imitation learning can even learn to outperform the underlying
expert, when it is suboptimal, highlighting a practical benefit of
preference-based feedback in imitation learning.
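The query-when-uncertain idea described above can be pictured with a small, self-contained sketch. The snippet below is not the paper's algorithm: it stands in a simple online logistic model over per-action linear scores for the online regression oracle, and replaces the eluder-dimension-based confidence analysis with a fixed-margin heuristic for deciding when to ask the expert for a comparison. All names (expert_prefers, query_margin, theta_hat) and numerical choices are illustrative assumptions, not notation from the paper.

```python
# Minimal sketch of a preference-based contextual bandit loop with active
# expert queries. Hypothetical setup: linear expert utilities, logistic link.
import numpy as np

rng = np.random.default_rng(0)

D, K, T = 5, 4, 2000                      # context dim, actions, rounds
theta_star = rng.normal(size=(K, D))      # unknown expert utility per action


def expert_prefers(x, a, b):
    """Noisy preference feedback: P(a preferred over b) via a logistic link."""
    gap = theta_star[a] @ x - theta_star[b] @ x
    return rng.random() < 1.0 / (1.0 + np.exp(-gap))


# Stand-in for the online regression oracle: per-action linear scores updated
# by SGD on the logistic preference loss.
theta_hat = np.zeros((K, D))
lr, query_margin = 0.05, 0.25
n_queries, regret = 0, 0.0

for t in range(T):
    x = rng.normal(size=D)
    scores = theta_hat @ x
    order = np.argsort(scores)[::-1]
    a, b = order[0], order[1]             # greedy action and runner-up

    # Execute the greedy action; track (unobserved) regret against the best action.
    best = int(np.argmax(theta_star @ x))
    regret += (theta_star[best] - theta_star[a]) @ x

    # Query the expert only when the estimated preference gap between the top
    # two actions is small, i.e. the oracle cannot yet separate them.
    if scores[a] - scores[b] < query_margin:
        n_queries += 1
        y = 1.0 if expert_prefers(x, a, b) else 0.0
        p = 1.0 / (1.0 + np.exp(-(scores[a] - scores[b])))
        grad = p - y                       # logistic loss gradient wrt the score gap
        theta_hat[a] -= lr * grad * x
        theta_hat[b] += lr * grad * x

print(f"rounds={T}  queries={n_queries}  cumulative regret={regret:.1f}")
```

The structural point the sketch preserves is that feedback is a binary comparison between the executed action and its runner-up, and that comparisons are requested only when the estimated preference gap is small; once the learned scores separate the optimal action by a margin, queries effectively stop, which is the intuition behind a query count that can be far below $T$.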
Related papers
- Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback [58.66941279460248]
Learning from human feedback plays an important role in aligning generative models, such as large language models (LLMs).
We study a model within this problem domain, contextual dueling bandits with adversarial feedback, in which the true preference label can be flipped by an adversary.
We propose an algorithm, robust contextual dueling bandit, which is based on uncertainty-weighted maximum likelihood estimation.
arXiv Detail & Related papers (2024-04-16T17:59:55Z)
- Online Learning in Contextual Second-Price Pay-Per-Click Auctions [47.06746975822902]
We study online learning in contextual pay-per-click auctions where, at each of the $T$ rounds, the learner receives some context along with a set of ads.
The learner's goal is to minimize her regret, defined as the gap between her total revenue and that of an oracle strategy.
arXiv Detail & Related papers (2023-10-08T07:04:22Z)
- On the Sample Complexity of Representation Learning in Multi-task Bandits with Global and Local structure [77.60508571062958]
We investigate the sample complexity of learning the optimal arm for multi-task bandit problems.
Arms consist of two components: one that is shared across tasks (that we call the representation) and one that is task-specific (that we call the predictor).
We devise an algorithm OSRL-SC whose sample complexity approaches the lower bound, and scales at most as $H(G\log(\delta_G) + X\log(\delta_H))$, with $X, G, H$ being, respectively, the number of tasks, representations and predictors.
arXiv Detail & Related papers (2022-11-28T08:40:12Z)
- Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability [59.81339109121384]
We study the $K$-armed contextual dueling bandit problem, a sequential decision making setting in which the learner uses contextual information to make two decisions, but only observes preference-based feedback suggesting that one decision was better than the other.
We provide a new algorithm that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works.
arXiv Detail & Related papers (2021-11-24T07:14:57Z)
- Teaching an Active Learner with Contrastive Examples [35.926575235046634]
We study the problem of active learning with the added twist that the learner is assisted by a helpful teacher.
We investigate an efficient teaching algorithm that adaptively picks contrastive examples.
We derive strong performance guarantees for our algorithm based on two problem-dependent parameters.
arXiv Detail & Related papers (2021-10-28T05:00:55Z)
- The Information Geometry of Unsupervised Reinforcement Learning [133.20816939521941]
Unsupervised skill discovery is a class of algorithms that learn a set of policies without access to a reward function.
We show that unsupervised skill discovery algorithms do not learn skills that are optimal for every possible reward function.
arXiv Detail & Related papers (2021-10-06T13:08:36Z)
- Active Learning for Contextual Search with Binary Feedbacks [2.6424064030995957]
We study the learning problem in contextual search, motivated by applications such as first-price auctions.
We propose a tri-section search approach combined with a margin-based active learning method.
arXiv Detail & Related papers (2021-10-03T19:05:29Z)
- Online Apprenticeship Learning [58.45089581278177]
In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function.
The goal is to find a policy that matches the expert's performance on some predefined set of cost functions.
We show that the OAL problem can be effectively solved by combining two mirror descent based no-regret algorithms.
arXiv Detail & Related papers (2021-02-13T12:57:51Z)
- Regret Minimization in Stochastic Contextual Dueling Bandits [40.17224226373741]
We consider the problem of $K$-armed dueling bandit in the contextual setting.
We present two algorithms for the setup with respective regret guarantees.
arXiv Detail & Related papers (2020-02-20T06:36:19Z)