Pure Exploration with Structured Preference Feedback
- URL: http://arxiv.org/abs/2104.05294v1
- Date: Mon, 12 Apr 2021 08:57:29 GMT
- Title: Pure Exploration with Structured Preference Feedback
- Authors: Shubham Gupta, Aadirupa Saha, and Sumeet Katariya
- Abstract summary: We consider the problem of pure exploration with subset-wise preference feedback, which contains $N$ arms with features.
We present two algorithms that guarantee the detection of the best-arm in $\tilde{O}(\frac{d^2}{K \Delta^2})$ samples with probability at least $1 - \delta$.
- Score: 25.894827160719526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of pure exploration with subset-wise preference
feedback, which contains $N$ arms with features. The learner is allowed to
query subsets of size $K$ and receives feedback in the form of a noisy winner.
The goal of the learner is to identify the best arm efficiently using as few
queries as possible. This setting is relevant in various online decision-making
scenarios involving human feedback such as online retailing, streaming
services, news feed, and online advertising; since it is easier and more
reliable for people to choose a preferred item from a subset than to assign a
likability score to an item in isolation. To the best of our knowledge, this is
the first work that considers the subset-wise preference feedback model in a
structured setting, which allows for a potentially infinite set of arms. We
present two algorithms that guarantee the detection of the best-arm in
$\tilde{O} (\frac{d^2}{K \Delta^2})$ samples with probability at least $1 -
\delta$, where $d$ is the dimension of the arm-features and $\Delta$ is the
appropriate notion of utility gap among the arms. We also derive an
instance-dependent lower bound of $\Omega(\frac{d}{\Delta^2} \log
\frac{1}{\delta})$ which matches our upper bound on a worst-case instance.
Finally, we run extensive experiments to corroborate our theoretical findings,
and observe that our adaptive algorithm stops and requires up to 12x fewer
samples than a non-adaptive algorithm.
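As a concrete illustration of the feedback protocol described in the abstract, the minimal sketch below simulates subset-wise preference feedback under a linear utility model with a multinomial-logit (Plackett-Luce) noisy winner. The choice model, variable names, and parameter values are illustrative assumptions for this sketch, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d, N, K = 5, 50, 4                       # feature dim, number of arms, subset size
theta_star = rng.normal(size=d)          # unknown utility parameter (assumed linear model)
arm_features = rng.normal(size=(N, d))   # one feature vector per arm

def noisy_winner(subset_indices):
    """Query a size-K subset and return the index of the noisy winner.

    The winner is drawn with probability proportional to exp(utility),
    i.e. a multinomial-logit choice over the queried subset.
    """
    utilities = arm_features[subset_indices] @ theta_star
    probs = np.exp(utilities - utilities.max())
    probs /= probs.sum()
    return subset_indices[rng.choice(len(subset_indices), p=probs)]

# Example query: the learner picks K arms and observes which one "wins".
query = rng.choice(N, size=K, replace=False)
print("queried subset:", query, "-> winner:", noisy_winner(query))
```

A pure-exploration algorithm in this setting would repeatedly issue such size-$K$ queries and aggregate the observed winners to estimate the utility parameter before declaring a best arm.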
Related papers
- Optimal Multi-Fidelity Best-Arm Identification [65.23078799972188]
In bandit best-arm identification, an algorithm is tasked with finding the arm with highest mean reward with a specified accuracy as fast as possible.
We study multi-fidelity best-arm identification, in which the algorithm can choose to sample an arm at a lower fidelity (a less accurate mean estimate) for a lower cost.
Several methods have been proposed for tackling this problem, but their optimality remains elusive, notably due to loose lower bounds on the total cost needed to identify the best arm.
arXiv Detail & Related papers (2024-06-05T08:02:40Z)
- Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback [58.66941279460248]
Learning from human feedback plays an important role in aligning generative models, such as large language models (LLMs).
We study a model within this problem domain--contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary.
We propose an algorithm, robust contextual dueling bandit, which is based on uncertainty-weighted maximum likelihood estimation.
arXiv Detail & Related papers (2024-04-16T17:59:55Z)
- Optimal Top-Two Method for Best Arm Identification and Fluid Analysis [15.353009236788262]
We propose an optimal top-2 type algorithm for the best arm identification problem.
We show that the proposed algorithm is optimal as $\delta \rightarrow 0$.
arXiv Detail & Related papers (2024-03-14T06:14:07Z)
- Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel stochastic-greedy bandit (SGB) algorithm for multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z)
- Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
- Selective Sampling for Online Best-arm Identification [19.767267982167578]
Given a set of potential options, a learner aims to identify the best option with probability greater than $1-\delta$.
The main results of this work precisely characterize this trade-off between labeled samples and stopping time.
Our framework is general enough to capture binary classification, improving upon previous works.
arXiv Detail & Related papers (2021-10-28T03:02:08Z)
- Bandits with many optimal arms [68.17472536610859]
We write $p^*$ for the proportion of optimal arms and $\Delta$ for the minimal mean-gap between optimal and sub-optimal arms.
We characterize the optimal learning rates both in the cumulative regret setting, and in the best-arm identification setting.
arXiv Detail & Related papers (2021-03-23T11:02:31Z)
- Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy [71.17938026619068]
We study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous.
We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy (a sketch of this rule appears after this list).
We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T \log(|\mathcal{F}|T)})$.
arXiv Detail & Related papers (2021-02-15T19:10:52Z)
- Regret Minimization in Stochastic Contextual Dueling Bandits [40.17224226373741]
We consider the problem of $K$-armed dueling bandit in the contextual setting.
We present two algorithms for the setup with respective regret guarantees.
arXiv Detail & Related papers (2020-02-20T06:36:19Z)
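For the Inverse Gap Weighting strategy mentioned in the Top-$k$ eXtreme Contextual Bandits entry above, the sketch below shows the standard IGW rule (here for $k = 1$) that converts reward estimates into a sampling distribution over arms. The function name, the choice of $\gamma$, and the toy estimates are illustrative assumptions, not taken from that paper.

```python
import numpy as np

def inverse_gap_weighting(f_hat, gamma):
    """Standard IGW distribution: each suboptimal arm a gets probability
    1 / (A + gamma * (f_hat[best] - f_hat[a])); the leftover mass goes to
    the empirically best arm."""
    f_hat = np.asarray(f_hat, dtype=float)
    A = len(f_hat)
    best = int(np.argmax(f_hat))
    probs = 1.0 / (A + gamma * (f_hat[best] - f_hat))
    probs[best] = 0.0
    probs[best] = 1.0 - probs.sum()      # remaining mass on the greedy arm
    return probs

# Toy usage: larger gamma concentrates more mass on the empirically best arm.
probs = inverse_gap_weighting([0.9, 0.5, 0.2, 0.1], gamma=20.0)
print(probs, probs.sum())                # a valid distribution summing to 1
```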