Thompson Exploration with Best Challenger Rule in Best Arm
Identification
- URL: http://arxiv.org/abs/2310.00539v1
- Date: Sun, 1 Oct 2023 01:37:02 GMT
- Title: Thompson Exploration with Best Challenger Rule in Best Arm
Identification
- Authors: Jongyeong Lee, Junya Honda, Masashi Sugiyama
- Abstract summary: We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
- Score: 66.33448474838342
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the fixed-confidence best arm identification (BAI) problem
in the bandit framework for canonical single-parameter exponential family models.
For this problem, many policies have been proposed, but most of them require
solving an optimization problem at every round and/or forcing each arm to be
explored at least a certain number of times, except for those restricted to the
Gaussian model. To address these limitations, we propose a novel policy that combines
Thompson sampling with a computationally efficient approach known as the best
challenger rule. While Thompson sampling was originally considered for
maximizing the cumulative reward, we demonstrate that it can be used to
naturally explore arms in BAI without forced exploration. We show that our policy is
asymptotically optimal for any two-armed bandit problem and achieves
near-optimality for general $K$-armed bandit problems with $K\geq 3$. Nevertheless,
in numerical experiments, our policy shows performance competitive with
asymptotically optimal policies in terms of sample complexity while requiring
lower computational cost. In addition, we highlight the advantages of our policy
by comparing it to the concept of $\beta$-optimality, a relaxed notion of
asymptotic optimality commonly considered in the analysis of a class of
policies including the proposed one.
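To make the combination concrete, below is a minimal sketch of a Thompson-exploration loop with a best-challenger arm choice and a Chernoff-type stopping rule, written for Gaussian arms with unit variance. The flat-prior posteriors, the leader/challenger tie-break, and the stopping threshold are illustrative assumptions, not the authors' exact policy or thresholds.
```python
# Illustrative sketch only: a stylized Thompson-exploration + best-challenger loop
# for fixed-confidence BAI with Gaussian unit-variance arms. The flat priors and
# the stopping threshold are assumptions for illustration, not the paper's
# exact specification.
import numpy as np


def glr_gaussian(mu_hat, counts, leader, challenger):
    """Gaussian generalized likelihood ratio separating leader from challenger."""
    gap = mu_hat[leader] - mu_hat[challenger]
    if gap <= 0:
        return 0.0
    return gap ** 2 / (2.0 * (1.0 / counts[leader] + 1.0 / counts[challenger]))


def ts_best_challenger(true_means, delta=0.05, rng=np.random.default_rng(0)):
    K = len(true_means)
    counts = np.zeros(K)
    sums = np.zeros(K)

    # Pull every arm once so the posteriors are well defined.
    for a in range(K):
        sums[a] += rng.normal(true_means[a], 1.0)
        counts[a] += 1

    while True:
        mu_hat = sums / counts
        leader = int(np.argmax(mu_hat))

        # Chernoff-type stopping: smallest GLR against any challenger vs. a
        # (stylized) threshold of order log(1/delta).
        challengers = [a for a in range(K) if a != leader]
        glrs = {a: glr_gaussian(mu_hat, counts, leader, a) for a in challengers}
        best_challenger = min(glrs, key=glrs.get)
        threshold = np.log((K - 1) / delta) + 2.0 * np.log(1.0 + np.log(counts.sum()))
        if glrs[best_challenger] > threshold:
            return leader, int(counts.sum())

        # Thompson exploration: one posterior draw per arm (flat Gaussian prior).
        theta = rng.normal(mu_hat, 1.0 / np.sqrt(counts))
        sampled = int(np.argmax(theta))
        if sampled != leader:
            arm = sampled  # the posterior draw disagrees with the leader: explore it
        else:
            # Otherwise fall back to the best challenger rule between the leader
            # and its closest challenger (tie-break by pull counts, an assumption).
            arm = best_challenger if counts[best_challenger] <= counts[leader] else leader

        sums[arm] += rng.normal(true_means[arm], 1.0)
        counts[arm] += 1
```
Calling `ts_best_challenger([0.5, 0.3, 0.0])` returns the index of the arm declared best together with the number of samples used; the point of the paper's scheme is that the posterior draw supplies exploration without forcing a minimum number of pulls per arm.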
Related papers
- Optimal Multi-Fidelity Best-Arm Identification [65.23078799972188]
In bandit best-arm identification, an algorithm is tasked with finding the arm with highest mean reward with a specified accuracy as fast as possible.
We study multi-fidelity best-arm identification, in which the algorithm can choose to sample an arm at a lower fidelity (less accurate mean estimate) for a lower cost.
Several methods have been proposed for tackling this problem, but their optimality remains elusive, notably due to loose lower bounds on the total cost needed to identify the best arm.
arXiv Detail & Related papers (2024-06-05T08:02:40Z)
- Oracle-Efficient Reinforcement Learning for Max Value Ensembles [7.404901768256101]
Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, theoretically and experimentally.
In this work we aim to compete with the $\textit{max-following policy}$, which at each state follows the action of whichever constituent policy has the highest value.
Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies.
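For reference, the max-following rule itself is easy to state in code. The toy sketch below (with hypothetical `policies`/`value_fns` interfaces of our own choosing) only illustrates the definition; the paper's contribution is learning to compete with this policy when only the constituent policies, not their values, are accessible.
```python
# Toy illustration of the max-following rule (interfaces are hypothetical):
# at each state, act according to whichever constituent policy has the
# highest value estimate at that state.
from typing import Callable, Sequence


def max_following_action(state, policies: Sequence[Callable], value_fns: Sequence[Callable]):
    best = max(range(len(policies)), key=lambda i: value_fns[i](state))
    return policies[best](state)


# Two toy constituent policies on integer states: one better for negative states,
# one better for positive states.
policies = [lambda s: "go_left", lambda s: "go_right"]
value_fns = [lambda s: -s, lambda s: s]
print(max_following_action(-3, policies, value_fns))  # -> "go_left"
print(max_following_action(5, policies, value_fns))   # -> "go_right"
```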
arXiv Detail & Related papers (2024-05-27T01:08:23Z)
- Theoretical guarantees on the best-of-n alignment policy [110.21094183592358]
We show that the KL divergence between the best-of-$n$ policy and the base policy is equal to $\log(n) - (n-1)/n$.
We propose a new estimator for the KL divergence and empirically show that it provides a tight approximation through a few examples.
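For an atomless base distribution this closed form is easy to sanity-check numerically; the snippet below is our own Monte Carlo check under an assumed standard-normal reward model, not the estimator proposed in the paper.
```python
# Monte Carlo sanity check (our illustration) of KL(best-of-n || base) for a
# continuous base distribution: the best-of-n density is n * p(y) * F(y)^(n-1),
# so KL = E_{best-of-n}[log(n * F(Y)^(n-1))].
import numpy as np
from scipy.stats import norm


def kl_best_of_n_mc(n, num_samples=200_000, rng=np.random.default_rng(1)):
    draws = rng.standard_normal((num_samples, n))   # n i.i.d. base samples per row
    best = draws.max(axis=1)                        # one best-of-n sample per row
    return np.mean(np.log(n) + (n - 1) * norm.logcdf(best))


for n in (2, 4, 16):
    print(n, kl_best_of_n_mc(n), np.log(n) - (n - 1) / n)
```
For $n = 4$, both numbers come out near $\log 4 - 3/4 \approx 0.636$.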
arXiv Detail & Related papers (2024-01-03T18:39:13Z)
- Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel stochastic-greedy bandit (SGB) algorithm for multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z)
- Restless Bandits with Average Reward: Breaking the Uniform Global Attractor Assumption [12.471848976031904]
A fundamental goal is to efficiently compute policies that achieve a diminishing optimality gap as the number of arms, $N$, grows large.
Existing results on optimality all rely on the uniform global attractor property (UGAP), a complex and challenging-to-verify assumption.
We propose a general, simulation-based framework that converts any single-armed policy into a policy for the original $N$-armed problem.
arXiv Detail & Related papers (2023-05-31T21:26:43Z)
- Estimating Optimal Policy Value in General Linear Contextual Bandits [50.008542459050155]
In many bandit problems, the maximal reward achievable by a policy is often unknown in advance.
We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable.
We present a more practical, computationally efficient algorithm that estimates a problem-dependent upper bound on $V^*$.
arXiv Detail & Related papers (2023-02-19T01:09:24Z)
- Restless Bandits with Many Arms: Beating the Central Limit Theorem [25.639496138046546]
Finite-horizon restless bandits with multiple pulls per period play an important role in recommender systems, active learning, revenue management, and many other areas.
While an optimal policy can be computed, in principle, using dynamic programming, the computation required scales exponentially in the number of arms $N$.
We characterize a non-degeneracy condition and a class of novel practically-computable policies, called fluid-priority policies, in which the optimality gap is $O(1)$.
arXiv Detail & Related papers (2021-07-25T23:27:12Z)
- Thompson Sampling for Unimodal Bandits [21.514495320038712]
We propose a Thompson Sampling algorithm for \emph{unimodal} bandits, where the expected reward is unimodal over the partially ordered arms.
For Gaussian rewards, the regret of our algorithm is $\mathcal{O}(\log T)$, which is far better than standard Thompson Sampling algorithms.
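The unimodal structure is typically exploited by restricting the sampler to the neighborhood of the current empirical leader; the sketch below illustrates that idea on an assumed line graph of arms with Gaussian posteriors, and is not necessarily the exact algorithm analyzed in the paper.
```python
# Sketch (our illustration) of Thompson sampling restricted to the empirical
# leader's neighborhood on a line graph of arms, a common way to use unimodality.
import numpy as np


def unimodal_ts(true_means, horizon=10_000, rng=np.random.default_rng(0)):
    K = len(true_means)
    counts = np.ones(K)                                   # one initial pull per arm
    sums = np.array([rng.normal(m, 1.0) for m in true_means])

    for _ in range(horizon):
        mu_hat = sums / counts
        leader = int(np.argmax(mu_hat))
        # Neighborhood of the leader on a line graph: {leader-1, leader, leader+1}.
        neighbors = [a for a in (leader - 1, leader, leader + 1) if 0 <= a < K]
        # One posterior draw per neighboring arm (flat Gaussian prior).
        theta = {a: rng.normal(mu_hat[a], 1.0 / np.sqrt(counts[a])) for a in neighbors}
        arm = max(theta, key=theta.get)
        sums[arm] += rng.normal(true_means[arm], 1.0)
        counts[arm] += 1
    return counts
```
On a unimodal instance such as `true_means = [0.1, 0.3, 0.6, 0.4, 0.2]`, most pulls should concentrate on the arm with mean 0.6.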
arXiv Detail & Related papers (2021-06-15T14:40:34Z)
- MOTS: Minimax Optimal Thompson Sampling [89.2370817955411]
It has remained an open problem whether Thompson sampling can match the minimax lower bound $\Omega(\sqrt{KT})$ for $K$-armed bandit problems.
We propose a variant of Thompson sampling called MOTS that adaptively clips the sampling instance of the chosen arm at each time step.
We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$, as well as the optimal regret bound for Gaussian rewards when $T$ approaches infinity.
arXiv Detail & Related papers (2020-03-03T21:24:39Z)