Thompson Sampling for Real-Valued Combinatorial Pure Exploration of
Multi-Armed Bandit
- URL: http://arxiv.org/abs/2308.10238v3
- Date: Wed, 15 Nov 2023 11:00:20 GMT
- Title: Thompson Sampling for Real-Valued Combinatorial Pure Exploration of
Multi-Armed Bandit
- Authors: Shintaro Nakamura, Masashi Sugiyama
- Abstract summary: We study the real-valued pure exploration of the multi-armed bandit (R-CPE-MAB) problem.
We introduce an algorithm named the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm, which is the first algorithm that can work even when the size of the action set is exponentially large in $d$.
- Score: 65.268245109828
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the real-valued combinatorial pure exploration of the multi-armed
bandit (R-CPE-MAB) problem. In R-CPE-MAB, a player is given $d$ stochastic
arms, and the reward of each arm $s\in\{1, \ldots, d\}$ follows an unknown
distribution with mean $\mu_s$. In each time step, a player pulls a single arm
and observes its reward. The player's goal is to identify the optimal
\emph{action} $\boldsymbol{\pi}^{*} = \argmax_{\boldsymbol{\pi} \in
\mathcal{A}} \boldsymbol{\mu}^{\top}\boldsymbol{\pi}$ from a finite-sized
real-valued \emph{action set} $\mathcal{A}\subset \mathbb{R}^{d}$ with as few
arm pulls as possible. Previous methods in the R-CPE-MAB assume that the size
of the action set $\mathcal{A}$ is polynomial in $d$. We introduce an algorithm
named the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm,
which is the first algorithm that can work even when the size of the action set
is exponentially large in $d$. We also introduce a novel problem-dependent
sample complexity lower bound of the R-CPE-MAB problem, and show that the
GenTS-Explore algorithm achieves the optimal sample complexity up to a
problem-dependent constant factor.
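To make the setting concrete, here is a minimal Python sketch of posterior-sampling-based pure exploration for R-CPE-MAB. It assumes Gaussian rewards with known scale, a small explicitly enumerated action set, and heuristic arm-selection and stopping rules; the names (`ts_explore_rcpe`, `pull_arm`, `budget`, `n_posterior`) are illustrative, and this is not the paper's exact GenTS-Explore procedure, which additionally handles exponentially large action sets through a maximization oracle and carries sample-complexity guarantees.

```python
import numpy as np


def ts_explore_rcpe(action_set, pull_arm, budget=5000, n_posterior=100, sigma=1.0):
    """Posterior-sampling pure-exploration sketch for R-CPE-MAB.

    action_set : (num_actions, d) array of real-valued actions; assumed small and
                 explicit here, whereas GenTS-Explore itself targets exponentially
                 large action sets accessed through a maximization oracle.
    pull_arm   : callback pull_arm(s) returning one noisy reward of arm s.
    """
    d = action_set.shape[1]
    counts = np.ones(d)                               # one forced pull per arm to initialize
    sums = np.array([pull_arm(s) for s in range(d)], dtype=float)

    for _ in range(budget - d):
        mu_hat = sums / counts
        pi_hat = int(np.argmax(action_set @ mu_hat))  # empirically best action

        # Draw posterior samples of the mean vector (Gaussian posterior, flat prior).
        theta = np.random.normal(mu_hat, sigma / np.sqrt(counts), size=(n_posterior, d))
        sampled_best = np.argmax(theta @ action_set.T, axis=1)

        challengers = sampled_best[sampled_best != pi_hat]
        if challengers.size == 0:
            return action_set[pi_hat]                 # posterior agrees; heuristic stop

        # Heuristic arm choice: pull the arm that best separates pi_hat from the most
        # frequently sampled challenger, weighted by remaining uncertainty.
        pi_alt = int(np.bincount(challengers).argmax())
        gap = np.abs(action_set[pi_hat] - action_set[pi_alt])
        arm = int(np.argmax(gap / np.sqrt(counts)))

        sums[arm] += pull_arm(arm)
        counts[arm] += 1

    return action_set[int(np.argmax(action_set @ (sums / counts)))]
```

For instance, with the top-$k$ action set (indicator vectors of all size-$k$ subsets of the $d$ arms), the loop above reduces to a Thompson-sampling-flavored best-$k$-arms identification routine; the combinatorial structure enters only through the argmax over `action_set`.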
Related papers
- Optimal Exploration is no harder than Thompson Sampling [14.726673043806391]
A pure exploration linear bandit problem aims to return $\arg\max_{z \in \mathcal{Z}} z^{\top}\theta_{\ast}$ with high probability through noisy measurements of $x^{\top}\theta_{\ast}$ with $x \in \mathcal{X} \subset \mathbb{R}^{d}$.
This complexity is at odds with the popular and simple Thompson Sampling for regret minimization, which just requires access to a posterior sampling and argmax oracle, and does not need to enumerate $\mathcal{Z}$.
arXiv Detail & Related papers (2023-10-09T18:21:39Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near-optimal regret (a minimal empirical CVaR estimator is sketched after this list).
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - On the complexity of All $\varepsilon$-Best Arms Identification [2.1485350418225244]
We consider the problem of identifying all the $\varepsilon$-optimal arms in a finite multi-armed bandit with Gaussian rewards.
We propose a Track-and-Stop algorithm that identifies the set of $\varepsilon$-good arms w.h.p. and enjoys optimality (when $\delta$ goes to zero) in terms of the expected sample complexity.
arXiv Detail & Related papers (2022-02-13T10:54:52Z) - Near-Optimal Reinforcement Learning with Self-Play [50.29853537456737]
We focus on self-play algorithms, which learn the optimal policy by playing against themselves without any direct supervision.
We propose an optimistic variant of the \emph{Nash Q-learning} algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new \emph{Nash V-learning} algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$.
arXiv Detail & Related papers (2020-06-22T05:00:13Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Combinatorial Pure Exploration with Full-Bandit or Partial Linear
Feedback [18.29738891417779]
We first study the problem of combinatorial pure exploration with full-bandit feedback (CPE-BL).
In the generalized setting of combinatorial pure exploration with partial linear feedback (CPE-PL), each pull of action $x$ reports a random feedback vector with expectation $M_{x}\theta$, where $M_{x} \in \mathbb{R}^{d}$ is a transformation matrix for $x$, and gains a random (possibly nonlinear) reward related to $x$.
For CPE-PL, we develop the first polynomial-time algorithm, which simultaneously addresses limited feedback, general reward function, and action space.
arXiv Detail & Related papers (2020-06-14T13:59:59Z) - Explicit Best Arm Identification in Linear Bandits Using No-Regret
Learners [17.224805430291177]
We study the problem of best arm identification in linearly parameterised multi-armed bandits.
We propose an explicitly implementable and provably order-optimal sample-complexity algorithm to solve this problem.
arXiv Detail & Related papers (2020-06-13T05:00:01Z) - Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample
Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, a discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For small enough $\epsilon$, we present an improved algorithm with a better sample complexity.
arXiv Detail & Related papers (2020-06-06T13:34:41Z)
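As a side note on the risk-sensitive RL entry above, the CVaR$_{\tau}$ objective it optimizes can be illustrated with a minimal empirical estimator. The sorting-based plug-in estimate below and the function name `empirical_cvar` are illustrative assumptions for a quick sketch, not code or notation from that paper.

```python
import numpy as np


def empirical_cvar(returns, tau):
    """Plug-in estimate of lower-tail CVaR_tau: the mean of the worst
    tau-fraction of observed returns (for continuous distributions this is
    consistent with CVaR_tau(X) = sup_b { b - E[(b - X)_+] / tau })."""
    x = np.sort(np.asarray(returns, dtype=float))   # ascending: worst outcomes first
    k = max(1, int(np.ceil(tau * x.size)))          # number of tail samples to average
    return x[:k].mean()


# Example: with tau = 0.1, only the worst 10% of episode returns matter.
rng = np.random.default_rng(0)
returns = rng.normal(loc=1.0, scale=1.0, size=10_000)
print(empirical_cvar(returns, tau=0.1))             # roughly 1.0 - 1.75 = -0.75
```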