Thompson Sampling for Real-Valued Combinatorial Pure Exploration of
Multi-Armed Bandit
- URL: http://arxiv.org/abs/2308.10238v3
- Date: Wed, 15 Nov 2023 11:00:20 GMT
- Title: Thompson Sampling for Real-Valued Combinatorial Pure Exploration of
Multi-Armed Bandit
- Authors: Shintaro Nakamura, Masashi Sugiyama
- Abstract summary: We study the real-valued pure exploration of the multi-armed bandit (R-CPE-MAB) problem.
We introduce an algorithm named the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm, which is the first algorithm that can work even when the size of the action set is exponentially large in $d$.
- Score: 65.268245109828
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the real-valued combinatorial pure exploration of the multi-armed
bandit (R-CPE-MAB) problem. In R-CPE-MAB, a player is given $d$ stochastic
arms, and the reward of each arm $s\in\{1, \ldots, d\}$ follows an unknown
distribution with mean $\mu_s$. In each time step, a player pulls a single arm
and observes its reward. The player's goal is to identify the optimal
\emph{action} $\boldsymbol{\pi}^{*} = \argmax_{\boldsymbol{\pi} \in
\mathcal{A}} \boldsymbol{\mu}^{\top}\boldsymbol{\pi}$ from a finite-sized
real-valued \emph{action set} $\mathcal{A}\subset \mathbb{R}^{d}$ with as few
arm pulls as possible. Previous methods in the R-CPE-MAB assume that the size
of the action set $\mathcal{A}$ is polynomial in $d$. We introduce an algorithm
named the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm,
which is the first algorithm that can work even when the size of the action set
is exponentially large in $d$. We also introduce a novel problem-dependent
sample complexity lower bound of the R-CPE-MAB problem, and show that the
GenTS-Explore algorithm achieves the optimal sample complexity up to a
problem-dependent constant factor.
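To make the setting concrete, here is a minimal Python sketch of posterior-sampling-based pure exploration for R-CPE-MAB. It assumes Gaussian rewards with known scale, a small explicitly enumerated action set, and heuristic arm-selection and stopping rules; the names (`ts_explore_rcpe`, `pull_arm`, `budget`, `n_posterior`) are illustrative, and this is not the paper's exact GenTS-Explore procedure, which additionally handles exponentially large action sets through a maximization oracle and carries sample-complexity guarantees.

```python
import numpy as np


def ts_explore_rcpe(action_set, pull_arm, budget=5000, n_posterior=100, sigma=1.0):
    """Posterior-sampling pure-exploration sketch for R-CPE-MAB.

    action_set : (num_actions, d) array of real-valued actions; assumed small and
                 explicit here, whereas GenTS-Explore itself targets exponentially
                 large action sets accessed through a maximization oracle.
    pull_arm   : callback pull_arm(s) returning one noisy reward of arm s.
    """
    d = action_set.shape[1]
    counts = np.ones(d)                               # one forced pull per arm to initialize
    sums = np.array([pull_arm(s) for s in range(d)], dtype=float)

    for _ in range(budget - d):
        mu_hat = sums / counts
        pi_hat = int(np.argmax(action_set @ mu_hat))  # empirically best action

        # Draw posterior samples of the mean vector (Gaussian posterior, flat prior).
        theta = np.random.normal(mu_hat, sigma / np.sqrt(counts), size=(n_posterior, d))
        sampled_best = np.argmax(theta @ action_set.T, axis=1)

        challengers = sampled_best[sampled_best != pi_hat]
        if challengers.size == 0:
            return action_set[pi_hat]                 # posterior agrees; heuristic stop

        # Heuristic arm choice: pull the arm that best separates pi_hat from the most
        # frequently sampled challenger, weighted by remaining uncertainty.
        pi_alt = int(np.bincount(challengers).argmax())
        gap = np.abs(action_set[pi_hat] - action_set[pi_alt])
        arm = int(np.argmax(gap / np.sqrt(counts)))

        sums[arm] += pull_arm(arm)
        counts[arm] += 1

    return action_set[int(np.argmax(action_set @ (sums / counts)))]
```

For instance, with the top-$k$ action set (indicator vectors of all size-$k$ subsets of the $d$ arms), the loop above reduces to a Thompson-sampling-flavored best-$k$-arms identification routine; the combinatorial structure enters only through the argmax over `action_set`.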
Related papers
- Optimal Exploration is no harder than Thompson Sampling [14.726673043806391]
A pure exploration linear bandit problem aims to return $\arg\max_{z \in \mathcal{Z}} z^{\top}\theta_{\ast}$ with high probability through noisy measurements of $x^{\top}\theta_{\ast}$ with $x \in \mathcal{X} \subset \mathbb{R}^{d}$.
This complexity is at odds with the popular and simple Thompson Sampling for regret minimization, which just requires access to a posterior sampling and argmax oracle, and does not need to enumerate $\mathcal{Z}$.
arXiv Detail & Related papers (2023-10-09T18:21:39Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near-optimal regret (a minimal empirical CVaR estimator is sketched after this list).
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - On the complexity of All $\varepsilon$-Best Arms Identification [2.1485350418225244]
We consider the problem of identifying all the $\varepsilon$-optimal arms in a finite multi-armed bandit with Gaussian rewards.
We propose a Track-and-Stop algorithm that identifies the set of $\varepsilon$-good arms w.h.p. and enjoys optimality (when $\delta$ goes to zero) in terms of the expected sample complexity.
arXiv Detail & Related papers (2022-02-13T10:54:52Z) - Near-Optimal Reinforcement Learning with Self-Play [50.29853537456737]
We focus on self-play algorithms, which learn the optimal policy by playing against themselves without any direct supervision.
We propose an optimistic variant of the \emph{Nash Q-learning} algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new \emph{Nash V-learning} algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$.
arXiv Detail & Related papers (2020-06-22T05:00:13Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Combinatorial Pure Exploration with Full-Bandit or Partial Linear
Feedback [18.29738891417779]
We first study the problem of combinatorial pure exploration with full-bandit feedback (CPE-BL).
In the generalized setting of combinatorial pure exploration with partial linear feedback (CPE-PL), each pull of action $x$ reports a random feedback vector with expectation $M_{x}\theta$, where $M_{x} \in \mathbb{R}^{d}$ is a transformation matrix for $x$, and gains a random (possibly nonlinear) reward related to $x$.
For CPE-PL, we develop the first polynomial-time algorithm, which simultaneously addresses limited feedback, general reward function, and action space.
arXiv Detail & Related papers (2020-06-14T13:59:59Z) - Explicit Best Arm Identification in Linear Bandits Using No-Regret
Learners [17.224805430291177]
We study the problem of best arm identification in linearly parameterised multi-armed bandits.
We propose an explicitly implementable and provably order-optimal sample-complexity algorithm to solve this problem.
arXiv Detail & Related papers (2020-06-13T05:00:01Z) - Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample
Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, a discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For small enough $\epsilon$, we present an improved algorithm with a better sample complexity.
arXiv Detail & Related papers (2020-06-06T13:34:41Z)
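As a side note on the risk-sensitive RL entry above, the CVaR$_{\tau}$ objective it optimizes can be illustrated with a minimal empirical estimator. The sorting-based plug-in estimate below and the function name `empirical_cvar` are illustrative assumptions for a quick sketch, not code or notation from that paper.

```python
import numpy as np


def empirical_cvar(returns, tau):
    """Plug-in estimate of lower-tail CVaR_tau: the mean of the worst
    tau-fraction of observed returns (for continuous distributions this is
    consistent with CVaR_tau(X) = sup_b { b - E[(b - X)_+] / tau })."""
    x = np.sort(np.asarray(returns, dtype=float))   # ascending: worst outcomes first
    k = max(1, int(np.ceil(tau * x.size)))          # number of tail samples to average
    return x[:k].mean()


# Example: with tau = 0.1, only the worst 10% of episode returns matter.
rng = np.random.default_rng(0)
returns = rng.normal(loc=1.0, scale=1.0, size=10_000)
print(empirical_cvar(returns, tau=0.1))             # roughly 1.0 - 1.75 = -0.75
```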