Feel-Good Thompson Sampling for Contextual Bandits and Reinforcement
Learning
- URL: http://arxiv.org/abs/2110.00871v1
- Date: Sat, 2 Oct 2021 20:10:40 GMT
- Title: Feel-Good Thompson Sampling for Contextual Bandits and Reinforcement
Learning
- Authors: Tong Zhang
- Abstract summary: We present a theoretical analysis of Thompson Sampling, with a focus on frequentist regret bounds.
We show that the standard Thompson Sampling is not aggressive enough in exploring new actions, leading to suboptimality in some pessimistic situations.
We show that the theoretical framework can be used to derive Bayesian regret bounds for standard Thompson Sampling, and frequentist regret bounds for Feel-Good Thompson Sampling.
- Score: 17.860102738896096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Thompson Sampling has been widely used for contextual bandit problems due to
the flexibility of its modeling power. However, a general theory for this class
of methods in the frequentist setting is still lacking. In this paper, we
present a theoretical analysis of Thompson Sampling, with a focus on
frequentist regret bounds. In this setting, we show that the standard Thompson
Sampling is not aggressive enough in exploring new actions, leading to
suboptimality in some pessimistic situations. A simple modification called
Feel-Good Thompson Sampling, which favors high reward models more aggressively
than the standard Thompson Sampling, is proposed to remedy this problem. We
show that the theoretical framework can be used to derive Bayesian regret
bounds for standard Thompson Sampling, and frequentist regret bounds for
Feel-Good Thompson Sampling. It is shown that in both cases, we can reduce the
bandit regret problem to online least squares regression estimation. For the
frequentist analysis, the online least squares regression bound can be directly
obtained using online aggregation techniques which have been well studied. The
resulting bandit regret bound matches the minimax lower bound in the finite
action case. Moreover, the analysis can be generalized to handle a class of
linearly embeddable contextual bandit problems (which generalizes the popular
linear contextual bandit model). The obtained result again matches the minimax
lower bound. Finally we illustrate that the analysis can be extended to handle
some MDP problems.
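To make the feel-good modification concrete, here is a minimal Python sketch over a finite model class with discrete contexts and actions. The least-squares weight eta, the feel-good weight lam, the cap on the bonus, and the tabular model class are illustrative assumptions of this sketch, not the paper's exact construction; only the shape of the update (a squared-error fit term plus a bonus for models whose best-case reward is high) follows the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite model class: each model is a table of predicted rewards indexed by
# (context, action). A tabular class is an assumption of this sketch only.
n_contexts, n_actions, n_models = 5, 4, 50
models = rng.uniform(0.0, 1.0, size=(n_models, n_contexts, n_actions))
true_model = models[0]          # pretend the first model is the ground truth

eta = 1.0    # least-squares likelihood weight
lam = 0.1    # feel-good bonus weight; lam = 0 recovers standard Thompson Sampling
cap = 1.0    # truncation level for the feel-good bonus

log_post = np.zeros(n_models)   # log posterior weights under a uniform prior


def sample_model(log_w):
    """Draw a model index from the normalized posterior weights."""
    w = np.exp(log_w - log_w.max())
    return rng.choice(len(log_w), p=w / w.sum())


for t in range(1000):
    x = rng.integers(n_contexts)                        # observe a context
    f = models[sample_model(log_post)]                  # posterior sample
    a = int(np.argmax(f[x]))                            # act greedily on the sample
    r = true_model[x, a] + 0.1 * rng.standard_normal()  # noisy reward

    # Posterior update: a least-squares fit term plus the feel-good bonus,
    # which up-weights models that promise a high best-case reward on x.
    fit = -eta * (models[:, x, a] - r) ** 2
    feel_good = lam * np.minimum(cap, models[:, x].max(axis=1))
    log_post += fit + feel_good
```

Setting lam to zero recovers the standard Thompson Sampling posterior; the extra term keeps optimistic models alive in the posterior, which is the additional exploration the abstract argues is needed for frequentist guarantees.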
Related papers
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
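As a rough illustration of reward estimation from preference feedback, the sketch below fits a small network with a Bradley-Terry style logistic loss on (winner, loser) arm-feature pairs. The architecture, the loss, and all names here are generic assumptions for illustration; they are not claimed to be the construction used in the cited paper, which also couples the estimate with an exploration rule.

```python
import torch
import torch.nn as nn

# Hypothetical reward network: maps an arm's feature vector to a scalar score.
class RewardNet(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def preference_loss(model, x_winner, x_loser):
    """Bradley-Terry / logistic likelihood: P(winner beats loser) = sigmoid(f(w) - f(l))."""
    logits = model(x_winner) - model(x_loser)
    return nn.functional.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

# Toy usage on random duels; in a bandit loop these pairs would come from the
# arms actually selected and the observed binary preference feedback.
dim = 8
model = RewardNet(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x_w, x_l = torch.randn(32, dim), torch.randn(32, dim)
for _ in range(100):
    opt.zero_grad()
    loss = preference_loss(model, x_w, x_l)
    loss.backward()
    opt.step()
```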
- Thompson Exploration with Best Challenger Rule in Best Arm Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
- Analysis of Thompson Sampling for Partially Observable Contextual Multi-Armed Bandits [1.8275108630751844]
We propose a Thompson Sampling algorithm for partially observable contextual multi-armed bandits.
We show that the regret of the presented policy scales logarithmically with time and the number of arms, and linearly with the dimension.
- A Unifying Theory of Thompson Sampling for Continuous Risk-Averse Bandits [91.3755431537592]
This paper unifies the analysis of risk-averse Thompson sampling algorithms for the multi-armed bandit problem.
Using the contraction principle in the theory of large deviations, we prove novel concentration bounds for continuous risk functionals.
We show that a wide class of risk functionals as well as "nice" functions of them satisfy the continuity condition.
- On the Suboptimality of Thompson Sampling in High Dimensions [7.198362232890585]
We demonstrate that Thompson Sampling is sub-optimal for semi-bandits.
We show that Thompson Sampling can indeed perform very poorly in high dimensions.
- Analysis and Design of Thompson Sampling for Stochastic Partial Monitoring [91.22679787578438]
We present a novel Thompson-sampling-based algorithm for partial monitoring.
We prove that the new algorithm achieves the logarithmic problem-dependent expected pseudo-regret $\mathrm{O}(\log T)$ for a linearized variant of the problem with local observability.
- Latent Bandits Revisited [55.88616813182679]
A latent bandit problem is one in which the learning agent knows the arm reward distributions conditioned on an unknown discrete latent state.
We propose general algorithms for this setting, based on both upper confidence bounds (UCBs) and Thompson sampling.
We provide a unified theoretical analysis of our algorithms, which achieve lower regret than classic bandit policies when the number of latent states is smaller than the number of actions.
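The Thompson-sampling variant for this setting can be sketched in a few lines: keep a posterior over the discrete latent state, sample a state each round, and pull the arm that is best under the sampled state. The Gaussian reward model and the fixed noise level below are simplifying assumptions of this sketch; the cited paper handles more general conditional distributions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Known conditional model: mean reward of each arm under each latent state.
# In a latent bandit the agent knows this table but not the true state.
n_states, n_arms = 3, 5
means = rng.uniform(0, 1, size=(n_states, n_arms))
sigma = 0.1
true_state = 2

log_post = np.zeros(n_states)          # uniform prior over latent states

for t in range(500):
    # Thompson step: sample a latent state from the posterior, act as if it were true.
    p = np.exp(log_post - log_post.max())
    p /= p.sum()
    s = rng.choice(n_states, p=p)
    a = int(np.argmax(means[s]))

    r = means[true_state, a] + sigma * rng.standard_normal()

    # Bayesian update: Gaussian log-likelihood of the observed reward under each state.
    log_post += -0.5 * ((r - means[:, a]) / sigma) ** 2
```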
- Thompson Sampling for Linearly Constrained Bandits [18.477962150017095]
We describe LinConTS, an algorithm for bandit problems with a linear constraint on the probability of earning a reward in every round.
We show that for LinConTS, both the regret and the cumulative constraint violations are upper bounded by $O(\log T)$ for the suboptimal arms.
- MOTS: Minimax Optimal Thompson Sampling [89.2370817955411]
It has remained an open problem whether Thompson sampling can match the minimax lower bound $\Omega(\sqrt{KT})$ for $K$-armed bandit problems.
We propose a variant of Thompson sampling called MOTS that adaptively clips the sampling instance of the chosen arm at each time step.
We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$, as well as the optimal regret bound for Gaussian rewards when $T$ approaches infinity.
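The clipping idea can be illustrated with a toy Gaussian K-armed bandit: draw a posterior-style sample for each arm, cap it at a confidence-bound-style threshold, and play the arm with the largest clipped sample. The specific threshold and constants below are placeholders for illustration; the exact MOTS clipping schedule and its tuning are given in the cited paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy K-armed Gaussian bandit with a clipped Thompson-sampling rule.
K, T = 10, 5000
true_means = rng.uniform(0, 1, size=K)
counts = np.ones(K)                                   # play each arm once to initialize
sums = true_means + rng.standard_normal(K)

for t in range(K, T):
    mu_hat = sums / counts
    theta = rng.normal(mu_hat, 1.0 / np.sqrt(counts))             # posterior samples
    # UCB-style cap (illustrative, not the exact MOTS threshold).
    cap = mu_hat + np.sqrt(np.maximum(0.0, np.log(T / (K * counts))) / counts)
    theta = np.minimum(theta, cap)                    # clip overly optimistic samples
    a = int(np.argmax(theta))
    r = true_means[a] + rng.standard_normal()
    counts[a] += 1
    sums[a] += r
```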
- On Thompson Sampling for Smoother-than-Lipschitz Bandits [6.929312022493406]
We provide the first bounds on the regret of Thompson Sampling for continuum armed bandits under weak conditions.
Our bounds are realised by analysis of the eluder dimension.
We derive a new bound on the eluder dimension for classes of functions with Lipschitz derivatives.
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.