Feel-Good Thompson Sampling for Contextual Bandits and Reinforcement
Learning
- URL: http://arxiv.org/abs/2110.00871v1
- Date: Sat, 2 Oct 2021 20:10:40 GMT
- Title: Feel-Good Thompson Sampling for Contextual Bandits and Reinforcement
Learning
- Authors: Tong Zhang
- Abstract summary: We present a theoretical analysis of Thompson Sampling, with a focus on frequentist regret bounds.
We show that the standard Thompson Sampling is not aggressive enough in exploring new actions, leading to suboptimality in some pessimistic situations.
We show that the theoretical framework can be used to derive Bayesian regret bounds for standard Thompson Sampling, and frequentist regret bounds for Feel-Good Thompson Sampling.
- Score: 17.860102738896096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Thompson Sampling has been widely used for contextual bandit problems due to
the flexibility of its modeling power. However, a general theory for this class
of methods in the frequentist setting is still lacking. In this paper, we
present a theoretical analysis of Thompson Sampling, with a focus on
frequentist regret bounds. In this setting, we show that the standard Thompson
Sampling is not aggressive enough in exploring new actions, leading to
suboptimality in some pessimistic situations. A simple modification called
Feel-Good Thompson Sampling, which favors high reward models more aggressively
than the standard Thompson Sampling, is proposed to remedy this problem. We
show that the theoretical framework can be used to derive Bayesian regret
bounds for standard Thompson Sampling, and frequentist regret bounds for
Feel-Good Thompson Sampling. It is shown that in both cases, we can reduce the
bandit regret problem to online least squares regression estimation. For the
frequentist analysis, the online least squares regression bound can be directly
obtained using online aggregation techniques which have been well studied. The
resulting bandit regret bound matches the minimax lower bound in the finite
action case. Moreover, the analysis can be generalized to handle a class of
linearly embeddable contextual bandit problems (which generalizes the popular
linear contextual bandit model). The obtained result again matches the minimax
lower bound. Finally we illustrate that the analysis can be extended to handle
some MDP problems.
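To make the feel-good modification concrete, here is a minimal Python sketch over a finite model class with discrete contexts and actions. The least-squares weight eta, the feel-good weight lam, the cap on the bonus, and the tabular model class are illustrative assumptions of this sketch, not the paper's exact construction; only the shape of the update (a squared-error fit term plus a bonus for models whose best-case reward is high) follows the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite model class: each model is a table of predicted rewards indexed by
# (context, action). A tabular class is an assumption of this sketch only.
n_contexts, n_actions, n_models = 5, 4, 50
models = rng.uniform(0.0, 1.0, size=(n_models, n_contexts, n_actions))
true_model = models[0]          # pretend the first model is the ground truth

eta = 1.0    # least-squares likelihood weight
lam = 0.1    # feel-good bonus weight; lam = 0 recovers standard Thompson Sampling
cap = 1.0    # truncation level for the feel-good bonus

log_post = np.zeros(n_models)   # log posterior weights under a uniform prior


def sample_model(log_w):
    """Draw a model index from the normalized posterior weights."""
    w = np.exp(log_w - log_w.max())
    return rng.choice(len(log_w), p=w / w.sum())


for t in range(1000):
    x = rng.integers(n_contexts)                        # observe a context
    f = models[sample_model(log_post)]                  # posterior sample
    a = int(np.argmax(f[x]))                            # act greedily on the sample
    r = true_model[x, a] + 0.1 * rng.standard_normal()  # noisy reward

    # Posterior update: a least-squares fit term plus the feel-good bonus,
    # which up-weights models that promise a high best-case reward on x.
    fit = -eta * (models[:, x, a] - r) ** 2
    feel_good = lam * np.minimum(cap, models[:, x].max(axis=1))
    log_post += fit + feel_good
```

Setting lam to zero recovers the standard Thompson Sampling posterior; the extra term keeps optimistic models alive in the posterior, which is the additional exploration the abstract argues is needed for frequentist guarantees.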
Related papers
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
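As a rough illustration of reward estimation from preference feedback, the sketch below fits a small network with a Bradley-Terry style logistic loss on (winner, loser) arm-feature pairs. The architecture, the loss, and all names here are generic assumptions for illustration; they are not claimed to be the construction used in the cited paper, which also couples the estimate with an exploration rule.

```python
import torch
import torch.nn as nn

# Hypothetical reward network: maps an arm's feature vector to a scalar score.
class RewardNet(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def preference_loss(model, x_winner, x_loser):
    """Bradley-Terry / logistic likelihood: P(winner beats loser) = sigmoid(f(w) - f(l))."""
    logits = model(x_winner) - model(x_loser)
    return nn.functional.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

# Toy usage on random duels; in a bandit loop these pairs would come from the
# arms actually selected and the observed binary preference feedback.
dim = 8
model = RewardNet(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x_w, x_l = torch.randn(32, dim), torch.randn(32, dim)
for _ in range(100):
    opt.zero_grad()
    loss = preference_loss(model, x_w, x_l)
    loss.backward()
    opt.step()
```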
- Thompson Exploration with Best Challenger Rule in Best Arm Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
- Analysis of Thompson Sampling for Partially Observable Contextual Multi-Armed Bandits [1.8275108630751844]
We propose a Thompson Sampling algorithm for partially observable contextual multi-armed bandits.
We show that the regret of the presented policy scales logarithmically with time and the number of arms, and linearly with the dimension.
- A Unifying Theory of Thompson Sampling for Continuous Risk-Averse Bandits [91.3755431537592]
This paper unifies the analysis of risk-averse Thompson sampling algorithms for the multi-armed bandit problem.
Using the contraction principle in the theory of large deviations, we prove novel concentration bounds for continuous risk functionals.
We show that a wide class of risk functionals as well as "nice" functions of them satisfy the continuity condition.
- On the Suboptimality of Thompson Sampling in High Dimensions [7.198362232890585]
We demonstrate that Thompson Sampling is sub-optimal for semi-bandits.
We show that Thompson Sampling can indeed perform very poorly in high dimensions.
- Analysis and Design of Thompson Sampling for Stochastic Partial Monitoring [91.22679787578438]
We present a novel Thompson-sampling-based algorithm for partial monitoring.
We prove that the new algorithm achieves the logarithmic problem-dependent expected pseudo-regret $\mathrm{O}(\log T)$ for a linearized variant of the problem with local observability.
- Latent Bandits Revisited [55.88616813182679]
A latent bandit problem is one in which the learning agent knows the arm reward distributions conditioned on an unknown discrete latent state.
We propose general algorithms for this setting, based on both upper confidence bounds (UCBs) and Thompson sampling.
We provide a unified theoretical analysis of our algorithms, which achieve lower regret than classic bandit policies when the number of latent states is smaller than the number of actions.
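The Thompson-sampling variant for this setting can be sketched in a few lines: keep a posterior over the discrete latent state, sample a state each round, and pull the arm that is best under the sampled state. The Gaussian reward model and the fixed noise level below are simplifying assumptions of this sketch; the cited paper handles more general conditional distributions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Known conditional model: mean reward of each arm under each latent state.
# In a latent bandit the agent knows this table but not the true state.
n_states, n_arms = 3, 5
means = rng.uniform(0, 1, size=(n_states, n_arms))
sigma = 0.1
true_state = 2

log_post = np.zeros(n_states)          # uniform prior over latent states

for t in range(500):
    # Thompson step: sample a latent state from the posterior, act as if it were true.
    p = np.exp(log_post - log_post.max())
    p /= p.sum()
    s = rng.choice(n_states, p=p)
    a = int(np.argmax(means[s]))

    r = means[true_state, a] + sigma * rng.standard_normal()

    # Bayesian update: Gaussian log-likelihood of the observed reward under each state.
    log_post += -0.5 * ((r - means[:, a]) / sigma) ** 2
```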
- Thompson Sampling for Linearly Constrained Bandits [18.477962150017095]
We describe LinConTS, an algorithm for bandit problems with a linear constraint on the probability of earning a reward in every round.
We show that for LinConTS, both the regret and the cumulative constraint violations are upper bounded by $O(\log T)$ for the suboptimal arms.
- MOTS: Minimax Optimal Thompson Sampling [89.2370817955411]
It has remained an open problem whether Thompson sampling can match the minimax lower bound $\Omega(\sqrt{KT})$ for $K$-armed bandit problems.
We propose a variant of Thompson sampling called MOTS that adaptively clips the sampling instance of the chosen arm at each time step.
We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$, as well as the optimal regret bound for Gaussian rewards when $T$ approaches infinity.
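The clipping idea can be illustrated with a toy Gaussian K-armed bandit: draw a posterior-style sample for each arm, cap it at a confidence-bound-style threshold, and play the arm with the largest clipped sample. The specific threshold and constants below are placeholders for illustration; the exact MOTS clipping schedule and its tuning are given in the cited paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy K-armed Gaussian bandit with a clipped Thompson-sampling rule.
K, T = 10, 5000
true_means = rng.uniform(0, 1, size=K)
counts = np.ones(K)                                   # play each arm once to initialize
sums = true_means + rng.standard_normal(K)

for t in range(K, T):
    mu_hat = sums / counts
    theta = rng.normal(mu_hat, 1.0 / np.sqrt(counts))             # posterior samples
    # UCB-style cap (illustrative, not the exact MOTS threshold).
    cap = mu_hat + np.sqrt(np.maximum(0.0, np.log(T / (K * counts))) / counts)
    theta = np.minimum(theta, cap)                    # clip overly optimistic samples
    a = int(np.argmax(theta))
    r = true_means[a] + rng.standard_normal()
    counts[a] += 1
    sums[a] += r
```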
- On Thompson Sampling for Smoother-than-Lipschitz Bandits [6.929312022493406]
We provide the first bounds on the regret of Thompson Sampling for continuum armed bandits under weak conditions.
Our bounds are realised by analysis of the eluder dimension.
We derive a new bound on the eluder dimension for classes of functions with Lipschitz derivatives.
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.