Thompson Sampling Regret Bounds for Contextual Bandits with sub-Gaussian rewards
- URL: http://arxiv.org/abs/2304.13593v1
- Date: Wed, 26 Apr 2023 14:40:01 GMT
- Title: Thompson Sampling Regret Bounds for Contextual Bandits with sub-Gaussian rewards
- Authors: Amaury Gouverneur, Borja Rodríguez-Gálvez, Tobias J. Oechtering, and Mikael Skoglund
- Abstract summary: We study the performance of the Thompson Sampling algorithm for Contextual Bandit problems.
We introduce new bounds on the lifted information ratio that hold for sub-Gaussian rewards.
- Score: 44.025369660607645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we study the performance of the Thompson Sampling algorithm for
Contextual Bandit problems based on the framework introduced by Neu et al. and
their concept of lifted information ratio. First, we prove a comprehensive
bound on the Thompson Sampling expected cumulative regret that depends on the
mutual information of the environment parameters and the history. Then, we
introduce new bounds on the lifted information ratio that hold for sub-Gaussian
rewards, thus generalizing the results from Neu et al., whose analysis requires
binary rewards. Finally, we provide explicit regret bounds for the special
cases of unstructured bounded contextual bandits, structured bounded contextual
bandits with Laplace likelihood, structured Bernoulli bandits, and bounded
linear contextual bandits.
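To make the object of study concrete, the following is a minimal sketch of Thompson Sampling for a linear-Gaussian contextual bandit; all dimensions, priors, and noise parameters here are illustrative assumptions, not the paper's setting:

    import numpy as np

    rng = np.random.default_rng(0)
    d, K, T = 5, 10, 2000        # feature dim, actions, horizon -- illustrative only
    sigma = 0.5                  # Gaussian (hence sub-Gaussian) noise scale, assumed
    theta_star = rng.normal(size=d) / np.sqrt(d)  # unknown environment parameter

    # Conjugate Gaussian posterior over theta: precision V, mean solve(V, b).
    V = np.eye(d)                # prior N(0, I) expressed as its precision
    b = np.zeros(d)

    for t in range(T):
        contexts = rng.normal(size=(K, d))    # one feature vector per action
        mu = np.linalg.solve(V, b)
        theta_t = rng.multivariate_normal(mu, np.linalg.inv(V))  # posterior sample
        a = int(np.argmax(contexts @ theta_t))   # act greedily w.r.t. the sample
        r = contexts[a] @ theta_star + sigma * rng.normal()
        V += np.outer(contexts[a], contexts[a]) / sigma**2       # Bayesian update
        b += r * contexts[a] / sigma**2

The regret the paper analyzes is the expected shortfall of the rewards collected by such a loop relative to an oracle that knows theta_star; its bounds control this expectation through the mutual information between the environment parameters and the interaction history.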
Related papers
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
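As a rough sketch of reward estimation from preference feedback, a generic Bradley-Terry style objective looks as follows; the network shape, feature size, and training loop are illustrative assumptions, not the paper's exact algorithm:

    import torch
    import torch.nn as nn

    # Hypothetical reward network: maps a context-action feature vector to a scalar.
    reward_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

    def preference_loss(x_win, x_lose):
        """Bradley-Terry log-likelihood: P(win beats lose) = sigmoid(f(win) - f(lose))."""
        diff = reward_net(x_win) - reward_net(x_lose)
        return nn.functional.softplus(-diff).mean()   # equals -log sigmoid(diff)

    # One gradient step on a batch of observed duels (random data for illustration).
    x_win, x_lose = torch.randn(32, 8), torch.randn(32, 8)
    opt.zero_grad()
    preference_loss(x_win, x_lose).backward()
    opt.step()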
arXiv Detail & Related papers (2024-07-24T09:23:22Z)
- Feel-Good Thompson Sampling for Contextual Dueling Bandits [49.450050682705026]
We propose a Thompson sampling algorithm, named FGTS.CDB, for linear contextual dueling bandits.
At the core of our algorithm is a new Feel-Good exploration term specifically tailored for dueling bandits.
Our algorithm achieves nearly minimax-optimal regret, i.e., $\tilde{\mathcal{O}}(d\sqrt{T})$, where $d$ is the model dimension and $T$ is the time horizon.
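Schematically, the "Feel-Good" construction (going back to Zhang's Feel-Good Thompson Sampling; stated from memory and simplified relative to the dueling-bandit variant used here) tilts the posterior by an optimism bonus:

    \[
      p(\theta \mid \mathcal{D}_t) \;\propto\; p_0(\theta)\,
      \exp\!\Big( \sum_{s \le t} \big[ -\eta\,(f_\theta(x_s,a_s)-r_s)^2
      \;+\; \lambda \min\{b,\ \max_a f_\theta(x_s,a)\} \big] \Big)
    \]

The $\lambda$-term rewards parameters that predict a high best-achievable reward, counteracting the pessimism of the plain least-squares posterior.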
arXiv Detail & Related papers (2024-04-09T04:45:18Z)
- Information Capacity Regret Bounds for Bandits with Mediator Feedback [55.269551124587224]
We introduce the policy set capacity as an information-theoretic measure for the complexity of the policy set.
Adopting the classical EXP4 algorithm, we provide new regret bounds depending on the policy set capacity.
For a selection of policy set families, we prove nearly-matching lower bounds, scaling similarly with the capacity.
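For reference, classical EXP4 maintains exponential weights over the policy (expert) set and mixes their advice; here is a minimal sketch with toy rewards and sizes, without the capacity-dependent tuning studied in the paper:

    import numpy as np

    rng = np.random.default_rng(1)
    N, K, T = 4, 3, 1000                    # experts, actions, horizon -- illustrative
    eta = np.sqrt(2 * np.log(N) / (T * K))  # standard learning rate for EXP4
    logw = np.zeros(N)                      # log-weights over the policy set

    for t in range(T):
        advice = rng.dirichlet(np.ones(K), size=N)  # each expert's action distribution
        w = np.exp(logw - logw.max()); w /= w.sum()
        p = w @ advice                              # mixture distribution over actions
        a = rng.choice(K, p=p)
        r = float(rng.random() < 0.5 + 0.1 * (a == 0))  # toy Bernoulli reward
        r_hat = np.zeros(K); r_hat[a] = r / p[a]    # importance-weighted estimate
        logw += eta * (advice @ r_hat)              # exponential-weights update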
arXiv Detail & Related papers (2024-02-15T19:18:47Z)
- Thompson Sampling for Stochastic Bandits with Noisy Contexts: An Information-Theoretic Regret Analysis [4.297070083645049]
We investigate a contextual linear bandit problem where the agent observes a noisy, corrupted version of the true context.
Our objective is to design an action policy that can "approximate" that of an oracle.
arXiv Detail & Related papers (2024-01-21T18:57:38Z)
- Contextual bandits with concave rewards, and an application to fair ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z)
- Lifting the Information Ratio: An Information-Theoretic Analysis of Thompson Sampling for Contextual Bandits [17.470829701201435]
We adapt the information-theoretic perspective of Russo and Van Roy [2016] to the contextual setting by introducing a new concept of information ratio.
This allows us to bound the regret in terms of the entropy of the prior distribution through a remarkably simple proof.
An interesting special case is that of logistic bandits with $d$-dimensional parameters, $K$ actions, and Lipschitz logits, for which we provide a $\widetilde{O}(\sqrt{dKT})$ regret upper-bound that does not depend on the smallest slope of the sigmoid link function.
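For orientation, the generic information-ratio argument of Russo and Van Roy bounds regret roughly as follows; this is a schematic statement, and the papers should be consulted for exact constants and conditions:

    \[
      \mathbb{E}\!\left[\mathrm{Regret}(T)\right] \;\le\; \sqrt{\bar{\Gamma}\, T\, H(A^\star)}
    \]

Here $\bar{\Gamma}$ bounds the per-round information ratio and $H(A^\star)$ is the entropy of the optimal action under the prior; the lifted contextual analysis plays the same game with the entropy of the prior over environment parameters.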
arXiv Detail & Related papers (2022-05-27T12:04:07Z)
- Thompson Sampling for Bandits with Clustered Arms [7.237493755167875]
We show, theoretically and empirically, how exploiting a given cluster structure can significantly improve the regret and computational cost.
Our algorithms perform well compared to previously proposed algorithms for bandits with clustered arms.
arXiv Detail & Related papers (2021-09-06T08:58:01Z)
- A Unifying Theory of Thompson Sampling for Continuous Risk-Averse Bandits [91.3755431537592]
This paper unifies the analysis of risk-averse Thompson sampling algorithms for the multi-armed bandit problem.
Using the contraction principle in the theory of large deviations, we prove novel concentration bounds for continuous risk functionals.
We show that a wide class of risk functionals as well as "nice" functions of them satisfy the continuity condition.
arXiv Detail & Related papers (2021-08-25T17:09:01Z)
- Greedy Bandits with Sampled Context [0.0]
Greedy Bandits with Sampled Context (GB-SC) is a method for contextual multi-armed bandits that builds the prior from context information.
Our results show competitive performance on the Mushroom environment in terms of expected regret and expected cumulative regret.
arXiv Detail & Related papers (2020-07-27T17:17:45Z)
- On Thompson Sampling for Smoother-than-Lipschitz Bandits [6.929312022493406]
We provide the first bounds on the regret of Thompson Sampling for continuum armed bandits under weak conditions.
Our bounds are realised by analysis of the eluder dimension.
We derive a new bound on the eluder dimension for classes of functions with Lipschitz derivatives.
arXiv Detail & Related papers (2020-01-08T00:46:13Z)