An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandits
- URL: http://arxiv.org/abs/2412.02861v2
- Date: Thu, 20 Feb 2025 18:24:53 GMT
- Title: An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandits
- Authors: Amaury Gouverneur, Borja Rodríguez-Gálvez, Tobias J. Oechtering, Mikael Skoglund
- Abstract summary: We study the performance of the Thompson Sampling algorithm for logistic bandit problems.
We derive a bound of order $O(d/\alpha\sqrt{T \log(\beta T/d)})$ on the Bayesian expected regret incurred by Thompson Sampling after $T$ time steps.
- Score: 36.37704574907495
- License:
- Abstract: We study the performance of the Thompson Sampling algorithm for logistic bandit problems. In this setting, an agent receives binary rewards with probabilities determined by a logistic function, $\exp(\beta \langle a, \theta \rangle)/(1+\exp(\beta \langle a, \theta \rangle))$, with slope parameter $\beta>0$, and where both the action $a\in \mathcal{A}$ and parameter $\theta \in \mathcal{O}$ lie within the $d$-dimensional unit ball. Adopting the information-theoretic framework introduced by Russo and Van Roy (2016), we analyze the information ratio, a statistic that quantifies the trade-off between the immediate regret incurred and the information gained about the optimal action. We improve upon previous results by establishing that the information ratio is bounded by $\tfrac{9}{2}d\alpha^{-2}$, where $\alpha$ is a minimax measure of the alignment between the action space $\mathcal{A}$ and the parameter space $\mathcal{O}$, and is independent of $\beta$. Using this result, we derive a bound of order $O(d/\alpha\sqrt{T \log(\beta T/d)})$ on the Bayesian expected regret of Thompson Sampling incurred after $T$ time steps. To our knowledge, this is the first regret bound for logistic bandits that depends only logarithmically on $\beta$ while being independent of the number of actions. In particular, when the action space contains the parameter space, the bound on the expected regret is of order $\tilde{O}(d \sqrt{T})$.
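To make the setting concrete, here is a minimal, self-contained sketch of Thompson Sampling on a logistic bandit with a finite action set drawn from the unit sphere, using a particle approximation of the posterior over $\theta$. All constants (`n_actions`, `n_particles`, the horizon `T`) are illustrative assumptions, and the paper's analysis concerns exact posterior sampling rather than this approximation.

```python
# Hedged sketch: Thompson Sampling for a logistic bandit with binary rewards
# p(r=1 | a) = sigmoid(beta * <a, theta>), posterior approximated by particles.
import numpy as np

rng = np.random.default_rng(0)
d, n_actions, n_particles, beta, T = 3, 50, 2000, 2.0, 1000

def unit_rows(n, d):
    """Sample n points uniformly on the d-dimensional unit sphere."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

actions = unit_rows(n_actions, d)      # finite action set A (an assumption)
theta_true = unit_rows(1, d)[0]        # unknown parameter in the unit ball
particles = unit_rows(n_particles, d)  # prior: uniform on the sphere
log_w = np.zeros(n_particles)          # log posterior weights

regret = 0.0
best_mean = sigmoid(beta * actions @ theta_true).max()
for t in range(T):
    # Thompson step: draw theta from the (approximate) posterior,
    # then play the action that is greedy for that draw.
    w = np.exp(log_w - log_w.max())
    theta_s = particles[rng.choice(n_particles, p=w / w.sum())]
    a_idx = np.argmax(actions @ theta_s)
    p = sigmoid(beta * actions[a_idx] @ theta_true)
    r = rng.random() < p               # binary reward
    # Bayesian update: reweight particles by the logistic likelihood.
    logits = beta * particles @ actions[a_idx]
    log_w += np.log(sigmoid(logits) if r else 1.0 - sigmoid(logits))
    regret += best_mean - p

print(f"cumulative regret after {T} steps: {regret:.2f}")
```

Each round samples a plausible $\theta$ from the posterior, plays the greedy action for that sample, and reweights the particles by the likelihood of the observed binary reward.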
Related papers
- New Rates in Stochastic Decision-Theoretic Online Learning under Differential Privacy [17.711455925206298]
Hu and Mehta (2024) posed an open problem: what is the optimal instance-dependent rate for decision-theoretic online learning (with $K$ actions and $T$ rounds) under $\varepsilon$-differential privacy?
In this paper, we partially address this open problem with two new results. First, we provide an improved upper bound for this problem of $O\left(\frac{\log K}{\Delta_{\min}} + \frac{\log^2 K}{\varepsilon}\right)$, where the $T$-dependency has been removed.
Second, …
arXiv Detail & Related papers (2025-02-16T05:13:51Z) - Low-rank Matrix Bandits with Heavy-tailed Rewards [55.03293214439741]
We study the problem of low-rank matrix bandits with heavy-tailed rewards (LowHTR).
By truncating the observed payoffs and employing dynamic exploration, we propose a novel algorithm called LOTUS.
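The truncation ingredient can be sketched in isolation. This is a hedged illustration only: the threshold here is arbitrary, and LOTUS itself combines truncation with low-rank matrix estimation and dynamic exploration.

```python
# Hedged sketch of the payoff-truncation ingredient mentioned above; the
# threshold is an illustrative assumption, not the schedule used by LOTUS.
import numpy as np

def truncate_payoffs(payoffs: np.ndarray, threshold: float) -> np.ndarray:
    """Clip heavy-tailed observations so empirical means concentrate."""
    return np.clip(payoffs, -threshold, threshold)

rng = np.random.default_rng(0)
heavy = rng.standard_t(df=2, size=100_000)  # heavy tails: infinite variance
print("raw mean:", heavy.mean(),
      "| truncated mean:", truncate_payoffs(heavy, 10.0).mean())
```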
arXiv Detail & Related papers (2024-04-26T21:54:31Z) - Sparse Recovery with Shuffled Labels: Statistical Limits and Practical Estimators [23.313461266708877]
We reconstruct the permutation matrix $\Pi^{\mathrm{true}}$ and the sparse signal $\beta^{\mathrm{true}}$ from shuffled labels.
We show that our proposed estimator can recover the ground truth $(\Pi^{\mathrm{true}}, \mathrm{supp}(\beta^{\mathrm{true}}))$ under mild conditions.
arXiv Detail & Related papers (2023-03-20T16:14:58Z) - Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches.
We design a computationally efficient algorithm that achieves a near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore …
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - Threshold Phenomena in Learning Halfspaces with Massart Noise [56.01192577666607]
We study the problem of PAC learning halfspaces on $\mathbb{R}^d$ with Massart noise under Gaussian marginals.
Our results qualitatively characterize the complexity of learning halfspaces in the Massart model.
arXiv Detail & Related papers (2021-08-19T16:16:48Z) - Gap-Dependent Unsupervised Exploration for Reinforcement Learning [40.990467706237396]
We present an efficient algorithm for task-agnostic reinforcement learning.
The algorithm takes only $\widetilde{\mathcal{O}}\left(1/\epsilon \cdot (H^3SA/\rho + H^4S^2A)\right)$ episodes of exploration.
We show that, information-theoretically, this bound is nearly tight for $\rho \geq \Theta(1/(HS))$ and $H>1$.
arXiv Detail & Related papers (2021-08-11T20:42:46Z) - Cascading Bandit under Differential Privacy [21.936577816668944]
We study differential privacy (DP) and local differential privacy (LDP) in cascading bandits.
Under DP, we propose an algorithm which guarantees $\epsilon$-indistinguishability and a regret of $\mathcal{O}\left(\left(\frac{\log T}{\epsilon}\right)^{1+\xi}\right)$ for an arbitrarily small $\xi$.
Under $(\epsilon,\delta)$-LDP, we relax the $K^2$ dependence through the trade-off between privacy budget $\epsilon$ and error probability $\delta$.
arXiv Detail & Related papers (2021-05-24T07:19:01Z) - Thresholded Lasso Bandit [70.17389393497125]
Thresholded Lasso bandit is an algorithm that estimates the vector defining the reward function as well as its sparse support.
We establish non-asymptotic regret upper bounds scaling as $\mathcal{O}(\log d + \sqrt{T})$ in general, and as $\mathcal{O}(\log d + \log T)$ under the so-called margin condition.
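A rough sketch of the thresholded-Lasso support estimate described above; the regularization strength and threshold below are illustrative assumptions, not the paper's tuning.

```python
# Minimal sketch of thresholded-Lasso support estimation; alpha=0.05 and the
# 0.1 threshold are illustrative, not the algorithm's data-driven choices.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 200, 500
theta = np.zeros(d)
theta[:3] = [1.0, -0.5, 0.8]                  # sparse vector defining rewards
X = rng.standard_normal((n, d))               # observed action features
y = X @ theta + 0.1 * rng.standard_normal(n)  # noisy linear rewards

coef = Lasso(alpha=0.05).fit(X, y).coef_      # Lasso estimate of theta
support = np.flatnonzero(np.abs(coef) > 0.1)  # threshold out small entries
print("estimated support:", support)          # ideally [0 1 2]
```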
arXiv Detail & Related papers (2020-10-22T19:14:37Z) - $Q$-learning with Logarithmic Regret [60.24952657636464]
We prove that an optimistic variant of $Q$-learning enjoys a $\mathcal{O}\left(\frac{SA\cdot \mathrm{poly}(H)}{\Delta_{\min}}\log(SAT)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap.
arXiv Detail & Related papers (2020-06-16T13:01:33Z) - Nearly Optimal Regret for Stochastic Linear Bandits with Heavy-Tailed Payoffs [35.988644745703645]
We analyze linear bandits with heavy-tailed payoffs, where the payoffs admit finite moments of order $1+\epsilon$.
We propose two novel algorithms which enjoy a sublinear regret bound of $\widetilde{O}(d^{\frac{1}{2}}T^{\frac{1}{1+\epsilon}})$.
arXiv Detail & Related papers (2020-04-28T13:01:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.