Improved Regret Bounds of (Multinomial) Logistic Bandits via
Regret-to-Confidence-Set Conversion
- URL: http://arxiv.org/abs/2310.18554v3
- Date: Wed, 13 Mar 2024 02:58:07 GMT
- Title: Improved Regret Bounds of (Multinomial) Logistic Bandits via
Regret-to-Confidence-Set Conversion
- Authors: Junghyun Lee, Se-Young Yun, Kwang-Sung Jun
- Abstract summary: We construct a convex confidence set based only on the existence of an online learning algorithm with a regret guarantee.
Using R2CS, we obtain a strict improvement in the regret bound w.r.t. $S$ in logistic bandits while retaining computational feasibility.
We extend this analysis to multinomial logistic bandits and obtain similar improvements in the regret, showing the efficacy of R2CS.
- Score: 40.159846982245746
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The logistic bandit is a ubiquitous framework for modeling users'
choices, e.g., click vs. no click in advertisement recommender systems. We
observe that prior works overlook or neglect the dependence on $S \geq \lVert
\theta_\star \rVert_2$, where $\theta_\star \in \mathbb{R}^d$ is the unknown
parameter vector; this is particularly problematic when $S$ is large, e.g.,
$S \geq d$. In this work, we improve the dependence on $S$ via a novel
approach called {\it regret-to-confidence-set conversion (R2CS)}, which allows
us to construct a convex confidence set based only on the \textit{existence}
of an online learning algorithm with a regret guarantee. Using R2CS, we obtain
a strict improvement in the regret bound w.r.t. $S$ in logistic bandits while
retaining computational feasibility and the dependence on other factors such
as $d$ and $T$. We apply our new confidence set to the regret analysis of
logistic bandits with a new martingale concentration step that circumvents an
additional factor of $S$. We then extend this analysis to multinomial logistic
bandits and obtain similar improvements in the regret, demonstrating the
efficacy of R2CS. While we apply R2CS to the (multinomial) logistic model, it
is a generic approach for developing confidence sets for various models and
may be of independent interest.
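To make the conversion concrete, here is a minimal sketch of the R2CS idea as described in the abstract, in the spirit of classical online-to-confidence-set conversions; the loss level-set form and the radius $\beta_t(\delta)$ below are illustrative assumptions, not the paper's exact statements. Suppose an online learner plays $\theta_1, \dots, \theta_t$ against convex losses $\ell_s$ (e.g., the logistic loss) with a regret guarantee $\sum_{s=1}^{t} \ell_s(\theta_s) - \min_{\lVert\theta\rVert_2 \leq S} \sum_{s=1}^{t} \ell_s(\theta) \leq B_t$. Then a confidence set can be carved out of the cumulative-loss level set
$$\mathcal{C}_t(\delta) = \Big\{ \theta : \lVert\theta\rVert_2 \leq S, \ \sum_{s=1}^{t} \ell_s(\theta) \leq \sum_{s=1}^{t} \ell_s(\theta_s) + \beta_t(\delta) \Big\},$$
where $\beta_t(\delta)$ combines $B_t$ with a martingale concentration term chosen so that $\theta_\star \in \mathcal{C}_t(\delta)$ with probability at least $1 - \delta$; convexity of each $\ell_s$ makes $\mathcal{C}_t(\delta)$ convex.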
Related papers
- Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers an $\tilde{O}(d)$ regret.
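As a quick sanity check of that deterministic case (our own plug-in, not text from the paper): setting $\sigma_t = 0$ for all $t$ gives $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big) = \tilde{O}(d \cdot 0 + d) = \tilde{O}(d)$.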
arXiv Detail & Related papers (2023-10-02T08:15:52Z) - Bridging Distributional and Risk-sensitive Reinforcement Learning with
Provable Regret Bounds [24.571530193140916]
We consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of return.
We propose two novel DRL algorithms that implement optimism through two different schemes, including a model-free one and a model-based one.
We prove that they both attain an $\tilde{\mathcal{O}}\big(\frac{\exp(|\beta| H)-1}{|\beta|}H\sqrt{S^2 A K}\big)$ regret upper bound, where $S$, $A$, $K$, and $H$ represent the number of states, the number of actions, the number of episodes, and the horizon, respectively.
arXiv Detail & Related papers (2022-10-25T14:30:48Z) - Batch-Size Independent Regret Bounds for Combinatorial Semi-Bandits with
Probabilistically Triggered Arms or Independent Arms [53.89752069984382]
We study combinatorial semi-bandits (CMAB) and focus on reducing the dependence on the batch size $K$ in the regret bound.
First, for the setting of CMAB with probabilistically triggered arms (CMAB-T), we propose a BCUCB-T algorithm with variance-aware confidence intervals.
Second, for the setting of non-triggering CMAB with independent arms, we propose a SESCB algorithm which leverages the non-triggering version of the TPVM condition.
arXiv Detail & Related papers (2022-08-31T13:09:39Z) - An Experimental Design Approach for Regret Minimization in Logistic
Bandits [26.674062544226636]
The main challenge of logistic bandits is reducing the dependence on a potentially large problem-dependent constant $\kappa$.
We propose a new warmup sampling algorithm that can dramatically reduce the lower-order term in the regret in general.
arXiv Detail & Related papers (2022-02-04T21:56:40Z) - Scale Free Adversarial Multi Armed Bandits [13.757095663704858]
We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem.
We design a Follow The Regularized Leader (FTRL) algorithm, which comes with the first scale-free regret guarantee for MAB.
We also develop a new technique for obtaining local-norm lower bounds for Bregman Divergences.
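For reference, the generic FTRL template (the standard form only; the paper's specific scale-free regularizer and learning rates are not reproduced here) selects $x_{t+1} = \arg\min_{x \in \Delta_K} \big( \sum_{s=1}^{t} \langle \hat{g}_s, x \rangle + \Psi_t(x) \big)$, where $\hat{g}_s$ are loss estimates and $\Psi_t$ is a convex regularizer whose scaling drives the local-norm analysis.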
arXiv Detail & Related papers (2021-06-08T21:26:57Z) - Improved Analysis of Robustness of the Tsallis-INF Algorithm to
Adversarial Corruptions in Stochastic Multiarmed Bandits [12.462608802359936]
We derive improved regret bounds for the Tsallis-INF algorithm of Zimmert and Seldin (2021).
In particular, for $C = \Theta\left(\frac{T}{\log T}\right)$, where $T$ is the time horizon, we achieve an improvement by a multiplicative factor.
We also improve the dependence of the regret bound on the time horizon from $\log T$ to $\log \frac{(K-1)T}{\left(\sum_{i \neq i^*} \frac{1}{\Delta_i}\right)^2}$.
arXiv Detail & Related papers (2021-03-23T12:26:39Z) - Variance-Aware Confidence Set: Variance-Dependent Bound for Linear
Bandits and Horizon-Free Bound for Linear Mixture MDP [76.94328400919836]
We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov Decision Processes (MDPs).
For linear bandits, we obtain an $\widetilde{O}\big(\mathrm{poly}(d)\sqrt{1 + \sum_{i=1}^{K}\sigma_i^2}\big)$ regret bound, where $d$ is the feature dimension.
For linear mixture MDPs, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{K})$ regret bound, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-29T18:57:52Z) - Efficient Optimistic Exploration in Linear-Quadratic Regulators via
Lagrangian Relaxation [107.06364966905821]
We study the exploration-exploitation dilemma in the linear quadratic regulator (LQR) setting.
Inspired by the extended value iteration algorithm used in optimistic algorithms for finite MDPs, we propose to relax the optimistic optimization of OFU-LQ.
We show that an $\epsilon$-optimistic controller can be computed efficiently by solving at most $O\big(\log(1/\epsilon)\big)$ Riccati equations, as sketched below.
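As a rough illustration of why an $O\big(\log(1/\epsilon)\big)$ count of Riccati solves is plausible, below is a hypothetical bisection sketch in which each iteration costs one discrete algebraic Riccati equation (DARE) solve; the penalized cost and the acceptance test are placeholders of our own, not the paper's Lagrangian relaxation.

```python
# Hypothetical sketch: bisection over a scalar multiplier with one DARE
# solve per iteration, so the loop runs O(log(lam_max / eps)) times.
# The lam * I penalty and the stability test are placeholder choices,
# not the paper's actual Lagrangian relaxation.
import numpy as np
from scipy.linalg import solve_discrete_are

def optimistic_controller(A, B, Q, R, lam_max=1e3, eps=1e-6):
    def gain(lam):
        # One Riccati solve with the state cost inflated by lam * I.
        P = solve_discrete_are(A, B, Q + lam * np.eye(Q.shape[0]), R)
        # Corresponding discrete-time LQR feedback gain.
        return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

    lo, hi = 0.0, lam_max
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        K = gain(mid)
        # Keep the smallest multiplier whose closed loop A - B K is
        # stable (spectral radius < 1).
        if np.max(np.abs(np.linalg.eigvals(A - B @ K))) < 1.0:
            hi = mid
        else:
            lo = mid
    return gain(hi)
```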
arXiv Detail & Related papers (2020-07-13T16:30:47Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies whose expected cumulative reward is maximized while each policy's expected cost stays below a given threshold.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Improved Optimistic Algorithms for Logistic Bandits [16.140301473601454]
We propose a new optimistic algorithm based on a finer examination of the non-linearities of the reward function.
We show that it enjoys an $\tilde{\mathcal{O}}(\sqrt{T})$ regret with no dependency on $\kappa$, except in a second-order term.
arXiv Detail & Related papers (2020-02-18T12:52:32Z)
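For context on the constant appearing in the two logistic-bandit entries above (a standard definition from the literature, with minor variations across papers): with the logistic link $\mu(z) = (1+e^{-z})^{-1}$, one typically sets $\kappa = \sup_{x} 1/\dot{\mu}(x^\top \theta_\star)$, which can be exponentially large in $S$, so confining $\kappa$ to lower-order terms is the key difficulty.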