An Experimental Design Approach for Regret Minimization in Logistic
Bandits
- URL: http://arxiv.org/abs/2202.02407v1
- Date: Fri, 4 Feb 2022 21:56:40 GMT
- Title: An Experimental Design Approach for Regret Minimization in Logistic
Bandits
- Authors: Blake Mason, Kwang-Sung Jun, Lalit Jain
- Abstract summary: The main challenge of logistic bandits is reducing the dependence on a potentially large problem-dependent constant $\kappa$.
We propose a new warmup sampling algorithm that can dramatically reduce the lower order term in the regret in general.
- Score: 26.674062544226636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work we consider the problem of regret minimization for logistic
bandits. The main challenge of logistic bandits is reducing the dependence on a
potentially large problem-dependent constant $\kappa$ that can at worst scale
exponentially with the norm of the unknown parameter $\theta_{\ast}$. Abeille
et al. (2021) have applied self-concordance of the logistic function to remove
this worst-case dependence providing regret guarantees like
$O(d\log^2(\kappa)\sqrt{\dot\mu T}\log(|\mathcal{X}|))$ where $d$ is the
dimensionality, $T$ is the time horizon, and $\dot\mu$ is the variance of the
reward at the best arm. This work improves upon this bound in the fixed-arm setting by
employing an experimental design procedure that achieves a minimax regret of
$O(\sqrt{d \dot\mu T\log(|\mathcal{X}|)})$. In fact, our analysis yields a
tighter instance-dependent (i.e., gap-dependent) regret bound, the first of its
kind for logistic bandits. We also propose a new warmup sampling algorithm that can
dramatically reduce the lower order term in the regret in general and prove
that, for some instances, it reduces the lower order term's dependence on
$\kappa$ to $\log^2(\kappa)$. Finally, we discuss the impact of the bias
of the MLE on the logistic bandit problem, providing an example where $d^2$
lower order regret (cf. $d$ for linear bandits) may not be improved as long as
the MLE is used, and showing how bias-corrected estimators can bring it closer
to $d$.
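To make the quantities above concrete, here is a minimal numerical sketch (our illustration under standard logistic-bandit definitions, not the authors' algorithm): it computes the link slope $\dot\mu$, the constant $\kappa = \max_{x\in\mathcal{X}} 1/\dot\mu(x^\top\theta_\ast)$, and a Frank-Wolfe approximation to a G-optimal design of the kind that experimental-design approaches build on. All function names are ours.

```python
# A minimal sketch (ours, not the paper's algorithm): the logistic link slope,
# the problem constant kappa, and a Frank-Wolfe G-optimal design.
import numpy as np

def mu(z):
    """Logistic link: mu(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def mu_dot(z):
    """Slope of the link, i.e. the Bernoulli reward variance mu(z) * (1 - mu(z))."""
    m = mu(z)
    return m * (1.0 - m)

def kappa(X, theta_star):
    """kappa = max_x 1 / mu_dot(<x, theta*>); can scale like exp(||theta*||)."""
    return float(np.max(1.0 / mu_dot(X @ theta_star)))

def g_optimal_design(X, n_iters=500):
    """Frank-Wolfe for a G-optimal design: minimize max_x x^T A(lam)^{-1} x
    over distributions lam on the arms, with A(lam) = sum_x lam_x x x^T."""
    n_arms, _ = X.shape
    lam = np.full(n_arms, 1.0 / n_arms)
    for t in range(n_iters):
        A_inv = np.linalg.pinv(X.T @ (lam[:, None] * X))
        scores = np.einsum("ij,jk,ik->i", X, A_inv, X)  # x^T A^{-1} x per arm
        step = 2.0 / (t + 2)                            # standard FW step size
        lam *= 1.0 - step
        lam[np.argmax(scores)] += step                  # move mass to worst arm
    return lam

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # 50 unit-norm arms in R^5
theta_star = rng.normal(size=5)
theta_star *= 3.0 / np.linalg.norm(theta_star)  # ||theta*|| = 3
print("kappa:", kappa(X, theta_star))           # grows quickly with ||theta*||
lam = g_optimal_design(X)
print("design support size:", int((lam > 1e-3).sum()))
```

Design-based methods of this flavor sample arms according to $\lambda$ so that estimation error is controlled uniformly over the arm set; the sketch only illustrates the objects involved, not the paper's warmup procedure.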
Related papers
- Batched Stochastic Bandit for Nondegenerate Functions [8.015503209312786]
This paper studies batched bandit learning problems for nondegenerate functions.
We introduce an algorithm that solves this batched bandit problem near-optimally.
arXiv Detail & Related papers (2024-05-09T12:50:16Z) - Improved Regret for Bandit Convex Optimization with Delayed Feedback [50.46856739179311]
We study bandit convex optimization (BCO) with delayed feedback, where only the loss value of the action is revealed under a delay.
We develop a novel algorithm, and prove that it enjoys a regret bound of $O(\sqrt{n}T^{3/4}+\sqrt{dT})$ in general.
We show that the proposed algorithm can improve the regret bound to $O((nT)^{2/3}\log^{1/3}T+d\log T)$ for strongly convex functions.
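As a hedged illustration of the delayed-feedback protocol in this entry (a generic one-point gradient-estimator loop, not the paper's algorithm; all names and constants are ours):

```python
# Bandit convex optimization with delayed feedback: only f(x_t + delta*u_t) is
# revealed, `delay` rounds late. Classic one-point estimate g = (n/delta)*f*u.
import collections
import numpy as np

def delayed_bco(f, n, T, delay, eta=0.001, delta=0.2, seed=0):
    """Illustrative gradient descent with one-point bandit estimates under delay."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    pending = collections.deque()              # (arrival_round, gradient_estimate)
    for t in range(T):
        u = rng.normal(size=n)
        u /= np.linalg.norm(u)                 # uniform direction on the sphere
        loss = f(x + delta * u)                # the only feedback we receive
        pending.append((t + delay, (n / delta) * loss * u))
        while pending and pending[0][0] <= t:  # apply estimates once delay elapsed
            _, g = pending.popleft()
            x -= eta * g
    return x

# Toy run: quadratic loss in n = 2 dimensions, delay 5; the noisy one-point
# estimates should make x drift roughly toward the minimizer (1, 1).
print(delayed_bco(lambda x: np.sum((x - 1.0) ** 2), n=2, T=20000, delay=5))
```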
arXiv Detail & Related papers (2024-02-14T13:08:26Z) - Improved Regret Bounds of (Multinomial) Logistic Bandits via
Regret-to-Confidence-Set Conversion [40.159846982245746]
We construct a convex confidence set based only on the existence of an online learning algorithm with a regret guarantee.
Using R2CS, we obtain a strict improvement in the regret bound w.r.t. $S$ in logistic bandits while retaining computational feasibility.
We extend this analysis to multinomial logistic bandits and obtain similar improvements in the regret, showing the efficacy of R2CS.
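The following is a hedged sketch, in LaTeX, of the generic online-to-confidence-set idea that conversions of this kind build on; it conveys the flavor, not the exact R2CS statement or constants:

```latex
% Given convex losses \ell_s(\theta), cumulative loss L_t(\theta) = \sum_{s \le t} \ell_s(\theta),
% and an online learner whose iterates satisfy the regret bound
% \sum_{s \le t} \ell_s(\theta_s) - \min_{\theta} L_t(\theta) \le B_t:
\[
  \mathcal{C}_t(\delta)
    = \Bigl\{\theta \,:\, L_t(\theta) \le \min_{\theta'} L_t(\theta') + \beta_t(\delta)\Bigr\},
  \qquad
  \beta_t(\delta) = B_t + O\bigl(\sqrt{B_t \log(1/\delta)}\bigr),
\]
% which is convex (a sublevel set of the convex function L_t) and contains
% \theta_\ast with probability at least 1 - \delta by martingale concentration.
```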
arXiv Detail & Related papers (2023-10-28T01:27:52Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Variance-Aware Sparse Linear Bandits [64.70681598741417]
Worst-case minimax regret for sparse linear bandits is $\widetilde{\Theta}(\sqrt{dT})$.
In the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve an $\widetilde{\mathcal{O}}(1)$ regret.
We develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits.
arXiv Detail & Related papers (2022-05-26T15:55:44Z) - Generalized Bandit Regret Minimizer Framework in Imperfect Information
Extensive-Form Game [9.933208900617174]
We consider the problem in the interactive bandit-feedback setting where we don't know the dynamics of the IIEG.
To learn NE, the regret minimizer is required to estimate the full-feedback loss gradient $\ell^t$ by $v(z^t)$ and minimize the regret.
We present a novel method, SIX-OMD, to learn approximate NE. It is model-free and substantially improves the best existing convergence rate from the order of $O(\sqrt{XB/T}+\sqrt{YC/T})$ to $O(\ldots)$
arXiv Detail & Related papers (2022-03-11T13:45:42Z) - Scale Free Adversarial Multi Armed Bandits [13.757095663704858]
We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem.
We design a Follow The Regularized Leader (FTRL) algorithm, which comes with the first scale-free regret guarantee for MAB.
We also develop a new technique for obtaining local-norm lower bounds for Bregman Divergences.
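For context, a minimal FTRL template for adversarial MAB (a generic sketch with the negative-entropy regularizer, which recovers exponential-weights play; it is not scale-free and is not the paper's algorithm):

```python
# Generic FTRL for adversarial multi-armed bandits: play p_t minimizing
# <p, L_hat_t> + (1/eta) * sum_i p_i log p_i, i.e. exponential weights over
# importance-weighted loss estimates. Illustrative constants throughout.
import numpy as np

def ftrl_mab(loss_matrix, eta=0.1, seed=0):
    """loss_matrix: (T, K) per-round losses; returns the chosen arm per round."""
    rng = np.random.default_rng(seed)
    T, K = loss_matrix.shape
    cum_est = np.zeros(K)                  # cumulative loss estimates L_hat
    actions = np.empty(T, dtype=int)
    for t in range(T):
        logits = -eta * cum_est
        p = np.exp(logits - logits.max())  # softmax, numerically stable
        p /= p.sum()
        a = rng.choice(K, p=p)
        actions[t] = a
        cum_est[a] += loss_matrix[t, a] / p[a]  # unbiased importance weighting
    return actions

# Toy run: arm 2 has the smallest mean loss, so it should dominate eventually.
losses = np.random.default_rng(1).uniform(size=(2000, 5))
losses[:, 2] -= 0.3
acts = ftrl_mab(losses)
print("share of pulls on the best arm:", (acts[1000:] == 2).mean())
```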
arXiv Detail & Related papers (2021-06-08T21:26:57Z) - Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Stochastic Shortest Path (SSP) problem, in which an agent has to reach a goal state in minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde{O}(B_\star \sqrt{|S||A|K})$ where $B_\star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z) - Thresholded Lasso Bandit [70.17389393497125]
Thresholded Lasso bandit is an algorithm that estimates the vector defining the reward function as well as its sparse support.
We establish non-asymptotic regret upper bounds scaling as $\mathcal{O}(\log d + \sqrt{T})$ in general, and as $\mathcal{O}(\log d + \log T)$ under the so-called margin condition.
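A hedged sketch of the thresholding idea in this entry (fit a Lasso estimate, then keep only coefficients above a cutoff as the estimated support; the constants and the cutoff rule here are illustrative, not the paper's tuned values):

```python
# Thresholded Lasso support estimation (illustrative constants, not the
# paper's tuned values): fit Lasso, keep coefficients above a cutoff.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, n, s = 200, 500, 4
theta = np.zeros(d)
theta[:s] = 1.0                                # sparse true reward parameter
X = rng.normal(size=(n, d))                    # observed feature vectors
y = X @ theta + 0.1 * rng.normal(size=n)       # noisy rewards

coef = Lasso(alpha=0.05).fit(X, y).coef_
cutoff = 0.5 * np.abs(coef).max()              # illustrative thresholding rule
support = np.flatnonzero(np.abs(coef) >= cutoff)
print("estimated support:", support)           # ideally recovers {0, 1, 2, 3}
# A bandit algorithm can then restrict estimation and arm selection to this
# support, which is how log(d)-type dependence on the ambient dimension arises.
```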
arXiv Detail & Related papers (2020-10-22T19:14:37Z) - Near-optimal Regret Bounds for Stochastic Shortest Path [63.029132134792555]
Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost.
We show that any learning algorithm must have at least $\Omega(B_\star \sqrt{|S||A|K})$ regret in the worst case.
arXiv Detail & Related papers (2020-02-23T09:10:14Z) - Improved Optimistic Algorithms for Logistic Bandits [16.140301473601454]
We propose a new optimistic algorithm based on a finer examination of the non-linearities of the reward function.
We show that it enjoys a $\widetilde{\mathcal{O}}(\sqrt{T})$ regret with no dependence on $\kappa$, except in a second-order term.
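As a hedged sketch of the optimistic template that such algorithms refine (a generic logistic MLE plus a standard elliptical bonus; the paper's finer, self-concordance-aware analysis is not reproduced here, and all names are ours):

```python
# Generic optimistic step for a logistic bandit: ridge-regularized logistic MLE
# plus a standard elliptical bonus ||x||_{V^{-1}}. All names/constants are ours.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_mle(X_hist, y_hist, steps=300, lr=0.5, reg=1.0):
    """Fit a ridge-regularized logistic MLE by gradient descent (illustrative)."""
    theta = np.zeros(X_hist.shape[1])
    for _ in range(steps):
        grad = X_hist.T @ (sigmoid(X_hist @ theta) - y_hist) + reg * theta
        theta -= lr * grad / len(y_hist)
    return theta

def optimistic_arm(arms, X_hist, y_hist, beta=1.0, reg=1.0):
    """Pick argmax_x <x, theta_hat> + beta * ||x||_{V^{-1}}, V = reg*I + X^T X."""
    theta = logistic_mle(X_hist, y_hist, reg=reg)
    V_inv = np.linalg.inv(reg * np.eye(arms.shape[1]) + X_hist.T @ X_hist)
    bonus = np.sqrt(np.einsum("ij,jk,ik->i", arms, V_inv, arms))
    return int(np.argmax(arms @ theta + beta * bonus))

# Toy run: 100 past pulls with Bernoulli rewards from a hidden theta_true.
rng = np.random.default_rng(0)
arms = rng.normal(size=(20, 4))
theta_true = np.array([1.0, -1.0, 0.5, 0.0])
X_hist = arms[rng.integers(0, 20, size=100)]
y_hist = (rng.uniform(size=100) < sigmoid(X_hist @ theta_true)).astype(float)
print("optimistic arm index:", optimistic_arm(arms, X_hist, y_hist))
```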
arXiv Detail & Related papers (2020-02-18T12:52:32Z)