An Experimental Design Approach for Regret Minimization in Logistic
Bandits
- URL: http://arxiv.org/abs/2202.02407v1
- Date: Fri, 4 Feb 2022 21:56:40 GMT
- Title: An Experimental Design Approach for Regret Minimization in Logistic
Bandits
- Authors: Blake Mason, Kwang-Sung Jun, Lalit Jain
- Abstract summary: The main challenge of logistic bandits is reducing the dependence on a potentially large problem-dependent constant $\kappa$.
We propose a new warmup sampling algorithm that can dramatically reduce the lower order term in the regret in general.
- Score: 26.674062544226636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work we consider the problem of regret minimization for logistic
bandits. The main challenge of logistic bandits is reducing the dependence on a
potentially large problem-dependent constant $\kappa$ that can at worst scale
exponentially with the norm of the unknown parameter $\theta_{\ast}$. Abeille
et al. (2021) have applied self-concordance of the logistic function to remove
this worst-case dependence providing regret guarantees like
$O(d\log^2(\kappa)\sqrt{\dot\mu T}\log(|\mathcal{X}|))$ where $d$ is the
dimensionality, $T$ is the time horizon, and $\dot\mu$ is the variance of the
reward at the best arm. This work improves upon this bound in the fixed-arm setting by
employing an experimental design procedure that achieves a minimax regret of
$O(\sqrt{d \dot\mu T\log(|\mathcal{X}|)})$. In fact, our analysis yields a
tighter instance-dependent (i.e., gap-dependent) regret bound, the first of its
kind for logistic bandits. We also propose a new warmup sampling algorithm that can
dramatically reduce the lower order term in the regret in general and prove
that, for some instances, it reduces the lower order term's dependence on
$\kappa$ to $\log^2(\kappa)$. Finally, we discuss the impact of the bias
of the MLE on the logistic bandit problem, providing an example where $d^2$
lower order regret (cf. $d$ for linear bandits) may not be improved as long as
the MLE is used, and showing how bias-corrected estimators can bring it closer
to $d$.
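To make the quantities above concrete, here is a minimal numerical sketch (our illustration under standard logistic-bandit definitions, not the authors' algorithm): it computes the link slope $\dot\mu$, the constant $\kappa = \max_{x\in\mathcal{X}} 1/\dot\mu(x^\top\theta_\ast)$, and a Frank-Wolfe approximation to a G-optimal design of the kind that experimental-design approaches build on. All function names are ours.

```python
# A minimal sketch (ours, not the paper's algorithm): the logistic link slope,
# the problem constant kappa, and a Frank-Wolfe G-optimal design.
import numpy as np

def mu(z):
    """Logistic link: mu(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def mu_dot(z):
    """Slope of the link, i.e. the Bernoulli reward variance mu(z) * (1 - mu(z))."""
    m = mu(z)
    return m * (1.0 - m)

def kappa(X, theta_star):
    """kappa = max_x 1 / mu_dot(<x, theta*>); can scale like exp(||theta*||)."""
    return float(np.max(1.0 / mu_dot(X @ theta_star)))

def g_optimal_design(X, n_iters=500):
    """Frank-Wolfe for a G-optimal design: minimize max_x x^T A(lam)^{-1} x
    over distributions lam on the arms, with A(lam) = sum_x lam_x x x^T."""
    n_arms, _ = X.shape
    lam = np.full(n_arms, 1.0 / n_arms)
    for t in range(n_iters):
        A_inv = np.linalg.pinv(X.T @ (lam[:, None] * X))
        scores = np.einsum("ij,jk,ik->i", X, A_inv, X)  # x^T A^{-1} x per arm
        step = 2.0 / (t + 2)                            # standard FW step size
        lam *= 1.0 - step
        lam[np.argmax(scores)] += step                  # move mass to worst arm
    return lam

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # 50 unit-norm arms in R^5
theta_star = rng.normal(size=5)
theta_star *= 3.0 / np.linalg.norm(theta_star)  # ||theta*|| = 3
print("kappa:", kappa(X, theta_star))           # grows quickly with ||theta*||
lam = g_optimal_design(X)
print("design support size:", int((lam > 1e-3).sum()))
```

Design-based methods of this flavor sample arms according to $\lambda$ so that estimation error is controlled uniformly over the arm set; the sketch only illustrates the objects involved, not the paper's warmup procedure.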
Related papers
- Batched Stochastic Bandit for Nondegenerate Functions [8.015503209312786]
This paper studies batched bandit learning problems for nondegenerate functions.
We introduce an algorithm that solves this batched bandit problem near-optimally.
arXiv Detail & Related papers (2024-05-09T12:50:16Z) - Improved Regret for Bandit Convex Optimization with Delayed Feedback [50.46856739179311]
We study bandit convex optimization (BCO) with delayed feedback, where only the loss value of the action is revealed under a delay.
We develop a novel algorithm, and prove that it enjoys a regret bound of $O(\sqrt{n}T^{3/4}+\sqrt{dT})$ in general.
We show that the proposed algorithm can improve the regret bound to $O((nT)^{2/3}\log^{1/3}T+d\log T)$ for strongly convex functions.
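As a hedged illustration of the delayed-feedback protocol in this entry (a generic one-point gradient-estimator loop, not the paper's algorithm; all names and constants are ours):

```python
# Bandit convex optimization with delayed feedback: only f(x_t + delta*u_t) is
# revealed, `delay` rounds late. Classic one-point estimate g = (n/delta)*f*u.
import collections
import numpy as np

def delayed_bco(f, n, T, delay, eta=0.001, delta=0.2, seed=0):
    """Illustrative gradient descent with one-point bandit estimates under delay."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    pending = collections.deque()              # (arrival_round, gradient_estimate)
    for t in range(T):
        u = rng.normal(size=n)
        u /= np.linalg.norm(u)                 # uniform direction on the sphere
        loss = f(x + delta * u)                # the only feedback we receive
        pending.append((t + delay, (n / delta) * loss * u))
        while pending and pending[0][0] <= t:  # apply estimates once delay elapsed
            _, g = pending.popleft()
            x -= eta * g
    return x

# Toy run: quadratic loss in n = 2 dimensions, delay 5; the noisy one-point
# estimates should make x drift roughly toward the minimizer (1, 1).
print(delayed_bco(lambda x: np.sum((x - 1.0) ** 2), n=2, T=20000, delay=5))
```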
arXiv Detail & Related papers (2024-02-14T13:08:26Z) - Improved Regret Bounds of (Multinomial) Logistic Bandits via
Regret-to-Confidence-Set Conversion [40.159846982245746]
We construct a convex confidence set based only on the existence of an online learning algorithm with a regret guarantee.
Using R2CS, we obtain a strict improvement in the regret bound w.r.t. $S$ in logistic bandits while retaining computational feasibility.
We extend this analysis to multinomial logistic bandits and obtain similar improvements in the regret, showing the efficacy of R2CS.
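The following is a hedged sketch, in LaTeX, of the generic online-to-confidence-set idea that conversions of this kind build on; it conveys the flavor, not the exact R2CS statement or constants:

```latex
% Given convex losses \ell_s(\theta), cumulative loss L_t(\theta) = \sum_{s \le t} \ell_s(\theta),
% and an online learner whose iterates satisfy the regret bound
% \sum_{s \le t} \ell_s(\theta_s) - \min_{\theta} L_t(\theta) \le B_t:
\[
  \mathcal{C}_t(\delta)
    = \Bigl\{\theta \,:\, L_t(\theta) \le \min_{\theta'} L_t(\theta') + \beta_t(\delta)\Bigr\},
  \qquad
  \beta_t(\delta) = B_t + O\bigl(\sqrt{B_t \log(1/\delta)}\bigr),
\]
% which is convex (a sublevel set of the convex function L_t) and contains
% \theta_\ast with probability at least 1 - \delta by martingale concentration.
```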
arXiv Detail & Related papers (2023-10-28T01:27:52Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Variance-Aware Sparse Linear Bandits [64.70681598741417]
Worst-case minimax regret for sparse linear bandits is $\widetilde{\Theta}(\sqrt{dT})$.
In the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve an $\widetilde{\mathcal{O}}(1)$ regret.
We develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits.
arXiv Detail & Related papers (2022-05-26T15:55:44Z) - Generalized Bandit Regret Minimizer Framework in Imperfect Information
Extensive-Form Game [9.933208900617174]
We consider the problem in the interactive bandit-feedback setting where we don't know the dynamics of the IIEG.
To learn NE, the regret minimizer is required to estimate the full-feedback loss gradient $\ell^t$ by $v(z^t)$ and minimize the regret.
We present a novel method, SIX-OMD, to learn approximate NE. It is model-free and substantially improves the best existing convergence rate from the order of $O(\sqrt{XB/T}+\sqrt{YC/T})$ to $O(\ldots)$
arXiv Detail & Related papers (2022-03-11T13:45:42Z) - Scale Free Adversarial Multi Armed Bandits [13.757095663704858]
We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem.
We design a Follow The Regularized Leader (FTRL) algorithm, which comes with the first scale-free regret guarantee for MAB.
We also develop a new technique for obtaining local-norm lower bounds for Bregman Divergences.
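For context, a minimal FTRL template for adversarial MAB (a generic sketch with the negative-entropy regularizer, which recovers exponential-weights play; it is not scale-free and is not the paper's algorithm):

```python
# Generic FTRL for adversarial multi-armed bandits: play p_t minimizing
# <p, L_hat_t> + (1/eta) * sum_i p_i log p_i, i.e. exponential weights over
# importance-weighted loss estimates. Illustrative constants throughout.
import numpy as np

def ftrl_mab(loss_matrix, eta=0.1, seed=0):
    """loss_matrix: (T, K) per-round losses; returns the chosen arm per round."""
    rng = np.random.default_rng(seed)
    T, K = loss_matrix.shape
    cum_est = np.zeros(K)                  # cumulative loss estimates L_hat
    actions = np.empty(T, dtype=int)
    for t in range(T):
        logits = -eta * cum_est
        p = np.exp(logits - logits.max())  # softmax, numerically stable
        p /= p.sum()
        a = rng.choice(K, p=p)
        actions[t] = a
        cum_est[a] += loss_matrix[t, a] / p[a]  # unbiased importance weighting
    return actions

# Toy run: arm 2 has the smallest mean loss, so it should dominate eventually.
losses = np.random.default_rng(1).uniform(size=(2000, 5))
losses[:, 2] -= 0.3
acts = ftrl_mab(losses)
print("share of pulls on the best arm:", (acts[1000:] == 2).mean())
```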
arXiv Detail & Related papers (2021-06-08T21:26:57Z) - Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Stochastic Shortest Path (SSP) problem, in which an agent has to reach a goal state in minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde{O}(B_\star \sqrt{|S||A|K})$ where $B_\star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z) - Thresholded Lasso Bandit [70.17389393497125]
Thresholded Lasso bandit is an algorithm that estimates the vector defining the reward function as well as its sparse support.
We establish non-asymptotic regret upper bounds scaling as $\mathcal{O}(\log d + \sqrt{T})$ in general, and as $\mathcal{O}(\log d + \log T)$ under the so-called margin condition.
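A hedged sketch of the thresholding idea in this entry (fit a Lasso estimate, then keep only coefficients above a cutoff as the estimated support; the constants and the cutoff rule here are illustrative, not the paper's tuned values):

```python
# Thresholded Lasso support estimation (illustrative constants, not the
# paper's tuned values): fit Lasso, keep coefficients above a cutoff.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, n, s = 200, 500, 4
theta = np.zeros(d)
theta[:s] = 1.0                                # sparse true reward parameter
X = rng.normal(size=(n, d))                    # observed feature vectors
y = X @ theta + 0.1 * rng.normal(size=n)       # noisy rewards

coef = Lasso(alpha=0.05).fit(X, y).coef_
cutoff = 0.5 * np.abs(coef).max()              # illustrative thresholding rule
support = np.flatnonzero(np.abs(coef) >= cutoff)
print("estimated support:", support)           # ideally recovers {0, 1, 2, 3}
# A bandit algorithm can then restrict estimation and arm selection to this
# support, which is how log(d)-type dependence on the ambient dimension arises.
```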
arXiv Detail & Related papers (2020-10-22T19:14:37Z) - Near-optimal Regret Bounds for Stochastic Shortest Path [63.029132134792555]
Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost.
We show that any learning algorithm must have at least $\Omega(B_\star \sqrt{|S||A|K})$ regret in the worst case.
arXiv Detail & Related papers (2020-02-23T09:10:14Z) - Improved Optimistic Algorithms for Logistic Bandits [16.140301473601454]
We propose a new optimistic algorithm based on a finer examination of the non-linearities of the reward function.
We show that it enjoys a $\widetilde{\mathcal{O}}(\sqrt{T})$ regret with no dependence on $\kappa$, except in a second-order term.
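As a hedged sketch of the optimistic template that such algorithms refine (a generic logistic MLE plus a standard elliptical bonus; the paper's finer, self-concordance-aware analysis is not reproduced here, and all names are ours):

```python
# Generic optimistic step for a logistic bandit: ridge-regularized logistic MLE
# plus a standard elliptical bonus ||x||_{V^{-1}}. All names/constants are ours.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_mle(X_hist, y_hist, steps=300, lr=0.5, reg=1.0):
    """Fit a ridge-regularized logistic MLE by gradient descent (illustrative)."""
    theta = np.zeros(X_hist.shape[1])
    for _ in range(steps):
        grad = X_hist.T @ (sigmoid(X_hist @ theta) - y_hist) + reg * theta
        theta -= lr * grad / len(y_hist)
    return theta

def optimistic_arm(arms, X_hist, y_hist, beta=1.0, reg=1.0):
    """Pick argmax_x <x, theta_hat> + beta * ||x||_{V^{-1}}, V = reg*I + X^T X."""
    theta = logistic_mle(X_hist, y_hist, reg=reg)
    V_inv = np.linalg.inv(reg * np.eye(arms.shape[1]) + X_hist.T @ X_hist)
    bonus = np.sqrt(np.einsum("ij,jk,ik->i", arms, V_inv, arms))
    return int(np.argmax(arms @ theta + beta * bonus))

# Toy run: 100 past pulls with Bernoulli rewards from a hidden theta_true.
rng = np.random.default_rng(0)
arms = rng.normal(size=(20, 4))
theta_true = np.array([1.0, -1.0, 0.5, 0.0])
X_hist = arms[rng.integers(0, 20, size=100)]
y_hist = (rng.uniform(size=100) < sigmoid(X_hist @ theta_true)).astype(float)
print("optimistic arm index:", optimistic_arm(arms, X_hist, y_hist))
```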
arXiv Detail & Related papers (2020-02-18T12:52:32Z)