Improved Optimistic Algorithms for Logistic Bandits
- URL: http://arxiv.org/abs/2002.07530v2
- Date: Mon, 8 Jun 2020 07:36:22 GMT
- Title: Improved Optimistic Algorithms for Logistic Bandits
- Authors: Louis Faury, Marc Abeille, Clément Calauzènes, Olivier Fercoq
- Abstract summary: We propose a new optimistic algorithm based on a finer examination of the non-linearities of the reward function.
We show that it enjoys a $\tilde{\mathcal{O}}(\sqrt{T})$ regret with no dependence on $\kappa$, except for a second-order term.
- Score: 16.140301473601454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The generalized linear bandit framework has attracted a lot of attention in
recent years: it extends the well-understood linear setting and makes it possible to
model richer reward structures. It notably covers the logistic model, widely
used when rewards are binary. For logistic bandits, the frequentist regret
guarantees of existing algorithms are $\tilde{\mathcal{O}}(\kappa \sqrt{T})$,
where $\kappa$ is a problem-dependent constant. Unfortunately, $\kappa$ can be
arbitrarily large as it scales exponentially with the size of the decision set.
This may lead to significantly loose regret bounds and poor empirical
performance. In this work, we study the logistic bandit with a focus on the
prohibitive dependencies introduced by $\kappa$. We propose a new optimistic
algorithm based on a finer examination of the non-linearities of the reward
function. We show that it enjoys a $\tilde{\mathcal{O}}(\sqrt{T})$ regret with
no dependence on $\kappa$, except for a second-order term. Our analysis is based
on a new tail inequality for self-normalized martingales, which is of independent
interest.
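To make the role of $\kappa$ concrete, below is a minimal, illustrative Python sketch (not the paper's algorithm), assuming the standard logistic-bandit model with sigmoid link $\mu(z)=1/(1+e^{-z})$ and the commonly used definition $\kappa = \sup_{x \in \mathcal{X}} 1/\dot\mu(x^\top\theta_*)$; the paper's formal definition may additionally take a supremum over a parameter set. The sketch shows how binary rewards are generated and why $\kappa$ grows roughly exponentially with the radius of the decision set.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_slope(z):
    # Derivative of the sigmoid link: mu'(z) = mu(z) * (1 - mu(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def kappa(arms, theta_star):
    # Worst-case inverse slope of the link over the decision set
    # (illustrative definition; the paper's constants/parameter set may differ).
    return 1.0 / sigmoid_slope(arms @ theta_star).min()

rng = np.random.default_rng(0)
d = 5
theta_star = np.ones(d) / np.sqrt(d)            # ||theta_star|| = 1

for S in (1, 2, 4, 8):                          # radius of the decision set
    arms = rng.normal(size=(1000, d))
    arms = S * arms / np.linalg.norm(arms, axis=1, keepdims=True)   # ||x|| = S
    print(f"S = {S}: kappa ~ {kappa(arms, theta_star):.1f}")
    # kappa grows roughly like exp(S): the blow-up the paper removes from
    # the leading sqrt(T) term of the regret.

# Binary reward for a played arm x_t, as in the logistic bandit model:
x_t = arms[0]
r_t = rng.binomial(1, sigmoid(x_t @ theta_star))
```

Since $\kappa$ multiplies the $\sqrt{T}$ term in the prior $\tilde{\mathcal{O}}(\kappa\sqrt{T})$ guarantees, even moderately large decision sets can make those bounds vacuous, which is the gap the proposed algorithm addresses.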
Related papers
- Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as $B_K$, and show that they achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted-OFUL$^+$ and Restarted SAVE$^+$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms improve upon previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z)
- Improved Regret Bounds of (Multinomial) Logistic Bandits via Regret-to-Confidence-Set Conversion [40.159846982245746]
We construct a convex confidence set based only on the existence of an online learning algorithm with a regret guarantee.
Using R2CS, we obtain a strict improvement in the regret bound w.r.t. $S$ in logistic bandits while retaining computational feasibility.
We extend this analysis to multinomial logistic bandits and obtain similar improvements in the regret, showing the efficacy of R2CS.
arXiv Detail & Related papers (2023-10-28T01:27:52Z)
- Improved High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs [62.52390282012508]
We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds.
We develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}\big((\sum_{t=1}^{T}\alpha_t)^{1/2}+\max_{t\in[T]}\alpha_t\big)$ with high probability.
We also develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs.
arXiv Detail & Related papers (2022-10-04T04:36:15Z)
- Leveraging Initial Hints for Free in Stochastic Linear Bandits [38.58449930961883]
We study the setting of optimizing with bandit feedback with additional prior knowledge provided to a learner in the form of an initial hint of the optimal action.
We present a novel algorithm for linear bandits that uses this hint to improve its regret to $\tilde{O}(\sqrt{T})$ when the hint is accurate.
Perhaps surprisingly, our work shows that leveraging a hint yields provable gains without sacrificing worst-case performance.
arXiv Detail & Related papers (2022-03-08T18:48:55Z)
- An Experimental Design Approach for Regret Minimization in Logistic Bandits [26.674062544226636]
The main challenge of logistic bandits is reducing the dependence on a potentially large problem-dependent constant $\kappa$.
We propose a new warmup sampling algorithm that can dramatically reduce the lower order term in the regret in general.
arXiv Detail & Related papers (2022-02-04T21:56:40Z)
- Instance-Wise Minimax-Optimal Algorithms for Logistic Bandits [9.833844886421694]
Logistic bandits have attracted substantial attention by providing an uncluttered yet challenging framework for understanding the impact of non-linearity in parametrized bandits.
We introduce a novel algorithm for which we provide a refined analysis of the effect of non-linearity.
arXiv Detail & Related papers (2020-10-23T20:07:31Z)
- Thresholded Lasso Bandit [70.17389393497125]
Thresholded Lasso bandit is an algorithm that estimates the vector defining the reward function as well as its sparse support.
We establish non-asymptotic regret upper bounds scaling as $\mathcal{O}(\log d + \sqrt{T})$ in general, and as $\mathcal{O}(\log d + \log T)$ under the so-called margin condition.
arXiv Detail & Related papers (2020-10-22T19:14:37Z)
- Impact of Representation Learning in Linear Bandits [83.17684841392754]
We study how representation learning can improve the efficiency of bandit problems.
We present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.
arXiv Detail & Related papers (2020-10-13T16:35:30Z)
- Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon [88.75843804630772]
Episodic reinforcement learning generalizes contextual bandits.
Long planning horizons and unknown state-dependent transitions pose little additional difficulty in terms of sample complexity.
We show MVP enjoys an $\mathcal{O}\left(\sqrt{SAK} + S^2A\right)$ regret, approaching the $\Omega\left(\sqrt{SAK} + S^2A\right)$ lower bound of contextual bandits up to logarithmic terms.
arXiv Detail & Related papers (2020-09-28T17:52:32Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.