MNL-Bandit in non-stationary environments
- URL: http://arxiv.org/abs/2303.02504v2
- Date: Fri, 2 Jun 2023 01:29:18 GMT
- Title: MNL-Bandit in non-stationary environments
- Authors: Ayoub Foussoul, Vineet Goyal, Varun Gupta
- Abstract summary: We present an algorithm with a worst-case expected regret of $\tilde{O}\left( \min \left\{ \sqrt{NTL}\;,\; N^{\frac{1}{3}}(\Delta_{\infty}^{K})^{\frac{1}{3}} T^{\frac{2}{3}} + \sqrt{NT}\right\}\right)$.
In particular, we give a tight characterization of the bias introduced in the estimators due to non-stationarity and derive new concentration bounds.
- Score: 2.7737587389928793
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study the MNL-Bandit problem in a non-stationary
environment and present an algorithm with a worst-case expected regret of
$\tilde{O}\left( \min \left\{ \sqrt{NTL}\;,\;
N^{\frac{1}{3}}(\Delta_{\infty}^{K})^{\frac{1}{3}} T^{\frac{2}{3}} +
\sqrt{NT}\right\}\right)$. Here $N$ is the number of arms, $L$ is the number of
changes and $\Delta_{\infty}^{K}$ is a variation measure of the unknown
parameters. Furthermore, we show matching lower bounds on the expected regret
(up to logarithmic factors), implying that our algorithm is optimal. Our
approach builds upon the epoch-based algorithm for stationary MNL-Bandit in
Agrawal et al. 2016. However, non-stationarity poses several challenges and we
introduce new techniques and ideas to address these. In particular, we give a
tight characterization of the bias introduced in the estimators due to
non-stationarity and derive new concentration bounds.
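The two regimes of the regret bound can be made concrete with a short numeric sketch (all parameter values below are hypothetical and logarithmic factors are dropped):

```python
import math

def regret_bound(N, T, L, delta_inf_K):
    """Evaluate both terms of the worst-case bound
    ~O(min{ sqrt(N*T*L), N^(1/3) * (Delta_inf^K)^(1/3) * T^(2/3) + sqrt(N*T) })
    with log factors dropped, and return the smaller (tighter) one.
    N: number of arms, T: horizon, L: number of changes,
    delta_inf_K: variation measure of the unknown parameters."""
    few_changes = math.sqrt(N * T * L)
    small_variation = N ** (1 / 3) * delta_inf_K ** (1 / 3) * T ** (2 / 3) + math.sqrt(N * T)
    return min(few_changes, small_variation)

# With few abrupt changes, the sqrt(NTL) term is the tighter one ...
print(regret_bound(N=10, T=10**6, L=4, delta_inf_K=1.0))
# ... while with many changes but small total variation, the variation term wins.
print(regret_bound(N=10, T=10**6, L=10**4, delta_inf_K=1.0))
```

This illustrates why the minimum of the two terms matters: neither term dominates the other across all non-stationarity patterns.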
Related papers
- LC-Tsallis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits [38.41164102066483]
This study considers the linear contextual bandit problem with independent and identically distributed contexts.
Our proposed algorithm is based on Follow-The-Regularized-Leader with the Tsallis entropy and is referred to as the $\alpha$-Linear-Contextual (LC)-Tsallis-INF.
arXiv Detail & Related papers (2024-03-05T18:59:47Z) - A Doubly Robust Approach to Sparse Reinforcement Learning [19.68978899041642]
We propose a new regret-minimization algorithm for the episodic sparse linear Markov decision process (SMDP).
The regret of the proposed algorithm is $\tilde{O}(\sigma^{-1}_{\min} s_{\star} H \sqrt{N})$, where $\sigma_{\min}$ denotes the restrictive minimum eigenvalue of the average Gram matrix of feature vectors.
arXiv Detail & Related papers (2023-10-23T18:52:17Z) - A/B Testing and Best-arm Identification for Linear Bandits with
Robustness to Non-stationarity [28.068960555415014]
We investigate the fixed-budget best-arm identification problem for linear bandits in a potentially non-stationary environment.
The algorithm aims to correctly identify the best arm $x^* := \arg\max_{x\in\mathcal{X}} x^{\top}\sum_{t=1}^{T}\theta_t$ with probability as high as possible.
arXiv Detail & Related papers (2023-07-27T19:03:36Z) - Tight Bounds for $\gamma$-Regret via the Decision-Estimation Coefficient [88.86699022151598]
We give a statistical characterization of the $\gamma$-regret for arbitrary structured bandit problems.
The $\gamma$-regret emerges in structured bandit problems over a function class $\mathcal{F}$ where finding an exact optimum of $f \in \mathcal{F}$ is intractable.
arXiv Detail & Related papers (2023-03-06T17:54:33Z) - Parameter-free Regret in High Probability with Heavy Tails [45.11667443216861]
We present new algorithms for online convex optimization over unbounded domains.
We obtain parameter-free regret in high-probability given access only to potentially heavy-tailed subgradient estimates.
arXiv Detail & Related papers (2022-10-25T21:43:44Z) - Variance-Aware Sparse Linear Bandits [64.70681598741417]
The worst-case minimax regret for sparse linear bandits is $\widetilde{\Theta}\left(\sqrt{dT}\right)$.
In the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve an $\widetilde{\mathcal{O}}(1)$ regret.
We develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits.
arXiv Detail & Related papers (2022-05-26T15:55:44Z) - Computationally Efficient Horizon-Free Reinforcement Learning for Linear
Mixture MDPs [111.75736569611159]
We propose the first computationally efficient horizon-free algorithm for linear mixture MDPs.
Our algorithm adapts a weighted least square estimator for the unknown transitional dynamic.
This also improves upon the best-known algorithms in this setting when the $\sigma_k^2$'s are known.
arXiv Detail & Related papers (2022-05-23T17:59:18Z) - Bandits with many optimal arms [68.17472536610859]
We write $p^*$ for the proportion of optimal arms and $\Delta$ for the minimal mean-gap between optimal and sub-optimal arms.
We characterize the optimal learning rates both in the cumulative regret setting, and in the best-arm identification setting.
arXiv Detail & Related papers (2021-03-23T11:02:31Z) - An Algorithm for Stochastic and Adversarial Bandits with Switching Costs [10.549307055348596]
We propose an algorithm for adversarial multi-armed bandits with switching costs, where the algorithm pays a price $\lambda$ every time it switches the arm being played.
Our algorithm is based on adaptation of the Tsallis-INF algorithm of Zimmert and Seldin (2021) and requires no prior knowledge of the regime or time horizon.
arXiv Detail & Related papers (2021-02-19T11:03:51Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Tight Regret Bounds for Noisy Optimization of a Brownian Motion [118.65407541895851]
We consider the problem of Bayesian optimization of a one-dimensional Brownian motion in which the $T$ adaptively chosen observations are corrupted by Gaussian noise.
We show that the smallest possible expected cumulative regret and the smallest possible expected simple regret scale as $\Omega(\sigma\sqrt{T \cdot \log T})$.
arXiv Detail & Related papers (2020-01-25T14:44:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.