Multinomial Logit Contextual Bandits: Provable Optimality and
Practicality
- URL: http://arxiv.org/abs/2103.13929v1
- Date: Thu, 25 Mar 2021 15:42:25 GMT
- Title: Multinomial Logit Contextual Bandits: Provable Optimality and
Practicality
- Authors: Min-hwan Oh, Garud Iyengar
- Abstract summary: We consider a sequential assortment selection problem where the user choice is given by a multinomial logit (MNL) choice model whose parameters are unknown.
We propose upper confidence bound based algorithms for this MNL contextual bandit.
We show that a simple variant of the algorithm achieves the optimal regret for a broad class of important applications.
- Score: 15.533842336139063
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We consider a sequential assortment selection problem where the user choice
is given by a multinomial logit (MNL) choice model whose parameters are
unknown. In each period, the learning agent observes $d$-dimensional
contextual information about the user and the $N$ available items, offers an
assortment of size $K$ to the user, and observes the bandit feedback of the
item chosen from the assortment. We propose upper confidence bound based
algorithms for this MNL contextual bandit. The first algorithm is a simple and
practical method which achieves an $\tilde{\mathcal{O}}(d\sqrt{T})$ regret over
$T$ rounds. Next, we propose a second algorithm which achieves a
$\tilde{\mathcal{O}}(\sqrt{dT})$ regret. This matches the lower bound for the
MNL bandit problem, up to logarithmic terms, and improves on the best known
result by a $\sqrt{d}$ factor. To establish this sharper regret bound, we
present a non-asymptotic confidence bound for the maximum likelihood estimator
of the MNL model that may be of independent interest as its own theoretical
contribution. We then revisit the simpler, significantly more practical, first
algorithm and show that a simple variant of the algorithm achieves the optimal
regret for a broad class of important applications.
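For intuition, the following is a minimal Python sketch of one round of UCB-style assortment selection under an MNL choice model. It is an illustration under simplifying assumptions, not the paper's exact algorithm: item revenues are set to one, the assortment search is brute force, and `theta_hat`, `V_inv`, and `alpha` stand in for the maximum likelihood estimate, the inverse Gram matrix, and the confidence radius that the proposed algorithms maintain.

```python
import numpy as np
from itertools import combinations

def mnl_choice_probs(utilities):
    # P(item i | assortment S) = exp(u_i) / (1 + sum_{j in S} exp(u_j));
    # the constant 1 accounts for the no-purchase (outside) option.
    expu = np.exp(utilities)
    return expu / (1.0 + expu.sum())

def optimistic_utilities(X, theta_hat, V_inv, alpha):
    # Point estimate x_i^T theta_hat plus an exploration bonus
    # alpha * ||x_i||_{V^{-1}} from an (assumed) confidence ellipsoid.
    mean = X @ theta_hat
    bonus = alpha * np.sqrt(np.einsum("ij,jk,ik->i", X, V_inv, X))
    return mean + bonus

def choose_assortment(X, rewards, theta_hat, V_inv, alpha, K):
    # Offer the assortment of size at most K maximizing optimistic expected
    # revenue; brute force here, a combinatorial oracle in practice.
    u = optimistic_utilities(X, theta_hat, V_inv, alpha)
    best_S, best_val = None, -np.inf
    for k in range(1, K + 1):
        for S in map(list, combinations(range(len(rewards)), k)):
            val = rewards[S] @ mnl_choice_probs(u[S])
            if val > best_val:
                best_S, best_val = S, val
    return best_S

# Example round with synthetic contexts (all quantities illustrative).
rng = np.random.default_rng(0)
d, N, K = 5, 8, 3
X = rng.normal(size=(N, d))                 # one context vector per item
rewards = np.ones(N)                        # unit revenues for simplicity
theta_hat = rng.normal(size=d)              # stands in for the MLE of theta
V_inv = np.linalg.inv(np.eye(d) + X.T @ X)  # regularized Gram matrix inverse
print(choose_assortment(X, rewards, theta_hat, V_inv, alpha=1.0, K=K))
```

In the paper's algorithms the exploration term comes from a non-asymptotic confidence bound for the MNL maximum likelihood estimator; the ellipsoidal bonus above is only a stand-in for that machinery.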
Related papers
- Nearly Minimax Optimal Regret for Multinomial Logistic Bandit [8.087699764574788]
We study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information.
There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size $K$.
We propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{\smash[b]{T/K}})$.
arXiv Detail & Related papers (2024-05-16T06:07:31Z)
- Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems [61.85150061213987]
We study the generalized low-rank matrix bandit problem, proposed in Lu et al. (2021) under the Generalized Linear Model (GLM) framework.
To overcome the computational infeasibility and theoretical limitations of existing algorithms, we first propose the G-ESTT framework.
We show that G-ESTT can achieve an $\tilde{O}(\sqrt{(d_1+d_2)^{3/2} M r^{3/2} T})$ regret bound, while G-ESTS can achieve an $\tilde{O}(\dots)$ regret bound.
arXiv Detail & Related papers (2024-01-14T14:14:19Z)
- Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, given as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
- Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements.
We derive an information-theoretic lower bound for the minimax risk under this setting.
We show that our method and upper-bounds can be extended for two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z)
- Maillard Sampling: Boltzmann Exploration Done Optimally [11.282341369957216]
This paper presents a randomized algorithm for the $K$-armed bandit problem.
Maillard sampling (MS) computes the probability of choosing each arm in a closed form (see the sketch after this list).
We propose a variant of MS called MS$^+$ that improves its minimax bound to $\sqrt{KT\log K}$ without losing asymptotic optimality.
arXiv Detail & Related papers (2021-11-05T06:50:22Z)
- Randomized Exploration for Reinforcement Learning with General Value Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z)
- Fully Gap-Dependent Bounds for Multinomial Logit Bandit [5.132017939561661]
We study the multinomial logit (MNL) bandit problem, where at each time step, the seller offers an assortment of size at most $K$ from a pool of $N$ items.
We present (i) an algorithm that identifies the optimal assortment $S^*$ within $\widetilde{O}(\sum_{i=1}^{N} \Delta_i^{-2})$ time steps with high probability, and (ii) an algorithm that incurs $O(\sum_{i \notin S^*} K\Delta_i \dots)$ regret.
arXiv Detail & Related papers (2020-11-19T17:52:12Z)
- Learning to Rank under Multinomial Logit Choice [6.929312022493406]
Learning the optimal ordering of content is an important challenge in website design.
We present theoretical analysis leading to an $\Omega(\sqrt{JT})$ lower bound for the problem, and an $\tilde{O}(\sqrt{JT})$ upper bound on regret of the UCB algorithm.
arXiv Detail & Related papers (2020-09-07T16:15:12Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
- Rate-adaptive model selection over a collection of black-box contextual bandit algorithms [0.966840768820136]
We consider the model selection task in the contextual bandit setting.
Our proposal is the first one to be rate-adaptive for a collection of general black-box contextual bandit algorithms.
arXiv Detail & Related papers (2020-06-05T18:55:16Z)
- Model Selection in Contextual Stochastic Bandit Problems [51.94632035240787]
We develop a meta-algorithm that selects between base algorithms.
We show through a lower bound that even when one of the base algorithms has $O(\sqrt{T})$ regret, in general it is impossible to get better than $\Omega(\sqrt{T})$ regret.
arXiv Detail & Related papers (2020-03-03T18:46:34Z)
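As a sketch of the closed-form arm probabilities referenced in the Maillard sampling entry above: the $\exp(-N_a \hat{\Delta}_a^2 / 2)$ weighting and unit variance scale below are assumptions for illustration, not a verbatim transcription of that paper.

```python
import numpy as np

# Illustrative closed-form arm probabilities in the style of Maillard
# sampling, assuming 1-sub-Gaussian rewards; variable names are ours.
def maillard_probs(means, counts):
    gaps = means.max() - means           # empirical suboptimality gaps
    logits = -0.5 * counts * gaps ** 2   # weight exp(-N_a * gap_a^2 / 2)
    w = np.exp(logits - logits.max())    # subtract max for numerical stability
    return w / w.sum()

means = np.array([0.40, 0.55, 0.50])     # empirical mean reward per arm
counts = np.array([20, 25, 18])          # pull count per arm
p = maillard_probs(means, counts)
arm = np.random.default_rng(0).choice(len(p), p=p)
```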