Multinomial Logit Contextual Bandits: Provable Optimality and
Practicality
- URL: http://arxiv.org/abs/2103.13929v1
- Date: Thu, 25 Mar 2021 15:42:25 GMT
- Title: Multinomial Logit Contextual Bandits: Provable Optimality and
Practicality
- Authors: Min-hwan Oh, Garud Iyengar
- Abstract summary: We consider a sequential assortment selection problem where the user choice is given by a multinomial logit (MNL) choice model whose parameters are unknown.
We propose upper confidence bound based algorithms for this MNL contextual bandit.
We show that a simple variant of the algorithm achieves the optimal regret for a broad class of important applications.
- Score: 15.533842336139063
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We consider a sequential assortment selection problem where the user choice
is given by a multinomial logit (MNL) choice model whose parameters are
unknown. In each period, the learning agent observes $d$-dimensional
contextual information about the user and the $N$ available items, offers an
assortment of size $K$ to the user, and observes the bandit feedback of the
item chosen from the assortment. We propose upper confidence bound based
algorithms for this MNL contextual bandit. The first algorithm is a simple and
practical method which achieves an $\tilde{\mathcal{O}}(d\sqrt{T})$ regret over
$T$ rounds. Next, we propose a second algorithm which achieves a
$\tilde{\mathcal{O}}(\sqrt{dT})$ regret. This matches the lower bound for the
MNL bandit problem, up to logarithmic terms, and improves on the best known
result by a $\sqrt{d}$ factor. To establish this sharper regret bound, we
present a non-asymptotic confidence bound for the maximum likelihood estimator
of the MNL model that may be of independent interest as its own theoretical
contribution. We then revisit the simpler, significantly more practical, first
algorithm and show that a simple variant of the algorithm achieves the optimal
regret for a broad class of important applications.
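For intuition, the following is a minimal Python sketch of one round of UCB-style assortment selection under an MNL choice model. It is an illustration under simplifying assumptions, not the paper's exact algorithm: item revenues are set to one, the assortment search is brute force, and `theta_hat`, `V_inv`, and `alpha` stand in for the maximum likelihood estimate, the inverse Gram matrix, and the confidence radius that the proposed algorithms maintain.

```python
import numpy as np
from itertools import combinations

def mnl_choice_probs(utilities):
    # P(item i | assortment S) = exp(u_i) / (1 + sum_{j in S} exp(u_j));
    # the constant 1 accounts for the no-purchase (outside) option.
    expu = np.exp(utilities)
    return expu / (1.0 + expu.sum())

def optimistic_utilities(X, theta_hat, V_inv, alpha):
    # Point estimate x_i^T theta_hat plus an exploration bonus
    # alpha * ||x_i||_{V^{-1}} from an (assumed) confidence ellipsoid.
    mean = X @ theta_hat
    bonus = alpha * np.sqrt(np.einsum("ij,jk,ik->i", X, V_inv, X))
    return mean + bonus

def choose_assortment(X, rewards, theta_hat, V_inv, alpha, K):
    # Offer the assortment of size at most K maximizing optimistic expected
    # revenue; brute force here, a combinatorial oracle in practice.
    u = optimistic_utilities(X, theta_hat, V_inv, alpha)
    best_S, best_val = None, -np.inf
    for k in range(1, K + 1):
        for S in map(list, combinations(range(len(rewards)), k)):
            val = rewards[S] @ mnl_choice_probs(u[S])
            if val > best_val:
                best_S, best_val = S, val
    return best_S

# Example round with synthetic contexts (all quantities illustrative).
rng = np.random.default_rng(0)
d, N, K = 5, 8, 3
X = rng.normal(size=(N, d))                 # one context vector per item
rewards = np.ones(N)                        # unit revenues for simplicity
theta_hat = rng.normal(size=d)              # stands in for the MLE of theta
V_inv = np.linalg.inv(np.eye(d) + X.T @ X)  # regularized Gram matrix inverse
print(choose_assortment(X, rewards, theta_hat, V_inv, alpha=1.0, K=K))
```

In the paper's algorithms the exploration term comes from a non-asymptotic confidence bound for the MNL maximum likelihood estimator; the ellipsoidal bonus above is only a stand-in for that machinery.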
Related papers
- Nearly Minimax Optimal Regret for Multinomial Logistic Bandit [8.087699764574788]
We study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information.
There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size $K$.
We propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{\smash[b]{T/K}})$.
arXiv Detail & Related papers (2024-05-16T06:07:31Z)
- Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems [61.85150061213987]
We study the generalized low-rank matrix bandit problem, proposed in Lu et al. (2021) under the Generalized Linear Model (GLM) framework.
To overcome the computational infeasibility and theoretical limitations of existing algorithms, we first propose the G-ESTT framework.
We show that G-ESTT can achieve an $\tilde{O}(\sqrt{(d_1+d_2)^{3/2} M r^{3/2} T})$ regret bound, while G-ESTS can achieve an $\tilde{O}(\dots)$ regret bound.
arXiv Detail & Related papers (2024-01-14T14:14:19Z)
- Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, given as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
- Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements.
We derive an information-theoretic lower bound for the minimax risk under this setting.
We show that our method and upper-bounds can be extended for two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z)
- Maillard Sampling: Boltzmann Exploration Done Optimally [11.282341369957216]
This paper presents a randomized algorithm for the $K$-armed bandit problem.
Maillard sampling (MS) computes the probability of choosing each arm in a closed form (see the sketch after this list).
We propose a variant of MS called MS$^+$ that improves its minimax bound to $\sqrt{KT\log K}$ without losing asymptotic optimality.
arXiv Detail & Related papers (2021-11-05T06:50:22Z)
- Randomized Exploration for Reinforcement Learning with General Value Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z)
- Fully Gap-Dependent Bounds for Multinomial Logit Bandit [5.132017939561661]
We study the multinomial logit (MNL) bandit problem, where at each time step, the seller offers an assortment of size at most $K$ from a pool of $N$ items.
We present (i) an algorithm that identifies the optimal assortment $S^*$ within $\widetilde{O}(\sum_{i=1}^{N} \Delta_i^{-2})$ time steps with high probability, and (ii) an algorithm that incurs $O(\sum_{i \notin S^*} K\Delta_i \dots)$ regret.
arXiv Detail & Related papers (2020-11-19T17:52:12Z)
- Learning to Rank under Multinomial Logit Choice [6.929312022493406]
Learning the optimal ordering of content is an important challenge in website design.
We present theoretical analysis leading to an $\Omega(\sqrt{JT})$ lower bound for the problem, and an $\tilde{O}(\sqrt{JT})$ upper bound on regret of the UCB algorithm.
arXiv Detail & Related papers (2020-09-07T16:15:12Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
- Rate-adaptive model selection over a collection of black-box contextual bandit algorithms [0.966840768820136]
We consider the model selection task in the contextual bandit setting.
Our proposal is the first one to be rate-adaptive for a collection of general black-box contextual bandit algorithms.
arXiv Detail & Related papers (2020-06-05T18:55:16Z)
- Model Selection in Contextual Stochastic Bandit Problems [51.94632035240787]
We develop a meta-algorithm that selects between base algorithms.
We show through a lower bound that even when one of the base algorithms has $O(\sqrt{T})$ regret, in general it is impossible to get better than $\Omega(\sqrt{T})$ regret.
arXiv Detail & Related papers (2020-03-03T18:46:34Z)
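As a sketch of the closed-form arm probabilities referenced in the Maillard sampling entry above: the $\exp(-N_a \hat{\Delta}_a^2 / 2)$ weighting and unit variance scale below are assumptions for illustration, not a verbatim transcription of that paper.

```python
import numpy as np

# Illustrative closed-form arm probabilities in the style of Maillard
# sampling, assuming 1-sub-Gaussian rewards; variable names are ours.
def maillard_probs(means, counts):
    gaps = means.max() - means           # empirical suboptimality gaps
    logits = -0.5 * counts * gaps ** 2   # weight exp(-N_a * gap_a^2 / 2)
    w = np.exp(logits - logits.max())    # subtract max for numerical stability
    return w / w.sum()

means = np.array([0.40, 0.55, 0.50])     # empirical mean reward per arm
counts = np.array([20, 25, 18])          # pull count per arm
p = maillard_probs(means, counts)
arm = np.random.default_rng(0).choice(len(p), p=p)
```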