MNL-Bandit with Knapsacks: a near-optimal algorithm
- URL: http://arxiv.org/abs/2106.01135v5
- Date: Wed, 24 Jan 2024 14:56:44 GMT
- Title: MNL-Bandit with Knapsacks: a near-optimal algorithm
- Authors: Abdellah Aznag, Vineet Goyal and Noemie Perivier
- Abstract summary: We consider a dynamic assortment selection problem where a seller has a fixed inventory of $N$ substitutable products.
In each period, the seller needs to decide on the assortment of products to offer to the customers.
We show that when the inventory size grows quasi-linearly in time, MNLwK-UCB achieves a $tildeO(N + sqrtNT)$ regret bound.
- Score: 2.3020018305241337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider a dynamic assortment selection problem where a seller has a fixed
inventory of $N$ substitutable products and faces an unknown demand that
arrives sequentially over $T$ periods. In each period, the seller needs to
decide on the assortment of products (satisfying certain constraints) to offer
to the customers. The customer's response follows an unknown multinomial logit
model (MNL) with parameter $\boldsymbol{v}$. If customer selects product $i \in
[N]$, the seller receives revenue $r_i$. The goal of the seller is to maximize
the total expected revenue from the $T$ customers given the fixed initial
inventory of $N$ products. We present MNLwK-UCB, a UCB-based algorithm and
characterize its regret under different regimes of inventory size. We show that
when the inventory size grows quasi-linearly in time, MNLwK-UCB achieves a
$\tilde{O}(N + \sqrt{NT})$ regret bound. We also show that for a smaller
inventory (with growth $\sim T^{\alpha}$, $\alpha < 1$), MNLwK-UCB achieves a
$\tilde{O}(N(1 + T^{\frac{1 - \alpha}{2}}) + \sqrt{NT})$. In particular, over a
long time horizon $T$, the rate $\tilde{O}(\sqrt{NT})$ is always achieved
regardless of the constraints and the size of the inventory.
Related papers
- Improved Algorithms for Contextual Dynamic Pricing [24.530341596901476]
In contextual dynamic pricing, a seller sequentially prices goods based on contextual information.
Our algorithm achieves an optimal regret bound of $tildemathcalO(T2/3)$, improving the existing results.
For this model, our algorithm obtains a regret $tildemathcalO(Td+2beta/d+3beta)$, where $d$ is the dimension of the context space.
arXiv Detail & Related papers (2024-06-17T08:26:51Z) - Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with constraint on the number of batches.
We design a computational efficient algorithm to achieve near-optimal regret of $tildeO(sqrtSAH3Kln (1/delta))$tildeO(cdot) hides logarithmic terms of $(S,A,H,K)$ in $K$ episodes.
Our technical contribution are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - No-Regret Learning in Partially-Informed Auctions [85.67897346422122]
We study a machine learning formulation of auctions with partially-revealed information.
In each round, a new item is drawn from an unknown distribution and the platform publishes a price together with incomplete, "masked" information about the item.
We show that when the distribution over items is known to the buyer and the mask is a SimHash function mapping $mathbbRd$ to $0,1ell$, our algorithm has regret $tilde mathcalO((Tdell)frac12)$.
arXiv Detail & Related papers (2022-02-22T01:15:51Z) - Dynamic pricing and assortment under a contextual MNL demand [2.1320960069210475]
We consider dynamic multi-product pricing and assortment problems under an unknown demand over T periods.
We propose a randomized dynamic pricing policy based on a variant of the Online Newton Step algorithm (ONS)
We also present a new optimistic algorithm for the adversarial MNL contextual bandits problem.
arXiv Detail & Related papers (2021-10-19T14:37:10Z) - Policy Optimization Using Semiparametric Models for Dynamic Pricing [1.3428344011390776]
We study the contextual dynamic pricing problem where the market value of a product is linear in its observed features plus some market noise.
We propose a dynamic statistical learning and decision-making policy that combines semiparametric estimation from a generalized linear model with an unknown link and online decision-making.
arXiv Detail & Related papers (2021-09-13T23:50:01Z) - Stochastic Shortest Path: Minimax, Parameter-Free and Towards
Horizon-Free Regret [144.6358229217845]
We study the problem of learning in the shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus.
We prove that EB-SSP achieves the minimax regret rate $widetildeO(B_star sqrtS A K)$, where $K$ is the number of episodes, $S$ is the number of states, $A$
arXiv Detail & Related papers (2021-04-22T17:20:48Z) - UCB-based Algorithms for Multinomial Logistic Regression Bandits [31.67685495996986]
We use multinomial logit (MNL) to model the probability of each one of $K+1geq 2$ possible outcomes.
We present MNL-UCB, an upper confidence bound (UCB)-based algorithm, that regret achieves $tildemathcalO(dKsqrtT)$ with small dependency on problem-dependent constants.
arXiv Detail & Related papers (2021-03-21T21:09:55Z) - Provably Efficient Reinforcement Learning with Linear Function
Approximation Under Adaptivity Constraints [94.76881135901753]
We consider two popular limited adaptivity models: batch learning model and rare policy switch model.
Our proposed LSVI-UCB-Batch algorithm achieves an $tilde O(sqrtd3H3T + dHT/B)$ regret.
For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an $tilde O(sqrtd3H3T[1+T/(dH)]dH/B)$ regret.
arXiv Detail & Related papers (2021-01-06T18:56:07Z) - Fully Gap-Dependent Bounds for Multinomial Logit Bandit [5.132017939561661]
We study the multinomial logit (MNL) bandit problem, where at each time step, the seller offers an assortment of size at most $K$ from a pool of $N$ items.
We present (i) an algorithm that identifies the optimal assortment $S*$ within $widetildeO(sum_i = 1N Delta_i-2)$ time steps with high probability, and (ii) an algorithm that incurs $O(sum_i notin S* KDelta_i
arXiv Detail & Related papers (2020-11-19T17:52:12Z) - Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs [99.59319332864129]
We show that UCBVI-$gamma$ achieves an $tildeObig(sqrtSAT/ (1-gamma)1.5big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $gamma$ is the discount factor and $T$ is the number of steps.
In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $tildeOmegabig(sqrtSAT/
arXiv Detail & Related papers (2020-10-01T17:57:47Z) - The Influence of Shape Constraints on the Thresholding Bandit Problem [74.73898216415049]
We investigate the Thresholding Bandit problem (TBP) under several shape constraints.
In the TBP problem the aim is to output, at the end of the sequential game, the set of arms whose means are above a given threshold.
We prove that the minimax rates for the regret are (i) $sqrtlog(K)K/T$ for TBP, (ii) $sqrtlog(K)/T$ for UTBP, (iii) $sqrtK/T$ for CTBP and (iv)
arXiv Detail & Related papers (2020-06-17T17:12:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.