Dynamic pricing and assortment under a contextual MNL demand
- URL: http://arxiv.org/abs/2110.10018v1
- Date: Tue, 19 Oct 2021 14:37:10 GMT
- Title: Dynamic pricing and assortment under a contextual MNL demand
- Authors: Vineet Goyal and Noemie Perivier
- Abstract summary: We consider dynamic multi-product pricing and assortment problems under an unknown demand over T periods.
We propose a randomized dynamic pricing policy based on a variant of the Online Newton Step algorithm (ONS)
We also present a new optimistic algorithm for the adversarial MNL contextual bandits problem.
- Score: 2.1320960069210475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider dynamic multi-product pricing and assortment problems under an
unknown demand over T periods, where in each period, the seller decides on the
price for each product or the assortment of products to offer to a customer who
chooses according to an unknown Multinomial Logit Model (MNL). Such problems
arise in many applications, including online retail and advertising. We propose
a randomized dynamic pricing policy based on a variant of the Online Newton
Step algorithm (ONS) that achieves a $O(d\sqrt{T}\log(T))$ regret guarantee
under an adversarial arrival model. We also present a new optimistic algorithm
for the adversarial MNL contextual bandits problem, which achieves a better
dependency than the state-of-the-art algorithms in a problem-dependent constant
$\kappa$ (potentially exponentially small). Our regret upper bounds scale as
$\tilde{O}(d\sqrt{\kappa T}+ \log(T)/\kappa)$, which gives a significantly
stronger bound than the existing $\tilde{O}(d\sqrt{T}/\kappa)$ guarantees.
Related papers
- Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees [91.88803125231189]
Multi-step Preference Optimization (MPO) is built upon the natural actor-critic frameworkciteprakhlin2013online,joulani17a.
We show that OMPO requires $mathcalO(epsilon-1)$ policy updates to converge to an $epsilon$-approximate Nash equilibrium.
We also validate the effectiveness of our method on multi-turn conversations dataset and math reasoning dataset.
arXiv Detail & Related papers (2025-02-18T09:33:48Z) - A Primal-Dual Online Learning Approach for Dynamic Pricing of Sequentially Displayed Complementary Items under Sale Constraints [54.46126953873298]
We address the problem of dynamically pricing complementary items that are sequentially displayed to customers.
Coherent pricing policies for complementary items are essential because optimizing the pricing of each item individually is ineffective.
We empirically evaluate our approach using synthetic settings randomly generated from real-world data, and compare its performance in terms of constraints violation and regret.
arXiv Detail & Related papers (2024-07-08T09:55:31Z) - Contextual Dynamic Pricing: Algorithms, Optimality, and Local Differential Privacy Constraints [10.057344315478709]
We study the contextual dynamic pricing problem where a firm sells products to $T$ sequentially arriving consumers.
We first show that the optimal regret upper bound is of order $sqrtdT$, up to a logarithmic factor.
A key insight of our theoretical result is an intrinsic connection between dynamic pricing and the contextual multi-armed bandit problem.
arXiv Detail & Related papers (2024-06-04T15:44:10Z) - The Real Price of Bandit Information in Multiclass Classification [73.17969992976501]
We revisit the classical problem of multiclass classification with bandit feedback.
We present a new bandit classification algorithm that guarantees regret $smashwidetildeO(|H|+sqrtT)$.
arXiv Detail & Related papers (2024-05-16T12:11:09Z) - Autoregressive Bandits [58.46584210388307]
We propose a novel online learning setting, Autoregressive Bandits, in which the observed reward is governed by an autoregressive process of order $k$.
We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed.
We then devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $widetildemathcalO left( frac(k+1)3/2sqrtnT (1-G
arXiv Detail & Related papers (2022-12-12T21:37:36Z) - Horizon-Free Reinforcement Learning in Polynomial Time: the Power of
Stationary Policies [88.75843804630772]
We design an algorithm that achieves an $Oleft(mathrmpoly(S,A,log K)sqrtKright)$ regret in contrast to existing bounds.
Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies.
arXiv Detail & Related papers (2022-03-24T08:14:12Z) - MNL-Bandit with Knapsacks: a near-optimal algorithm [2.3020018305241337]
We consider a dynamic assortment selection problem where a seller has a fixed inventory of $N$ substitutable products.
In each period, the seller needs to decide on the assortment of products to offer to the customers.
We show that when the inventory size grows quasi-linearly in time, MNLwK-UCB achieves a $tildeO(N + sqrtNT)$ regret bound.
arXiv Detail & Related papers (2021-06-02T13:05:34Z) - Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost.
We show that the minimax regret for this setting is $widetilde O(B_star sqrt|S| |A| K)$ where $B_star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in-time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z) - Logarithmic Regret in Feature-based Dynamic Pricing [0.0]
Feature-based dynamic pricing is an increasingly popular model of setting prices for highly differentiated products.
We provide two algorithms for infracigen and adversarial feature settings, and prove the optimal $O(dlogT)$ regret bounds for both.
We also prove an $(sqrtT)$ information-theoretic lower bound for a slightly more general setting, which demonstrates that "knowing-the-demand-curve" leads to an exponential improvement in feature-based dynamic pricing.
arXiv Detail & Related papers (2021-02-20T00:45:33Z) - A Tractable Online Learning Algorithm for the Multinomial Logit Contextual Bandit [2.9998316151418107]
We consider a dynamic set optimization problem, where a decision-maker offers a subset of products to a consumer.
We model consumer choice behavior using the widely used Multinomial Logit (MNL) model.
We show that the regret is bounded by $O(sqrtdT + kappa)$, significantly improving the performance over existing methods.
arXiv Detail & Related papers (2020-11-28T00:20:36Z) - Fully Gap-Dependent Bounds for Multinomial Logit Bandit [5.132017939561661]
We study the multinomial logit (MNL) bandit problem, where at each time step, the seller offers an assortment of size at most $K$ from a pool of $N$ items.
We present (i) an algorithm that identifies the optimal assortment $S*$ within $widetildeO(sum_i = 1N Delta_i-2)$ time steps with high probability, and (ii) an algorithm that incurs $O(sum_i notin S* KDelta_i
arXiv Detail & Related papers (2020-11-19T17:52:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.