Phase Transitions in Learning and Earning under Price Protection
Guarantee
- URL: http://arxiv.org/abs/2211.01798v1
- Date: Thu, 3 Nov 2022 13:36:00 GMT
- Title: Phase Transitions in Learning and Earning under Price Protection
Guarantee
- Authors: Qing Feng and Ruihao Zhu and Stefanus Jasin
- Abstract summary: We study the impact of such a policy on the design of online learning algorithms for data-driven dynamic pricing.
We show that the optimal regret is $\tilde{\Theta}(\sqrt{T}+\min\{M,\,T^{2/3}\})$ by first establishing a fundamental impossible regime.
We propose LEAP, a phased exploration type algorithm for \underline{L}earning and \underline{EA}rning under \underline{P}rice Protection.
- Score: 4.683806391173103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by the prevalence of ``price protection guarantee'', which allows a
customer who purchased a product in the past to receive a refund from the
seller during the so-called price protection period (typically defined as a
certain time window after the purchase date) in case the seller decides to
lower the price, we study the impact of such a policy on the design of online
learning algorithms for data-driven dynamic pricing with initially unknown
customer demand. We consider a setting where a firm sells a product over a
horizon of $T$ time steps. For this setting, we characterize how the value of
$M$, the length of price protection period, can affect the optimal regret of
the learning process. We show that the optimal regret is
$\tilde{\Theta}(\sqrt{T}+\min\{M,\,T^{2/3}\})$ by first establishing a
fundamental impossible regime with novel regret lower bound instances. Then, we
propose LEAP, a phased exploration type algorithm for \underline{L}earning and
\underline{EA}rning under \underline{P}rice Protection to match this lower
bound up to logarithmic factors or even doubly logarithmic factors (when there
are only two prices available to the seller). Our results reveal the surprising
phase transitions of the optimal regret with respect to $M$. Specifically, when
$M$ is not too large, the optimal regret has no major difference when compared
to that of the classic setting with no price protection guarantee. We also show
that there exists an upper limit on how much the optimal regret can deteriorate
when $M$ grows large. Finally, we conduct extensive numerical experiments to
show the benefit of LEAP over other heuristic methods for this problem.
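To make the guarantee and the bound concrete, here is a minimal sketch (Python). The one-unit-sale-per-period simplification, the function names, and the choice of $T$ are illustrative assumptions made here, not the paper's model; the sketch only mirrors the refund rule stated above and the envelope $\sqrt{T}+\min\{M,\,T^{2/3}\}$.

```python
import math


def refund_owed(past_prices: list[float], new_price: float, M: int) -> float:
    """Refund triggered by lowering the price to `new_price`: every customer
    who bought within the last M periods at a higher price is reimbursed the
    difference. Stylized: one unit sold per period (an assumption made here
    for illustration; the paper's demand model is more general)."""
    protected = past_prices[-M:] if M > 0 else []
    return sum(max(p - new_price, 0.0) for p in protected)


def regret_envelope(T: int, M: int) -> float:
    """Order of the optimal regret, tilde-Theta(sqrt(T) + min(M, T^(2/3))),
    with constants and logarithmic factors dropped."""
    return math.sqrt(T) + min(M, T ** (2 / 3))


if __name__ == "__main__":
    T = 1_000_000
    # Phase transitions in M: below roughly sqrt(T), the envelope stays at the
    # classic sqrt(T) rate; beyond roughly T^(2/3), it saturates at T^(2/3).
    for M in (0, int(math.sqrt(T)), int(T ** (2 / 3)), T):
        print(f"M = {M:>9,}  ->  regret order ~ {regret_envelope(T, M):,.0f}")

    # Dropping the price from 10.0 to 8.0 with M = 3 refunds the last three
    # buyers 2.0 each under the stylized one-unit-per-period rule.
    print(refund_owed([10.0, 10.0, 10.0, 10.0], 8.0, 3))  # -> 6.0
```

The printout makes the saturation visible: growing $M$ past $T^{2/3}$ no longer worsens the envelope, matching the abstract's claim that there is an upper limit on how much the optimal regret can deteriorate.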
Related papers
- Improved Algorithms for Contextual Dynamic Pricing [24.530341596901476]
In contextual dynamic pricing, a seller sequentially prices goods based on contextual information.
Our algorithm achieves an optimal regret bound of $\tilde{\mathcal{O}}(T^{2/3})$, improving the existing results.
For this model, our algorithm obtains a regret $\tilde{\mathcal{O}}(T^{(d+2\beta)/(d+3\beta)})$, where $d$ is the dimension of the context space.
arXiv Detail & Related papers (2024-06-17T08:26:51Z)
- Dynamic Pricing and Learning with Long-term Reference Effects [16.07344044662994]
We study a simple and novel reference price mechanism where reference price is the average of the past prices offered by the seller.
We show that under this mechanism, a markdown policy is near-optimal irrespective of the parameters of the model.
We then consider a more challenging dynamic pricing and learning problem, where the demand model parameters are a priori unknown.
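As a quick illustration of the mechanism in this entry, a minimal sketch (Python; the function name is ours, and uniform weighting over all past offered prices is how the summary reads):

```python
def reference_price(past_prices: list[float]) -> float:
    """Reference price at time t: the plain average of all prices the seller
    has offered so far (uniform memory, per the summary above)."""
    if not past_prices:
        raise ValueError("no prices offered yet")
    return sum(past_prices) / len(past_prices)
```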
arXiv Detail & Related papers (2024-02-19T21:36:54Z)
- Dynamic Pricing and Learning with Bayesian Persuasion [18.59029578133633]
We consider a novel dynamic pricing and learning setting where, in addition to setting prices of products, the seller also ex-ante commits to 'advertising schemes'.
We use the popular Bayesian persuasion framework to model the effect of these signals on the buyers' valuation and purchase responses.
We design an online algorithm that can use past purchase responses to adaptively learn the optimal pricing and advertising strategy.
arXiv Detail & Related papers (2023-04-27T17:52:06Z)
- Repeated Bilateral Trade Against a Smoothed Adversary [5.939280057673226]
We study repeated bilateral trade where an adaptive $\sigma$-smooth adversary generates the valuations of sellers and buyers.
We provide a complete characterization of the regret regimes for fixed-price mechanisms under different feedback models.
arXiv Detail & Related papers (2023-02-21T16:30:10Z)
- Autoregressive Bandits [58.46584210388307]
We propose a novel online learning setting, Autoregressive Bandits, in which the observed reward is governed by an autoregressive process of order $k$.
We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed.
We then devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}}\left(\frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2}\right)$.
arXiv Detail & Related papers (2022-12-12T21:37:36Z)
- Complete Policy Regret Bounds for Tallying Bandits [51.039677652803675]
Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary.
We study restrictions on the adversary that enable efficient minimization of the \emph{complete policy regret}.
We provide an algorithm that, w.h.p., achieves a complete policy regret guarantee of $\tilde{\mathcal{O}}(mK\sqrt{T})$, where the $\tilde{\mathcal{O}}$ notation hides only logarithmic factors.
arXiv Detail & Related papers (2022-04-24T03:10:27Z)
- The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret and Policy Switches [84.54669549718075]
We study the problem of regret minimization for episodic Reinforcement Learning (RL).
We focus on learning with general function classes and general model classes.
We show that a logarithmic regret bound is realizable by algorithms with $O(\log T)$ switching cost.
arXiv Detail & Related papers (2022-03-03T02:55:55Z)
- First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach [57.570201404222935]
We show that it is possible to obtain regret scaling as $\mathcal{O}(\sqrt{V_1^{\star} K})$ in reinforcement learning with large state spaces.
We demonstrate that existing techniques based on least squares estimation are insufficient to obtain this result.
arXiv Detail & Related papers (2021-12-07T00:29:57Z)
- Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Stochastic Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde{O}(B_\star \sqrt{|S||A|K})$ where $B_\star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z)
- Fast Rates for the Regret of Offline Reinforcement Learning [69.23654172273085]
We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP).
We show that given any estimate for the optimal quality function $Q^*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q^*$-estimate's pointwise convergence rate.
arXiv Detail & Related papers (2021-01-31T16:17:56Z)