Constant Regret Re-solving Heuristics for Price-based Revenue Management
- URL: http://arxiv.org/abs/2009.02861v2
- Date: Tue, 29 Dec 2020 20:01:20 GMT
- Title: Constant Regret Re-solving Heuristics for Price-based Revenue Management
- Authors: Yining Wang and He Wang
- Abstract summary: We show that a natural re-solving heuristic attains $O(1)$ regret compared to the value of the optimal policy.
We also prove that there is an $\Omega(\ln T)$ gap between the value of the optimal policy and that of the fluid model.
- Score: 17.344365325417602
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Price-based revenue management is an important problem in operations
management with many practical applications. The problem considers a retailer
who sells a product (or multiple products) over $T$ consecutive time periods
and is subject to constraints on the initial inventory levels. While the
optimal pricing policy could be obtained via dynamic programming, such an
approach is sometimes undesirable because of high computational costs.
Approximate policies, such as the re-solving heuristics, are often applied as
computationally tractable alternatives. In this paper, we show the following
two results. First, we prove that a natural re-solving heuristic attains $O(1)$
regret compared to the value of the optimal policy. This improves the $O(\ln
T)$ regret upper bound established in the prior work of
\cite{jasin2014reoptimization}. Second, we prove that there is an $\Omega(\ln
T)$ gap between the value of the optimal policy and that of the fluid model.
This complements our upper bound result by showing that the fluid model is not an
adequate information-relaxed benchmark when analyzing price-based revenue
management algorithms.
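To make the mechanism concrete, here is a minimal sketch of the kind of re-solving heuristic the paper analyzes, under an assumed single-product instance with linear expected demand and Poisson realizations; the demand model and all names (`a`, `b`, `demand_rate`, `fluid_price`) are illustrative assumptions, not taken from the paper. Each period, the heuristic re-solves the fluid (deterministic) relaxation with the current remaining inventory and remaining horizon, and posts the resulting price:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical single-product instance (illustrative, not from the paper):
# expected demand per period is lambda(p) = max(a - b * p, 0).
a, b = 10.0, 1.0

def demand_rate(p):
    return max(a - b * p, 0.0)

def fluid_price(inventory, periods_left):
    """Solve the fluid (deterministic) relaxation for the remaining horizon:
    maximize per-period expected revenue, with expected demand capped by the
    per-period inventory budget."""
    budget = inventory / periods_left
    neg_revenue = lambda p: -p * min(demand_rate(p), budget)
    res = minimize_scalar(neg_revenue, bounds=(0.0, a / b), method="bounded")
    return res.x

def resolving_heuristic(T, initial_inventory, seed=0):
    rng = np.random.default_rng(seed)
    inventory, revenue = initial_inventory, 0.0
    for t in range(T):
        price = fluid_price(inventory, T - t)   # re-solve every period
        sales = min(rng.poisson(demand_rate(price)), inventory)
        revenue += price * sales
        inventory -= sales
        if inventory == 0:
            break
    return revenue

print(resolving_heuristic(T=100, initial_inventory=400))
```

The fluid relaxation replaces stochastic demand with its expectation; its optimal value is exactly the benchmark that the paper shows sits $\Omega(\ln T)$ above the value of the optimal policy, which is why the $O(1)$ regret result is stated against the optimal policy rather than the fluid value.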
Related papers
- Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs [63.47351876442425]
We study episodic linear mixture MDPs with unknown transitions and adversarial rewards under full-information feedback.
We propose a novel algorithm that combines the benefits of two popular methods: occupancy-measure-based and policy-based.
Our algorithm enjoys an $\widetilde{\mathcal{O}}\big(d\sqrt{H^3K} + \sqrt{HK(H+\bar{P}_K)}\big)$ dynamic regret, where $d$ is the feature dimension.
arXiv Detail & Related papers (2024-11-05T13:55:52Z) - Learning with Posterior Sampling for Revenue Management under Time-varying Demand [36.22276574805786]
We discuss the revenue management problem of maximizing revenue by pricing items or services.
One challenge in this problem is that the demand distribution is unknown and varies over time in real applications such as airline and retail industries.
arXiv Detail & Related papers (2024-05-08T09:28:26Z) - Demand Balancing in Primal-Dual Optimization for Blind Network Revenue Management [6.72809363581332]
This paper proposes a practically efficient algorithm with optimal theoretical regret which solves the classical network revenue management problem with unknown, nonparametric demand.
A key technical contribution is the so-called demand balancing, which pairs the primal solution (i.e., the price) in each time period with another price to offset the violation of slackness on resource inventory constraints.
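As a rough, hypothetical illustration of that pairing idea (a toy sketch, not the paper's algorithm; `rho` denotes an assumed per-period inventory budget and `price_grid` a finite set of candidate prices):

```python
def balancing_price(p, demand_rate, rho, price_grid):
    """Toy sketch of demand balancing: pair price p with the grid price q
    whose expected demand undershoots the per-period budget rho by roughly
    as much as p overshoots it, so the pair is budget-neutral on average."""
    overshoot = demand_rate(p) - rho
    return min(price_grid, key=lambda q: abs((demand_rate(q) - rho) + overshoot))
```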
arXiv Detail & Related papers (2024-04-06T01:39:51Z) - Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems [61.85150061213987]
We study the generalized low-rank matrix bandit problem, proposed in \cite{lu2021low} under the Generalized Linear Model (GLM) framework.
To overcome the computational infeasibility and theoretical limitations of existing algorithms, we first propose the G-ESTT framework.
We show that G-ESTT can achieve a $\tilde{O}\big(\sqrt{(d_1+d_2)^{3/2}Mr^{3/2}T}\big)$ regret bound, while G-ESTS can achieve a comparable $\tilde{O}(\cdot)$ bound.
arXiv Detail & Related papers (2024-01-14T14:14:19Z) - Estimating Optimal Policy Value in General Linear Contextual Bandits [50.008542459050155]
In many bandit problems, the maximal reward achievable by a policy is unknown in advance.
We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable.
We present a more practical, computationally efficient algorithm that estimates a problem-dependent upper bound on $V^*$.
arXiv Detail & Related papers (2023-02-19T01:09:24Z) - Best of Both Worlds Policy Optimization [33.13041034490332]
We show that by properly designing the regularizer, the exploration bonus and the learning rates, one can achieve a more favorable $\mathrm{polylog}(T)$ regret when the losses are adversarial.
This is the first time a gap-dependent $\mathrm{polylog}(T)$ regret bound is shown for policy optimization.
arXiv Detail & Related papers (2023-02-18T19:46:11Z) - Learning to Order for Inventory Systems with Lost Sales and Uncertain Supplies [21.690446677016247]
We consider a lost-sales inventory control system with a lead time $L$ over a planning horizon $T$. Supply is uncertain, and is a function of the order quantity.
We show that our algorithm achieves a regret (i.e., the performance gap between the cost of our algorithm and that of an optimal policy over $T$ periods) of $O(L+\sqrt{T})$ when $L\geq\log(T)$.
arXiv Detail & Related papers (2022-07-10T22:11:32Z) - Improved Regret Bound and Experience Replay in Regularized Policy Iteration [22.621710838468097]
We study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation.
We first show that the regret analysis of the Politex algorithm can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions.
Our result provides the first high-probability $O(\sqrt{T})$ regret bound for a computationally efficient algorithm in this setting.
arXiv Detail & Related papers (2021-02-25T00:55:07Z) - Online Apprenticeship Learning [58.45089581278177]
In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function.
The goal is to find a policy that matches the expert's performance on some predefined set of cost functions.
We show that the OAL problem can be effectively solved by combining two mirror descent based no-regret algorithms.
arXiv Detail & Related papers (2021-02-13T12:57:51Z) - Efficient Optimistic Exploration in Linear-Quadratic Regulators via Lagrangian Relaxation [107.06364966905821]
We study the exploration-exploitation dilemma in the linear quadratic regulator (LQR) setting.
Inspired by the extended value iteration algorithm used in optimistic algorithms for finite MDPs, we propose to relax the optimistic optimization of OFU-LQ.
We show that an $\epsilon$-optimistic controller can be computed efficiently by solving at most $O\big(\log(1/\epsilon)\big)$ Riccati equations.
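For context on why counting Riccati solves is a natural complexity measure here: computing an LQR controller reduces to solving a discrete algebraic Riccati equation. A minimal sketch with stand-in system matrices (the matrices and cost weights below are arbitrary placeholders, not from the paper):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Stand-in discrete-time system with quadratic cost x'Qx + u'Ru.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)  # state cost
R = np.eye(1)  # control cost

P = solve_discrete_are(A, B, Q, R)                  # Riccati solution
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal feedback gain
# The resulting LQR policy is u_t = -K @ x_t.
```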
arXiv Detail & Related papers (2020-07-13T16:30:47Z) - Provably Efficient Exploration in Policy Optimization [117.09887790160406]
This paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO).
OPPO achieves $\tilde{O}(\sqrt{d^2H^3T})$ regret.
To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
arXiv Detail & Related papers (2019-12-12T08:40:02Z)