Learning to Order for Inventory Systems with Lost Sales and Uncertain
Supplies
- URL: http://arxiv.org/abs/2207.04550v4
- Date: Tue, 31 Oct 2023 06:08:41 GMT
- Title: Learning to Order for Inventory Systems with Lost Sales and Uncertain
Supplies
- Authors: Boxiao Chen, Jiashuo Jiang, Jiawei Zhang and Zhengyuan Zhou
- Abstract summary: We consider a lost-sales inventory control system with a lead time $L$ over a planning horizon $T$. Supply is uncertain, and is a function of the order quantity.
We show that our algorithm achieves a regret (i.e., the performance gap between the cost of our algorithm and that of an optimal policy over $T$ periods) of $O(L+\sqrt{T})$ when $L\geq\log(T)$.
- Score: 21.690446677016247
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider a stochastic lost-sales inventory control system with a lead time
$L$ over a planning horizon $T$. Supply is uncertain, and is a function of the
order quantity (due to random yield/capacity, etc). We aim to minimize the
$T$-period cost, a problem that is known to be computationally intractable even
under known distributions of demand and supply. In this paper, we assume that
both the demand and supply distributions are unknown and develop a
computationally efficient online learning algorithm. We show that our algorithm
achieves a regret (i.e. the performance gap between the cost of our algorithm
and that of an optimal policy over $T$ periods) of $O(L+\sqrt{T})$ when
$L\geq\log(T)$. We do so by 1) showing that our algorithm's cost is higher by at most
$O(L+\sqrt{T})$ for any $L\geq 0$ than that of an optimal constant-order policy
under complete information (a well-known and widely-used policy) and 2)
leveraging the latter's known performance guarantee from the existing literature. To the
best of our knowledge, a finite-sample $O(\sqrt{T})$ (and polynomial in $L$)
regret bound benchmarked against an optimal policy was not previously known in
the online inventory control literature. A key challenge in this learning
problem is that both demand and supply data can be censored; hence only
truncated values are observable. We circumvent this challenge by showing that
the data generated under an order quantity $q^2$ allows us to simulate the
performance of not only $q^2$ but also $q^1$ for all $q^1<q^2$, a key
observation to obtain sufficient information even under data censoring. By
establishing a high probability coupling argument, we are able to evaluate and
compare the performance of different order policies at their steady state
within a finite time horizon. Since the problem lacks convexity, we develop an
active elimination method that adaptively rules out suboptimal solutions.
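The censoring-circumvention observation above can be sketched concretely. As an illustration only, assume a random-capacity supply model in which ordering $q$ delivers $\min(q, C)$ for an unobserved capacity $C$; the function and variable names below are hypothetical and not from the paper. Under this model, the censored observation generated by a larger order $q^2$ pins down exactly what any smaller order $q^1 < q^2$ would have received:

```python
import random

def delivered(q, capacity):
    """Supply model (assumed for illustration): order q, receive min(q, capacity)."""
    return min(q, capacity)

def simulate_smaller_order(q_small, q_large, observed_delivery):
    """Reconstruct the delivery under q_small <= q_large from the censored
    observation min(q_large, capacity), without ever seeing capacity itself."""
    assert q_small <= q_large
    if observed_delivery < q_large:
        # The larger order was not fully filled, so capacity == observed_delivery.
        return min(q_small, observed_delivery)
    # The larger order was fully filled: capacity >= q_large >= q_small,
    # so the smaller order would also have been fully filled.
    return q_small

# Sanity check: the reconstruction matches a direct simulation for random capacities.
random.seed(0)
for _ in range(1000):
    cap = random.randint(0, 20)
    obs = delivered(15, cap)  # data generated under the larger order q^2 = 15
    assert simulate_smaller_order(7, 15, obs) == delivered(7, cap)
```

This captures only the supply side; per the abstract, handling demand censoring as well requires the high-probability coupling argument that compares policies at their steady states within a finite horizon.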
Related papers
- Demand Balancing in Primal-Dual Optimization for Blind Network Revenue Management [6.72809363581332]
This paper proposes a practically efficient algorithm with optimal theoretical regret which solves the classical network revenue management problem with unknown, nonparametric demand.
A key technical contribution is the so-called demand balancing, which pairs the primal solution (i.e., the price) in each time period with another price to offset the violation of slackness on resource inventory constraints.
arXiv Detail & Related papers (2024-04-06T01:39:51Z) - Best-of-Both-Worlds Algorithms for Linear Contextual Bandits [11.94312915280916]
We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits.
Our algorithms deliver near-optimal regret bounds in both the stochastic and adversarial regimes.
arXiv Detail & Related papers (2023-12-24T08:27:30Z) - A Doubly Robust Approach to Sparse Reinforcement Learning [19.68978899041642]
We propose a new regret-minimization algorithm for the episodic sparse linear Markov decision process (SMDP).
The regret of the proposed algorithm is $\tilde{O}(\sigma_{\min}^{-1} s_\star H \sqrt{N})$, where $\sigma_{\min}$ denotes the restrictive minimum eigenvalue of the average Gram matrix of feature vectors.
arXiv Detail & Related papers (2023-10-23T18:52:17Z) - Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs [17.62509045102346]
This paper considers the best policy identification problem in online Constrained Markov Decision Processes (CMDPs)
We are interested in algorithms that are model-free, have low regret, and identify an approximately optimal policy with a high probability.
Existing model-free algorithms for online CMDPs with sublinear regret and constraint violation do not provide any convergence guarantee to an optimal policy.
arXiv Detail & Related papers (2023-09-27T04:33:09Z) - Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z) - Estimating Optimal Policy Value in General Linear Contextual Bandits [50.008542459050155]
In many bandit problems, the maximal reward achievable by a policy is often unknown in advance.
We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable.
We present a more practical, computationally efficient algorithm that estimates a problem-dependent upper bound on $V^*$.
arXiv Detail & Related papers (2023-02-19T01:09:24Z) - Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic
Shortest Path [106.37656068276902]
We study the sample complexity of learning an $\epsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem.
We derive complexity bounds when the learner has access to a generative model.
We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_\star$.
arXiv Detail & Related papers (2022-10-10T18:34:32Z) - Nearly Optimal Policy Optimization with Stable at Any Time Guarantee [53.155554415415445]
The policy-based method of Shani et al. (2020) attains regret only $\tilde{O}(\sqrt{SAH^3K} + \sqrt{AH^4K})$, where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon, and $K$ is the number of episodes; there is a $\sqrt{SH}$ gap compared with the information-theoretic lower bound $\tilde{\Omega}(\sqrt{SAH
arXiv Detail & Related papers (2021-12-21T01:54:17Z) - List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee.
arXiv Detail & Related papers (2020-11-19T17:21:37Z) - Naive Exploration is Optimal for Online LQR [49.681825576239355]
We show that the optimal regret scales as $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state.
Our lower bounds rule out the possibility of a $\mathrm{poly}(\log T)$-regret algorithm, which had been
arXiv Detail & Related papers (2020-01-27T03:44:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.