Online Resource Allocation in Episodic Markov Decision Processes
- URL: http://arxiv.org/abs/2305.10744v3
- Date: Thu, 19 Oct 2023 00:50:39 GMT
- Title: Online Resource Allocation in Episodic Markov Decision Processes
- Authors: Duksang Lee, William Overman, Dabeen Lee
- Abstract summary: We formulate the problem as an online allocation problem in an episodic finite-horizon constrained Markov decision process.
We propose the observe-then-decide regime and improve on the existing decide-then-observe regime.
We show that the regret against the static optimal policy that has access to the mean reward and mean resource consumption functions is bounded by $\tilde{O}(\rho^{-1}H^{3/2}S\sqrt{AT})$ with high probability.
- Score: 1.8416014644193066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies a long-term resource allocation problem over multiple
periods where each period requires a multi-stage decision-making process. We
formulate the problem as an online allocation problem in an episodic
finite-horizon constrained Markov decision process with an unknown
non-stationary transition function and stochastic non-stationary reward and
resource consumption functions. We propose the observe-then-decide regime and
improve on the existing decide-then-observe regime; the two settings differ
in how the observations and feedback about the reward and resource consumption
functions are given to the decision-maker. We develop an online dual mirror
descent algorithm that achieves near-optimal regret bounds for both settings.
For the observe-then-decide regime, we prove that the expected regret against
the dynamic clairvoyant optimal policy is bounded by $\tilde
O(\rho^{-1}{H^{3/2}}S\sqrt{AT})$ where $\rho\in(0,1)$ is the budget parameter,
$H$ is the length of the horizon, $S$ and $A$ are the numbers of states and
actions, and $T$ is the number of episodes. For the decide-then-observe regime,
we show that the regret against the static optimal policy that has access to
the mean reward and mean resource consumption functions is bounded by $\tilde
O(\rho^{-1}{H^{3/2}}S\sqrt{AT})$ with high probability. We test the numerical
efficiency of our method for a variant of the resource-constrained inventory
management problem.
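The online dual mirror descent algorithm described above prices resource consumption with a dual variable and updates that price after every episode. The block below is a minimal sketch of this style of loop for a single resource with budget parameter $\rho$, not the paper's exact procedure: the Euclidean mirror map, the $1/\rho$ cap on the dual variable, the per-episode budget normalization $\rho H$, and the helpers `plan_greedy`, `estimates`, and `env.run_episode` are all illustrative assumptions.

```python
import numpy as np

# Hedged sketch of an online dual mirror descent loop for a budget-constrained
# episodic MDP (tabular, single resource). All helper names and the budget
# normalization are illustrative assumptions, not the paper's algorithm.

def plan_greedy(r_hat, c_hat, P_hat, lam, H, S, A):
    """Backward induction on an estimated model with Lagrangian reward r - lam * c."""
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        # Q[s, a] = adjusted stage reward + expected next-stage value under P_hat.
        Q = (r_hat[h] - lam * c_hat[h]) + P_hat[h] @ V  # P_hat[h] has shape (S, A, S)
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

def online_dual_mirror_descent(env, estimates, T, H, S, A, rho, eta):
    """Run T episodes, pricing resource use with a scalar dual variable lam."""
    lam = 0.0
    for _ in range(T):
        r_hat, c_hat, P_hat = estimates()      # current empirical estimates (assumed provided)
        policy = plan_greedy(r_hat, c_hat, P_hat, lam, H, S, A)
        consumed = env.run_episode(policy)     # realized resource consumption this episode
        # Dual (sub)gradient: consumption in excess of an assumed per-episode budget rho * H.
        grad = consumed - rho * H
        # With the Euclidean mirror map, the mirror descent step is projected gradient
        # ascent, with the dual variable kept in [0, 1/rho].
        lam = min(max(0.0, lam + eta * grad), 1.0 / rho)
    return lam
```

Swapping the Euclidean projection for an entropic mirror map would turn the dual step into a multiplicative update; either choice keeps the per-episode work at a single backward-induction planning pass.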
Related papers
- Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space [0.0]
We study the problem of optimal control of a family of discrete-time countable state-space Markov Decision Processes.
We propose an algorithm based on Thompson sampling with dynamically-sized episodes.
We show that our algorithm can be applied to develop approximately optimal control algorithms.
arXiv Detail & Related papers (2023-06-05T03:57:16Z)
- Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}(d\sqrt{\sum_{k=1}^{K}\sigma_k^2} + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z)
- Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z)
- Autoregressive Bandits [58.46584210388307]
We propose a novel online learning setting, Autoregressive Bandits, in which the observed reward is governed by an autoregressive process of order $k$.
We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed.
We then devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}}\big((k+1)^{3/2}\sqrt{nT}/(1-\Gamma\ldots$
arXiv Detail & Related papers (2022-12-12T21:37:36Z)
- Double Thompson Sampling in Finite stochastic Games [10.559241770674724]
We consider the trade-off between exploration and exploitation in a finite discounted Markov Decision Process.
We propose a double Thompson sampling reinforcement learning algorithm (DTS) to solve this kind of problem.
arXiv Detail & Related papers (2022-02-21T06:11:51Z)
- Dynamic Planning and Learning under Recovering Rewards [18.829837839926956]
We propose, construct, and prove performance guarantees for a class of "Purely Periodic Policies."
Our framework and policy design may have the potential to be adapted into other offline planning and online learning applications.
arXiv Detail & Related papers (2021-06-28T15:40:07Z)
- Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Stochastic Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde{O}(B_\star \sqrt{|S| |A| K})$ where $B_\star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z)
- Online Stochastic Optimization with Wasserstein Based Non-stationarity [12.91020811577007]
We consider a general online optimization problem with multiple budget constraints over a horizon of finite time periods.
The objective of the decision maker is to maximize the cumulative reward subject to the budget constraints.
This formulation captures a wide range of applications including online linear programming and network revenue management.
arXiv Detail & Related papers (2020-12-13T04:47:37Z)
- Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
- Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting; a generic primal-dual sketch of this style of update follows this list.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
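The last two entries describe primal-dual methods for constrained MDPs, which pair a policy (primal) update with a Lagrange multiplier (dual) update. As a rough illustration of that pattern only, here is a generic tabular sketch; the softmax parameterization, the step sizes, and the quantities returned by the hypothetical `rollout` helper are assumptions and do not reproduce either cited algorithm.

```python
import numpy as np

# Generic primal-dual update for a CMDP with one cost constraint. The softmax
# parameterization and the `rollout` estimator are hypothetical placeholders,
# not the cited papers' algorithms.

def softmax_policy(theta):
    """Row-wise softmax over actions; theta has shape (num_states, num_actions)."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def primal_dual_step(theta, lam, rollout, budget, alpha=0.1, beta=0.05):
    """One iteration: policy-gradient ascent on reward - lam * cost, dual ascent on the violation."""
    pi = softmax_policy(theta)
    # `rollout` is assumed to return per-(state, action) advantage estimates for the
    # reward and the cost, a state-visitation weight vector, and the realized total cost.
    adv_r, adv_c, d_visit, total_cost = rollout(pi)
    adv_lagr = adv_r - lam * adv_c                            # Lagrangian advantage
    # Softmax policy gradient: d(s) * pi(a|s) * (A(s, a) - sum_a' pi(a'|s) A(s, a')).
    baseline = (pi * adv_lagr).sum(axis=1, keepdims=True)
    theta = theta + alpha * d_visit[:, None] * pi * (adv_lagr - baseline)
    # Dual ascent: raise the price lam whenever the cost budget is exceeded.
    lam = max(0.0, lam + beta * (total_cost - budget))
    return theta, lam
```

The same price-then-penalize structure underlies the dual mirror descent sketch given after the abstract above; the two differ mainly in whether the primal step is a full planning pass or a gradient step on policy parameters.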