Queueing Network Controls via Deep Reinforcement Learning
- URL: http://arxiv.org/abs/2008.01644v7
- Date: Tue, 14 Sep 2021 22:27:06 GMT
- Title: Queueing Network Controls via Deep Reinforcement Learning
- Authors: J. G. Dai and Mark Gluzman
- Abstract summary: We develop a Proximal policy optimization algorithm for queueing networks.
The algorithm consistently generates control policies that outperform state-of-arts in literature.
A key to the successes of our PPO algorithm is the use of three variance reduction techniques in estimating the relative value function.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Novel advanced policy gradient (APG) methods, such as Trust Region policy
optimization and Proximal policy optimization (PPO), have become the dominant
reinforcement learning algorithms because of their ease of implementation and
good practical performance. A conventional setup for notoriously difficult
queueing network control problems is a Markov decision problem (MDP) that has
three features: infinite state space, unbounded costs, and long-run average
cost objective. We extend the theoretical framework of these APG methods for
such MDP problems. The resulting PPO algorithm is tested on a parallel-server
system and large-size multiclass queueing networks. The algorithm consistently
generates control policies that outperform state-of-art heuristics in
literature in a variety of load conditions from light to heavy traffic. These
policies are demonstrated to be near-optimal when the optimal policy can be
computed.
A key to the successes of our PPO algorithm is the use of three variance
reduction techniques in estimating the relative value function via sampling.
First, we use a discounted relative value function as an approximation of the
relative value function. Second, we propose regenerative simulation to estimate
the discounted relative value function. Finally, we incorporate the
approximating martingale-process method into the regenerative estimator.
Related papers
- Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $mathcalO(sqrtT log(T)$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z) - A safe exploration approach to constrained Markov decision processes [7.036452261968767]
We consider discounted infinite horizon constrained Markov decision processes (CMDPs)
The goal is to find an optimal policy that maximizes the expected cumulative reward subject to expected cumulative constraints.
Motivated by the application of CMDPs in online learning of safety-critical systems, we focus on developing a model-free and simulator-free algorithm.
arXiv Detail & Related papers (2023-12-01T13:16:39Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Processing Network Controls via Deep Reinforcement Learning [0.0]
dissertation is concerned with theoretical justification and practical application of the advanced policy gradient algorithms.
Policy improvement bounds play a crucial role in the theoretical justification of the APG algorithms.
arXiv Detail & Related papers (2022-05-01T04:34:21Z) - A Policy Efficient Reduction Approach to Convex Constrained Deep
Reinforcement Learning [2.811714058940267]
We propose a new variant of the conditional gradient (CG) type algorithm, which generalizes the minimum norm point (MNP) method.
Our method reduces the memory costs by an order of magnitude, and achieves better performance, demonstrating both its effectiveness and efficiency.
arXiv Detail & Related papers (2021-08-29T20:51:32Z) - Near Optimal Policy Optimization via REPS [33.992374484681704]
emphrelative entropy policy search (REPS) has demonstrated successful policy learning on a number of simulated and real-world robotic domains.
There exist no guarantees on REPS's performance when using gradient-based solvers.
We introduce a technique that uses emphgenerative access to the underlying decision process to compute parameter updates that maintain favorable convergence to the optimal regularized policy.
arXiv Detail & Related papers (2021-03-17T16:22:59Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $mathcalS|$ and $frac11-gamma$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z) - Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Adaptive Approximate Policy Iteration [22.915651391812187]
We present a learning scheme which enjoys a $tildeO(T2/3)$ regret bound for undiscounted, continuing learning in uniformly ergodic MDPs.
This is an improvement over the best existing bound of $tildeO(T3/4)$ for the average-reward case with function approximation.
arXiv Detail & Related papers (2020-02-08T02:27:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.