Semi-Infinitely Constrained Markov Decision Processes and Efficient
Reinforcement Learning
- URL: http://arxiv.org/abs/2305.00254v1
- Date: Sat, 29 Apr 2023 12:52:38 GMT
- Title: Semi-Infinitely Constrained Markov Decision Processes and Efficient
Reinforcement Learning
- Authors: Liangyu Zhang, Yang Peng, Wenhao Yang and Zhihua Zhang
- Abstract summary: We consider a continuum of constraints instead of a finite number of constraints as in the case of ordinary CMDPs.
We devise two reinforcement learning algorithms for SICMDPs that we call SI-CRL and SI-CPO.
To the best of our knowledge, we are the first to apply tools from semi-infinitely programming (SIP) to solve constrained reinforcement learning problems.
- Score: 17.04643707688075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel generalization of constrained Markov decision processes
(CMDPs) that we call the \emph{semi-infinitely constrained Markov decision
process} (SICMDP). Particularly, we consider a continuum of constraints instead
of a finite number of constraints as in the case of ordinary CMDPs. We also
devise two reinforcement learning algorithms for SICMDPs that we call SI-CRL
and SI-CPO. SI-CRL is a model-based reinforcement learning algorithm. Given an
estimate of the transition model, we first transform the reinforcement learning
problem into a linear semi-infinitely programming (LSIP) problem and then use
the dual exchange method in the LSIP literature to solve it. SI-CPO is a policy
optimization algorithm. Borrowing the ideas from the cooperative stochastic
approximation approach, we make alternative updates to the policy parameters to
maximize the reward or minimize the cost. To the best of our knowledge, we are
the first to apply tools from semi-infinitely programming (SIP) to solve
constrained reinforcement learning problems. We present theoretical analysis
for SI-CRL and SI-CPO, identifying their iteration complexity and sample
complexity. We also conduct extensive numerical examples to illustrate the
SICMDP model and demonstrate that our proposed algorithms are able to solve
complex sequential decision-making tasks leveraging modern deep reinforcement
learning techniques.
Related papers
- Q-learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis [30.713243690224207]
In Markov decision processes (MDPs), quantile risk measures such as Value-at-Risk are a standard metric for modeling RL agents' preferences for certain outcomes.
This paper proposes a new Q-learning algorithm for quantile optimization in MDPs with strong convergence and performance guarantees.
arXiv Detail & Related papers (2024-10-31T16:53:20Z) - Two-Stage ML-Guided Decision Rules for Sequential Decision Making under Uncertainty [55.06411438416805]
Sequential Decision Making under Uncertainty (SDMU) is ubiquitous in many domains such as energy, finance, and supply chains.
Some SDMU are naturally modeled as Multistage Problems (MSPs) but the resulting optimizations are notoriously challenging from a computational standpoint.
This paper introduces a novel approach Two-Stage General Decision Rules (TS-GDR) to generalize the policy space beyond linear functions.
The effectiveness of TS-GDR is demonstrated through an instantiation using Deep Recurrent Neural Networks named Two-Stage Deep Decision Rules (TS-LDR)
arXiv Detail & Related papers (2024-05-23T18:19:47Z) - Multi-Timescale Ensemble Q-learning for Markov Decision Process Policy
Optimization [21.30645601474163]
Original Q-learning suffers from performance and complexity challenges across very large networks.
New model-free ensemble reinforcement learning algorithm which adapts the classical Q-learning is proposed to handle these challenges.
Numerical results show that the proposed algorithm can achieve up to 55% less average policy error with up to 50% less runtime complexity.
arXiv Detail & Related papers (2024-02-08T08:08:23Z) - Machine Learning Insides OptVerse AI Solver: Design Principles and
Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror multifaceted structures of real-world problem.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z) - Non-stationary Reinforcement Learning under General Function
Approximation [60.430936031067006]
We first propose a new complexity metric called dynamic Bellman Eluder (DBE) dimension for non-stationary MDPs.
Based on the proposed complexity metric, we propose a novel confidence-set based model-free algorithm called SW-OPEA.
We show that SW-OPEA is provably efficient as long as the variation budget is not significantly large.
arXiv Detail & Related papers (2023-06-01T16:19:37Z) - MARLIN: Soft Actor-Critic based Reinforcement Learning for Congestion
Control in Real Networks [63.24965775030673]
We propose a novel Reinforcement Learning (RL) approach to design generic Congestion Control (CC) algorithms.
Our solution, MARLIN, uses the Soft Actor-Critic algorithm to maximize both entropy and return.
We trained MARLIN on a real network with varying background traffic patterns to overcome the sim-to-real mismatch.
arXiv Detail & Related papers (2023-02-02T18:27:20Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new algorithm for a policy gradient in TMDPs by a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs [24.256960622176305]
We propose the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in episodic Markov decision processes.
Our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap.
Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts.
arXiv Detail & Related papers (2022-03-17T11:19:41Z) - Learning with Safety Constraints: Sample Complexity of Reinforcement
Learning for Constrained MDPs [13.922754427601491]
We characterize the relationship between safety constraints and the number of samples needed to ensure a desired level of accuracy.
Our main finding is that compared to the best known bounds of the unconstrained regime, the sample of constrained RL algorithms are increased by a factor that is logarithmic in the number of constraints.
arXiv Detail & Related papers (2020-08-01T18:17:08Z) - Constrained Combinatorial Optimization with Reinforcement Learning [0.30938904602244344]
This paper presents a framework to tackle constrained optimization problems using deep Reinforcement Learning (RL)
We extend the Neural Combinatorial Optimization (NCO) theory in order to deal with constraints in its formulation.
In that context, the solution is iteratively constructed based on interactions with the environment.
arXiv Detail & Related papers (2020-06-22T03:13:07Z) - Iterative Algorithm Induced Deep-Unfolding Neural Networks: Precoding
Design for Multiuser MIMO Systems [59.804810122136345]
We propose a framework for deep-unfolding, where a general form of iterative algorithm induced deep-unfolding neural network (IAIDNN) is developed.
An efficient IAIDNN based on the structure of the classic weighted minimum mean-square error (WMMSE) iterative algorithm is developed.
We show that the proposed IAIDNN efficiently achieves the performance of the iterative WMMSE algorithm with reduced computational complexity.
arXiv Detail & Related papers (2020-06-15T02:57:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.