A Single-Loop Deep Actor-Critic Algorithm for Constrained Reinforcement
Learning with Provable Convergence
- URL: http://arxiv.org/abs/2306.06402v1
- Date: Sat, 10 Jun 2023 10:04:54 GMT
- Title: A Single-Loop Deep Actor-Critic Algorithm for Constrained Reinforcement
Learning with Provable Convergence
- Authors: Kexuan Wang, An Liu, and Baishuo Liu
- Abstract summary: Deep Actor-Critic algorithms combine Actor-Critic with deep neural networks (DNNs).
In this paper, we propose a single-loop deep Actor-Critic (SLDAC) algorithmic framework for general constrained reinforcement learning problems.
We show that the SLDAC algorithm converges to a KKT point of the original problem and achieves superior performance with much lower interaction cost.
- Score: 8.191815417711194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Actor-Critic algorithms, which combine Actor-Critic with
deep neural networks (DNNs), have been among the most prevalent reinforcement
learning algorithms for decision-making problems in simulated environments.
However, the existing deep Actor-Critic algorithms are still not mature enough to
solve realistic problems with non-convex stochastic constraints and high cost
to interact with the environment. In this paper, we propose a single-loop deep
Actor-Critic (SLDAC) algorithmic framework for general constrained
reinforcement learning (CRL) problems. In the actor step, the constrained
stochastic successive convex approximation (CSSCA) method is applied to handle
the non-convex stochastic objective and constraints. In the critic step, the
critic DNNs are only updated once or a few finite times for each iteration,
which simplifies the algorithm to a single-loop framework (the existing works
require a sufficient number of updates for the critic step to ensure a good
enough convergence of the inner loop for each iteration). Moreover, the
variance of the policy gradient estimation is reduced by reusing observations
from the old policy. The single-loop design and the observation reuse
effectively reduce the agent-environment interaction cost and computational
complexity. In spite of the biased policy gradient estimation incurred by the
single-loop design and observation reuse, we prove that the SLDAC with a
feasible initial point can converge to a Karush-Kuhn-Tucker (KKT) point of the
original problem almost surely. Simulations show that the SLDAC algorithm can
achieve superior performance with much lower interaction cost.
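The abstract outlines a three-part loop: reuse observations from the old policy to reduce gradient variance, update the critic DNN only once (or a few finite times) per iteration, and take a CSSCA actor step on a convex surrogate of the non-convex objective and constraints. The sketch below is a minimal, hypothetical rendering of that single-loop structure in PyTorch; the environment samples, returns, network sizes, and surrogate loss are placeholders, and the paper's actual CSSCA surrogate construction is stubbed with a plain gradient step.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Small value-estimating DNN; updated once per outer iteration."""
    def __init__(self, obs_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, obs):
        return self.net(obs)

obs_dim, act_dim = 4, 2                                 # hypothetical dimensions
critic = Critic(obs_dim)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
policy_mean = torch.zeros(act_dim, requires_grad=True)  # toy policy parameter
replay = []                                             # observations reused across iterations

for t in range(100):
    # 1) Interact: draw a few fresh observations under the current policy
    #    (placeholder random data standing in for a real environment).
    obs, ret = torch.randn(8, obs_dim), torch.randn(8, 1)
    replay = (replay + [(obs, ret)])[-10:]              # keep a window of old observations

    # 2) Critic step: a SINGLE update per iteration; there is no inner loop
    #    that runs the critic to convergence, which makes it single-loop.
    old_obs = torch.cat([o for o, _ in replay])
    old_ret = torch.cat([r for _, r in replay])
    critic_opt.zero_grad()
    td_loss = ((critic(old_obs) - old_ret) ** 2).mean()
    td_loss.backward()
    critic_opt.step()

    # 3) Actor step: SLDAC solves a convex surrogate of the non-convex
    #    objective and constraints (CSSCA); stubbed here as one gradient
    #    step on a placeholder surrogate built from reused observations.
    surrogate = -critic(old_obs).detach().mean() * policy_mean.sum()
    grad, = torch.autograd.grad(surrogate, policy_mean)
    with torch.no_grad():
        policy_mean -= 0.01 * grad
```

Reusing old observations and stopping the critic after one update both bias the policy gradient estimate; the paper's contribution is proving almost-sure convergence to a KKT point despite that bias, which this sketch does not attempt to reproduce.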
Related papers
- Finite-Time Convergence and Sample Complexity of Actor-Critic Multi-Objective Reinforcement Learning [20.491176017183044]
This paper tackles the multi-objective reinforcement learning (MORL) problem.
It introduces an innovative actor-critic algorithm named MOAC which finds a policy by iteratively making trade-offs among conflicting reward signals.
arXiv Detail & Related papers (2024-05-05T23:52:57Z)
- On the Global Convergence of Natural Actor-Critic with Two-layer Neural Network Parametrization [38.32265770020665]
We study a natural actor-critic algorithm that utilizes neural networks to represent the critic.
Our aim is to establish sample complexity guarantees for this algorithm, achieving a deeper understanding of its performance characteristics.
arXiv Detail & Related papers (2023-06-18T06:22:04Z)
- Solving Continuous Control via Q-learning [54.05120662838286]
We show that a simple modification of deep Q-learning largely alleviates issues with actor-critic methods.
By combining bang-bang action discretization with value decomposition, framing single-agent control as cooperative multi-agent reinforcement learning (MARL), this simple critic-only approach matches the performance of state-of-the-art continuous actor-critic methods.
arXiv Detail & Related papers (2022-10-22T22:55:50Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- COCO Denoiser: Using Co-Coercivity for Variance Reduction in Stochastic Convex Optimization [4.970364068620608]
We exploit convexity and L-smoothness to improve the noisy gradient estimates produced by the gradient oracle.
We show that increasing the number and proximity of the queried points leads to better gradient estimates.
We also apply COCO in vanilla settings by plugging it into existing algorithms, such as SGD, Adam or STRSAGA.
arXiv Detail & Related papers (2021-09-07T17:21:09Z)
- Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality [131.45028999325797]
We develop a doubly robust off-policy AC (DR-Off-PAC) for discounted MDP.
DR-Off-PAC adopts a single timescale structure, in which both actor and critics are updated simultaneously with constant stepsize.
We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal policy.
arXiv Detail & Related papers (2021-02-23T18:56:13Z)
- Exact Asymptotics for Linear Quadratic Adaptive Control [6.287145010885044]
We study the simplest non-bandit reinforcement learning problem: linear quadratic adaptive control (LQAC).
We derive expressions for the regret, estimation error, and prediction error of a stepwise-updating LQAC algorithm.
In simulations on both stable and unstable systems, we find that our theory also describes the algorithm's finite-sample behavior remarkably well.
arXiv Detail & Related papers (2020-11-02T22:43:30Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
- Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in the form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
- Combining Deep Learning and Optimization for Security-Constrained Optimal Power Flow [94.24763814458686]
Security-constrained optimal power flow (SCOPF) is fundamental in power systems.
Modeling of automatic primary response (APR) within the SCOPF problem results in complex large-scale mixed-integer programs.
This paper proposes a novel approach that combines deep learning and robust optimization techniques.
arXiv Detail & Related papers (2020-07-14T12:38:21Z)