ReLOAD: Reinforcement Learning with Optimistic Ascent-Descent for
Last-Iterate Convergence in Constrained MDPs
- URL: http://arxiv.org/abs/2302.01275v1
- Date: Thu, 2 Feb 2023 18:05:27 GMT
- Title: ReLOAD: Reinforcement Learning with Optimistic Ascent-Descent for
Last-Iterate Convergence in Constrained MDPs
- Authors: Ted Moskovitz, Brendan O'Donoghue, Vivek Veeriah, Sebastian
Flennerhag, Satinder Singh, Tom Zahavy
- Abstract summary: Reinforcement Learning (RL) has been applied to real-world problems with increasing success.
We introduce Reinforcement Learning with Optimistic Ascent-Descent (ReLOAD), a principled CRL method with guaranteed last-iterate convergence.
- Score: 31.663072540757643
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In recent years, Reinforcement Learning (RL) has been applied to real-world
problems with increasing success. Such applications often require placing
constraints on the agent's behavior. Existing algorithms for constrained RL
(CRL) rely on gradient descent-ascent, but this approach comes with a caveat.
While these algorithms are guaranteed to converge on average, they do not
guarantee last-iterate convergence, i.e., the current policy of the agent may
never converge to the optimal solution. In practice, it is often observed that
the policy alternates between satisfying the constraints and maximizing the
reward, rarely accomplishing both objectives simultaneously. Here, we address
this problem by introducing Reinforcement Learning with Optimistic
Ascent-Descent (ReLOAD), a principled CRL method with guaranteed last-iterate
convergence. We demonstrate its empirical effectiveness on a wide variety of
CRL problems including discrete MDPs and continuous control. In the process we
establish a benchmark of challenging CRL problems.
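For intuition about the failure mode described in the abstract: constrained RL is commonly cast as a Lagrangian saddle-point problem in which the policy ascends on reward plus multiplier-weighted constraint terms while the multipliers descend. The sketch below only illustrates optimistic versus plain gradient descent-ascent on a toy bilinear game, not the ReLOAD algorithm itself; the objective f(x, y) = x * y, the step size, and the iteration count are arbitrary choices made for the example.

```python
def grads(x, y):
    # Toy bilinear saddle-point objective f(x, y) = x * y:
    # the x-player descends on f, the y-player ascends.
    return y, x  # (df/dx, df/dy)

def run(optimistic, eta=0.1, steps=2000):
    x, y = 1.0, 1.0
    gx_prev, gy_prev = grads(x, y)
    for _ in range(steps):
        gx, gy = grads(x, y)
        if optimistic:
            # Optimistic update: step along 2*g_t - g_{t-1}, i.e. extrapolate
            # with the most recent gradient and subtract the previous one.
            x -= eta * (2 * gx - gx_prev)
            y += eta * (2 * gy - gy_prev)
        else:
            # Plain simultaneous gradient descent-ascent.
            x -= eta * gx
            y += eta * gy
        gx_prev, gy_prev = gx, gy
    return x, y

print("plain GDA last iterate:     ", run(optimistic=False))  # spirals away from the saddle (0, 0)
print("optimistic GDA last iterate:", run(optimistic=True))   # converges to the saddle (0, 0)
```

On this toy game the plain iterates grow in norm by a factor of sqrt(1 + eta^2) per step, so the last iterate never settles, whereas the optimistic correction damps the rotation and drives the last iterate to the saddle point; this is the kind of mechanism optimistic methods exploit to obtain last-iterate guarantees in the saddle-point formulation of CRL.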
Related papers
- Robust off-policy Reinforcement Learning via Soft Constrained Adversary [0.7583052519127079]
We introduce an f-divergence constrained problem with the prior knowledge distribution.
We derive two typical attacks and their corresponding robust learning frameworks.
Results demonstrate that our proposed methods achieve excellent performance in sample-efficient off-policy RL.
arXiv Detail & Related papers (2024-08-31T11:13:33Z)
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
- A Connection between One-Step Regularization and Critic Regularization in Reinforcement Learning [163.44116192806922]
One-step methods perform regularization by doing just a single step of policy improvement.
Critic regularization methods do many steps of policy improvement with a regularized objective.
Applying one iteration of a multi-step critic regularization method with a regularization coefficient of 1 yields the same policy as one-step RL.
arXiv Detail & Related papers (2023-07-24T17:46:32Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
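As background for the double-sampling point above (this is the standard Fenchel-dual identity for the square function, written here as a sketch rather than the paper's exact derivation): the variance penalty contains the square of an expectation, and an unbiased gradient of that square would require two independent samples. The conjugate form trades the square for a scalar dual variable $\nu$:
\[
\operatorname{Var}(x) = \mathbb{E}[x^2] - (\mathbb{E}[x])^2,
\qquad
(\mathbb{E}[x])^2 = \max_{\nu \in \mathbb{R}} \bigl( 2\nu\,\mathbb{E}[x] - \nu^2 \bigr),
\]
so the troublesome term becomes linear in $\mathbb{E}[x]$ and a single sample per update yields an unbiased gradient.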
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Optimal Conservative Offline RL with General Function Approximation via Augmented Lagrangian [18.2080757218886]
Offline reinforcement learning (RL) refers to decision-making from a previously-collected dataset of interactions.
We present the first set of offline RL algorithms that are statistically optimal and practical under general function approximation and single-policy concentrability.
arXiv Detail & Related papers (2022-11-01T19:28:48Z)
- A Policy Efficient Reduction Approach to Convex Constrained Deep Reinforcement Learning [2.811714058940267]
We propose a new variant of the conditional gradient (CG) type algorithm, which generalizes the minimum norm point (MNP) method.
Our method reduces the memory costs by an order of magnitude, and achieves better performance, demonstrating both its effectiveness and efficiency.
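For orientation only, and not the paper's MNP-style variant: a vanilla conditional-gradient (Frank-Wolfe) step replaces projection with a linear minimization oracle, so on a probability simplex the iterate remains a convex combination of a small number of vertices, which is the usual source of this family's memory savings. A minimal sketch, with an arbitrary quadratic objective chosen purely for illustration:

```python
import numpy as np

def frank_wolfe_simplex(grad_f, x0, steps=200):
    """Vanilla conditional-gradient (Frank-Wolfe) loop on the probability simplex."""
    x = x0.copy()
    for t in range(steps):
        g = grad_f(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0              # linear minimization oracle: best simplex vertex
        gamma = 2.0 / (t + 2.0)            # standard open-loop step size
        x = (1.0 - gamma) * x + gamma * s  # convex combination keeps the iterate feasible
    return x

# Example: Euclidean projection of a point onto the simplex, i.e. minimize 0.5 * ||x - y||^2.
y = np.array([0.2, 1.3, -0.4])
print(frank_wolfe_simplex(lambda x: x - y, x0=np.ones(3) / 3))
```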
arXiv Detail & Related papers (2021-08-29T20:51:32Z)
- Combining Pessimism with Optimism for Robust and Efficient Model-Based Deep Reinforcement Learning [56.17667147101263]
In real-world tasks, reinforcement learning agents encounter situations that are not present during training time.
To ensure reliable performance, the RL agents need to exhibit robustness against worst-case situations.
We propose the Robust Hallucinated Upper-Confidence RL (RH-UCRL) algorithm to provably solve this problem.
arXiv Detail & Related papers (2021-03-18T16:50:17Z)
- CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first finite-time analysis of SRL algorithms with a global optimality guarantee.
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
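For reference on the CQL entry above: the conservative term is commonly presented in the following CQL($\mathcal{H}$) form, reproduced here as a sketch from the standard presentation rather than as the paper's exact objective,
\[
\min_{Q} \;
\alpha \Bigl( \mathbb{E}_{s \sim \mathcal{D}} \bigl[ \log \textstyle\sum_{a} \exp Q(s, a) \bigr]
- \mathbb{E}_{(s, a) \sim \mathcal{D}} \bigl[ Q(s, a) \bigr] \Bigr)
+ \tfrac{1}{2} \, \mathbb{E}_{(s, a, s') \sim \mathcal{D}} \Bigl[ \bigl( Q(s, a) - \hat{\mathcal{B}}^{\pi} \hat{Q}(s, a) \bigr)^{2} \Bigr],
\]
where $\mathcal{D}$ is the offline dataset, $\alpha$ the conservatism coefficient, and $\hat{\mathcal{B}}^{\pi}$ an empirical Bellman operator: pushing $Q$ down on out-of-distribution actions (via the log-sum-exp) and up on dataset actions is what yields the lower bound on the learned policy's value.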