Deterministic Policies for Constrained Reinforcement Learning in Polynomial-Time
- URL: http://arxiv.org/abs/2405.14183v1
- Date: Thu, 23 May 2024 05:27:51 GMT
- Title: Deterministic Policies for Constrained Reinforcement Learning in Polynomial-Time
- Authors: Jeremy McMahan,
- Abstract summary: We present a novel algorithm that efficiently computes near-optimal deterministic policies for constrained reinforcement learning problems.
Our work not only provides provably efficient algorithms to address real-world challenges in decision-making but also offers a unifying theory for the efficient computation of constrained deterministic policies.
- Score: 1.223779595809275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a novel algorithm that efficiently computes near-optimal deterministic policies for constrained reinforcement learning (CRL) problems. Our approach combines three key ideas: (1) value-demand augmentation, (2) action-space approximate dynamic programming, and (3) time-space rounding. Under mild reward assumptions, our algorithm constitutes a fully polynomial-time approximation scheme (FPTAS) for a diverse class of cost criteria. This class requires that the cost of a policy can be computed recursively over both time and (state) space, which includes classical expectation, almost sure, and anytime constraints. Our work not only provides provably efficient algorithms to address real-world challenges in decision-making but also offers a unifying theory for the efficient computation of constrained deterministic policies.
Related papers
- e-COP : Episodic Constrained Optimization of Policies [12.854752753529151]
We present the first policy optimization algorithm for constrained Reinforcement Learning (RL) in episodic (finite horizon) settings.
We show that our algorithm has similar or better performance than SoTA (non-episodic) algorithms adapted for the episodic setting.
arXiv Detail & Related papers (2024-06-13T20:12:09Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Penalized Proximal Policy Optimization for Safe Reinforcement Learning [68.86485583981866]
We propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem.
P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective.
We show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotive tasks.
arXiv Detail & Related papers (2022-05-24T06:15:51Z) - A Prescriptive Dirichlet Power Allocation Policy with Deep Reinforcement
Learning [6.003234406806134]
In this work, we propose the Dirichlet policy for continuous allocation tasks and analyze the bias and variance of its policy gradients.
We demonstrate that the Dirichlet policy is bias-free and provides significantly faster convergence and better performance than the Gaussian-softmax policy.
The experimental results show the potential to prescribe optimal operation, improve the efficiency and sustainability of multi-power source systems.
arXiv Detail & Related papers (2022-01-20T20:41:04Z) - A Reinforcement Learning Approach to the Stochastic Cutting Stock
Problem [0.0]
We propose a formulation of the cutting stock problem as a discounted infinite-horizon decision process.
An optimal solution corresponds to a policy that associates each state with a decision and minimizes the expected total cost.
arXiv Detail & Related papers (2021-09-20T14:47:54Z) - A Policy Efficient Reduction Approach to Convex Constrained Deep
Reinforcement Learning [2.811714058940267]
We propose a new variant of the conditional gradient (CG) type algorithm, which generalizes the minimum norm point (MNP) method.
Our method reduces the memory costs by an order of magnitude, and achieves better performance, demonstrating both its effectiveness and efficiency.
arXiv Detail & Related papers (2021-08-29T20:51:32Z) - Implicitly Regularized RL with Implicit Q-Values [42.87920755961722]
The $Q$-function is a central quantity in many Reinforcement Learning (RL) algorithms for which RL agents behave following a (soft)-greedy policy.
We propose to parametrize the $Q$-function implicitly, as the sum of a log-policy and of a value function.
We derive a practical off-policy deep RL algorithm, suitable for large action spaces, and that enforces the softmax relation between the policy and the $Q$-value.
arXiv Detail & Related papers (2021-08-16T12:20:47Z) - Uniform-PAC Bounds for Reinforcement Learning with Linear Function
Approximation [92.3161051419884]
We study reinforcement learning with linear function approximation.
Existing algorithms only have high-probability regret and/or Probably Approximately Correct (PAC) sample complexity guarantees.
We propose a new algorithm called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with high probability.
arXiv Detail & Related papers (2021-06-22T08:48:56Z) - Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z) - CRPO: A New Approach for Safe Reinforcement Learning with Convergence
Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first-time analysis of SRL algorithms with global optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z) - Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.