Imitate the Good and Avoid the Bad: An Incremental Approach to Safe
Reinforcement Learning
- URL: http://arxiv.org/abs/2312.10385v3
- Date: Wed, 13 Mar 2024 14:48:36 GMT
- Title: Imitate the Good and Avoid the Bad: An Incremental Approach to Safe
Reinforcement Learning
- Authors: Huy Hoang and Tien Mai and Pradeep Varakantham
- Abstract summary: Constrained RL is a framework for enforcing safe actions in Reinforcement Learning.
Most recent approaches for solving Constrained RL convert the trajectory based cost constraint into a surrogate problem.
We present an approach that does not modify the trajectory based cost constraint and instead imitates good'' trajectories.
- Score: 13.112202426665466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A popular framework for enforcing safe actions in Reinforcement Learning (RL)
is Constrained RL, where trajectory based constraints on expected cost (or
other cost measures) are employed to enforce safety and more importantly these
constraints are enforced while maximizing expected reward. Most recent
approaches for solving Constrained RL convert the trajectory based cost
constraint into a surrogate problem that can be solved using minor
modifications to RL methods. A key drawback with such approaches is an over or
underestimation of the cost constraint at each state. Therefore, we provide an
approach that does not modify the trajectory based cost constraint and instead
imitates ``good'' trajectories and avoids ``bad'' trajectories generated from
incrementally improving policies. We employ an oracle that utilizes a reward
threshold (which is varied with learning) and the overall cost constraint to
label trajectories as ``good'' or ``bad''. A key advantage of our approach is
that we are able to work from any starting policy or set of trajectories and
improve on it. In an exhaustive set of experiments, we demonstrate that our
approach is able to outperform top benchmark approaches for solving Constrained
RL problems, with respect to expected cost, CVaR cost, or even unknown cost
constraints.
Related papers
- Switching the Loss Reduces the Cost in Batch (Offline) Reinforcement Learning [57.154674117714265]
We show that the number of samples needed to learn a near-optimal policy with FQI-log scales with the accumulated cost of the optimal policy.
We empirically verify that FQI-log uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.
arXiv Detail & Related papers (2024-03-08T15:30:58Z) - Off-Policy Primal-Dual Safe Reinforcement Learning [16.918188277722503]
We show that the error in cumulative cost estimation causes significant underestimation of cost when using off-policy methods.
We propose conservative policy optimization, which learns a policy in a constraint-satisfying area by considering the uncertainty in estimation.
We then introduce local policy convexification to help eliminate such suboptimality by gradually reducing the estimation uncertainty.
arXiv Detail & Related papers (2024-01-26T10:33:38Z) - A Multiplicative Value Function for Safe and Efficient Reinforcement
Learning [131.96501469927733]
We propose a safe model-free RL algorithm with a novel multiplicative value function consisting of a safety critic and a reward critic.
The safety critic predicts the probability of constraint violation and discounts the reward critic that only estimates constraint-free returns.
We evaluate our method in four safety-focused environments, including classical RL benchmarks augmented with safety constraints and robot navigation tasks with images and raw Lidar scans as observations.
arXiv Detail & Related papers (2023-03-07T18:29:15Z) - Handling Long and Richly Constrained Tasks through Constrained
Hierarchical Reinforcement Learning [20.280636126917614]
Safety in goal directed Reinforcement Learning (RL) settings has typically been handled through constraints over trajectories.
We propose a (safety) Constrained Search with Hierarchical Reinforcement Learning (CoSHRL) mechanism that combines an upper level constrained search agent with a low-level goal conditioned RL agent.
A major advantage of CoSHRL is that it can handle constraints on the cost value distribution and can adjust to flexible constraint thresholds without retraining.
arXiv Detail & Related papers (2023-02-21T12:57:12Z) - Solving Richly Constrained Reinforcement Learning through State
Augmentation and Reward Penalties [8.86470998648085]
Key challenge is handling expected cost accumulated using the policy.
Existing methods have developed innovative ways of converting this cost constraint over entire policy to constraints over local decisions.
We provide an equivalent unconstrained formulation to constrained RL that has an augmented state space and reward penalties.
arXiv Detail & Related papers (2023-01-27T08:33:08Z) - AutoCost: Evolving Intrinsic Cost for Zero-violation Reinforcement
Learning [3.4806267677524896]
We propose AutoCost, a framework that automatically searches for cost functions that help constrained RL to achieve zero-violation performance.
We compare the performance of augmented agents that use our cost function to provide additive intrinsic costs with baseline agents that use the same policy learners but with only extrinsic costs.
arXiv Detail & Related papers (2023-01-24T22:51:29Z) - Penalized Proximal Policy Optimization for Safe Reinforcement Learning [68.86485583981866]
We propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem.
P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective.
We show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotive tasks.
arXiv Detail & Related papers (2022-05-24T06:15:51Z) - COptiDICE: Offline Constrained Reinforcement Learning via Stationary
Distribution Correction Estimation [73.17078343706909]
offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimize the policy in the space of the stationary distribution.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
arXiv Detail & Related papers (2022-04-19T15:55:47Z) - Shortest-Path Constrained Reinforcement Learning for Sparse Reward Tasks [59.419152768018506]
We show that any optimal policy necessarily satisfies the k-SP constraint.
We propose a novel cost function that penalizes the policy violating SP constraint, instead of completely excluding it.
Our experiments on MiniGrid, DeepMind Lab, Atari, and Fetch show that the proposed method significantly improves proximal policy optimization (PPO)
arXiv Detail & Related papers (2021-07-13T21:39:21Z) - Constrained Markov Decision Processes via Backward Value Functions [43.649330976089004]
We model the problem of learning with constraints as a Constrained Markov Decision Process.
A key contribution of our approach is to translate cumulative cost constraints into state-based constraints.
We provide theoretical guarantees under which the agent converges while ensuring safety over the course of training.
arXiv Detail & Related papers (2020-08-26T20:56:16Z) - Guided Constrained Policy Optimization for Dynamic Quadrupedal Robot
Locomotion [78.46388769788405]
We introduce guided constrained policy optimization (GCPO), an RL framework based upon our implementation of constrained policy optimization (CPPO)
We show that guided constrained RL offers faster convergence close to the desired optimum resulting in an optimal, yet physically feasible, robotic control behavior without the need for precise reward function tuning.
arXiv Detail & Related papers (2020-02-22T10:15:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.