Safety-Constrained Policy Transfer with Successor Features
- URL: http://arxiv.org/abs/2211.05361v1
- Date: Thu, 10 Nov 2022 06:06:36 GMT
- Title: Safety-Constrained Policy Transfer with Successor Features
- Authors: Zeyu Feng, Bowen Zhang, Jianxin Bi, Harold Soh
- Abstract summary: We propose a Constrained Markov Decision Process (CMDP) formulation that enables the transfer of policies and adherence to safety constraints.
Our approach relies on a novel extension of generalized policy improvement to constrained settings via a Lagrangian formulation.
Our experiments in simulated domains show that our approach is effective; it visits unsafe states less frequently and outperforms alternative state-of-the-art methods when taking safety constraints into account.
- Score: 19.754549649781644
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we focus on the problem of safe policy transfer in
reinforcement learning: we seek to leverage existing policies when learning a
new task with specified constraints. This problem is important for
safety-critical applications where interactions are costly and unconstrained
policies can lead to undesirable or dangerous outcomes, e.g., with physical
robots that interact with humans. We propose a Constrained Markov Decision
Process (CMDP) formulation that simultaneously enables the transfer of policies
and adherence to safety constraints. Our formulation cleanly separates task
goals from safety considerations and permits the specification of a wide
variety of constraints. Our approach relies on a novel extension of generalized
policy improvement to constrained settings via a Lagrangian formulation. We
devise a dual optimization algorithm that estimates the optimal dual variable
of a target task, thus enabling safe transfer of policies derived from
successor features learned on source tasks. Our experiments in simulated
domains show that our approach is effective; it visits unsafe states less
frequently and outperforms alternative state-of-the-art methods when taking
safety constraints into account.
Related papers
- Flipping-based Policy for Chance-Constrained Markov Decision Processes [9.404184937255694]
This paper proposes a textitflipping-based policy for Chance-Constrained Markov Decision Processes ( CCMDPs)
The flipping-based policy selects the next action by tossing a potentially distorted coin between two action candidates.
We demonstrate that the flipping-based policy can improve the performance of the existing safe RL algorithms under the same limits of safety constraints.
arXiv Detail & Related papers (2024-10-09T02:00:39Z) - Safe and Balanced: A Framework for Constrained Multi-Objective Reinforcement Learning [26.244121960815907]
We propose a primal-based framework that orchestrates policy optimization between multi-objective learning and constraint adherence.
Our method employs a novel natural policy gradient manipulation method to optimize multiple RL objectives.
Empirically, our proposed method also outperforms prior state-of-the-art methods on challenging safe multi-objective reinforcement learning tasks.
arXiv Detail & Related papers (2024-05-26T00:42:10Z) - Policy Bifurcation in Safe Reinforcement Learning [35.75059015441807]
In some scenarios, the feasible policy should be discontinuous or multi-valued, interpolating between discontinuous local optima can inevitably lead to constraint violations.
We are the first to identify the generating mechanism of such a phenomenon, and employ topological analysis to rigorously prove the existence of bifurcation in safe RL.
We propose a safe RL algorithm called multimodal policy optimization (MUPO), which utilizes a Gaussian mixture distribution as the policy output.
arXiv Detail & Related papers (2024-03-19T15:54:38Z) - Uniformly Safe RL with Objective Suppression for Multi-Constraint Safety-Critical Applications [73.58451824894568]
The widely adopted CMDP model constrains the risks in expectation, which makes room for dangerous behaviors in long-tail states.
In safety-critical domains, such behaviors could lead to disastrous outcomes.
We propose Objective Suppression, a novel method that adaptively suppresses the task reward maximizing objectives according to a safety critic.
arXiv Detail & Related papers (2024-02-23T23:22:06Z) - IOB: Integrating Optimization Transfer and Behavior Transfer for
Multi-Policy Reuse [50.90781542323258]
Reinforcement learning (RL) agents can transfer knowledge from source policies to a related target task.
Previous methods introduce additional components, such as hierarchical policies or estimations of source policies' value functions.
We propose a novel transfer RL method that selects the source policy without training extra components.
arXiv Detail & Related papers (2023-08-14T09:22:35Z) - Penalized Proximal Policy Optimization for Safe Reinforcement Learning [68.86485583981866]
We propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem.
P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective.
We show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotive tasks.
arXiv Detail & Related papers (2022-05-24T06:15:51Z) - Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
arXiv Detail & Related papers (2020-12-30T03:22:35Z) - Constrained Markov Decision Processes via Backward Value Functions [43.649330976089004]
We model the problem of learning with constraints as a Constrained Markov Decision Process.
A key contribution of our approach is to translate cumulative cost constraints into state-based constraints.
We provide theoretical guarantees under which the agent converges while ensuring safety over the course of training.
arXiv Detail & Related papers (2020-08-26T20:56:16Z) - Variational Policy Propagation for Multi-agent Reinforcement Learning [68.26579560607597]
We propose a emphcollaborative multi-agent reinforcement learning algorithm named variational policy propagation (VPP) to learn a emphjoint policy through the interactions over agents.
We prove that the joint policy is a Markov Random Field under some mild conditions, which in turn reduces the policy space effectively.
We integrate the variational inference as special differentiable layers in policy such as the actions can be efficiently sampled from the Markov Random Field and the overall policy is differentiable.
arXiv Detail & Related papers (2020-04-19T15:42:55Z) - Deep Constrained Q-learning [15.582910645906145]
In many real world applications, reinforcement learning agents have to optimize multiple objectives while following certain rules or satisfying a set of constraints.
We propose Constrained Q-learning, a novel off-policy reinforcement learning framework restricting the action space directly in the Q-update to learn the optimal Q-function for the induced constrained MDP and the corresponding safe policy.
arXiv Detail & Related papers (2020-03-20T17:26:03Z) - Preventing Imitation Learning with Adversarial Policy Ensembles [79.81807680370677]
Imitation learning can reproduce policies by observing experts, which poses a problem regarding policy privacy.
How can we protect against external observers cloning our proprietary policies?
We introduce a new reinforcement learning framework, where we train an ensemble of near-optimal policies.
arXiv Detail & Related papers (2020-01-31T01:57:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.