Preventing Imitation Learning with Adversarial Policy Ensembles
- URL: http://arxiv.org/abs/2002.01059v2
- Date: Sun, 2 Aug 2020 23:15:58 GMT
- Title: Preventing Imitation Learning with Adversarial Policy Ensembles
- Authors: Albert Zhan, Stas Tiomkin, Pieter Abbeel
- Abstract summary: Imitation learning can reproduce policies by observing experts, which poses a problem regarding policy privacy.
How can we protect against external observers cloning our proprietary policies?
We introduce a new reinforcement learning framework, where we train an ensemble of near-optimal policies.
- Score: 79.81807680370677
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imitation learning can reproduce policies by observing experts, which poses a
problem regarding policy privacy. Policies, whether human or running on
deployed robots, can be cloned without their owners' consent. How can we
protect against external observers cloning our proprietary policies? To answer
this question we introduce a new reinforcement learning framework, where we
train an ensemble of near-optimal policies, whose demonstrations are guaranteed
to be useless for an external observer. We formulate this idea by a constrained
optimization problem, where the objective is to improve proprietary policies,
and at the same time deteriorate the virtual policy of an eventual external
observer. We design a tractable algorithm to solve this new optimization
problem by modifying the standard policy gradient algorithm. Our formulation
can be interpreted through the lenses of confidentiality and adversarial behaviour,
which enables a broader perspective of this work. We demonstrate the existence
of "non-clonable" ensembles, providing a solution to the above optimization
problem, computed by our modified policy gradient algorithm. To our
knowledge, this is the first work regarding the protection of policies in
Reinforcement Learning.
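A minimal sketch of this idea in penalty form, not the authors' exact constrained algorithm: a two-member softmax ensemble is trained on a tiny two-step task in which reward requires the two actions to agree, so each member can be near-optimal while the pooled mixture (a stand-in for what an external observer would behavior-clone from joint demonstrations) stays mediocre. The task, the penalty weight `ALPHA`, and the finite-difference gradients are all illustrative assumptions.

```python
import numpy as np

# Illustrative two-step task: return is the probability that the actions
# taken at step 1 and step 2 agree, so "always-L" and "always-R" are both
# optimal, but a 50/50 mixture of them agrees only half the time.
rng = np.random.default_rng(0)
K, STEPS, A = 2, 2, 2                     # ensemble members, steps, actions
ALPHA, LR, EPS = 1.0, 0.1, 1e-5           # assumed penalty weight, step sizes
theta = rng.normal(scale=0.1, size=(K, STEPS, A))

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def agree_return(p):                      # p[t, a] = pi(a | step t)
    return float(sum(p[0, a] * p[1, a] for a in range(A)))

def objective(th, k):                     # own return minus observer's return
    probs = softmax(th)
    return agree_return(probs[k]) - ALPHA * agree_return(probs.mean(axis=0))

for _ in range(3000):                     # modified policy-gradient ascent
    for k in range(K):
        base = objective(theta, k)
        grad = np.zeros_like(theta[k])
        for idx in np.ndindex(*theta[k].shape):
            bumped = theta.copy()
            bumped[k][idx] += EPS         # finite differences for brevity
            grad[idx] = (objective(bumped, k) - base) / EPS
        theta[k] += LR * grad

probs = softmax(theta)
print("member returns :", [round(agree_return(p), 3) for p in probs])
print("observer return:", round(agree_return(probs.mean(axis=0)), 3))
```

When the members split into "always-L" and "always-R", each earns a return near 1 while the cloned mixture earns about 0.5, which is the "non-clonable ensemble" effect in miniature.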
Related papers
- Conservative Exploration for Policy Optimization via Off-Policy Policy Evaluation [4.837737516460689]
We study the problem of conservative exploration, where the learner must guarantee that its performance is at least as good as that of a baseline policy; see the sketch after this entry.
We propose the first conservative provably efficient model-free algorithm for policy optimization in continuous finite-horizon problems.
arXiv Detail & Related papers (2023-12-24T10:59:32Z)
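A hedged sketch of the conservative-exploration gate described in the entry above, not the paper's algorithm: a candidate policy is deployed only if an importance-sampling off-policy estimate of its value, built from data logged by the baseline, beats the baseline's value by a safety margin. The bandit, the margin, and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean = np.array([0.4, 0.6])               # unknown per-arm reward rates

baseline = np.array([0.8, 0.2])                # current safe behavior policy
logs = []                                      # (action, behavior prob, reward)
for _ in range(5000):
    a = rng.choice(2, p=baseline)
    logs.append((a, baseline[a], rng.binomial(1, true_mean[a])))

def ope(target, logs):
    # Importance-sampling estimate of E_target[r] from behavior-policy logs.
    return float(np.mean([target[a] / b * r for a, b, r in logs]))

candidate = np.array([0.2, 0.8])               # proposed exploratory policy
v_base, v_cand = ope(baseline, logs), ope(candidate, logs)
MARGIN = 0.02                                  # assumed safety margin
print(f"baseline={v_base:.3f}  candidate={v_cand:.3f}  "
      f"deploy={v_cand >= v_base + MARGIN}")   # conservative gate
```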
- Acceleration in Policy Optimization [50.323182853069184]
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight into the policy improvement step via optimistic and adaptive updates.
We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change.
We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration in an illustrative task; see the sketch after this entry.
arXiv Detail & Related papers (2023-06-18T15:50:57Z)
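A hedged sketch of one reading of "optimism as predictive modelling" from the entry above, not the paper's meta-gradient method: the gradient is evaluated at an extrapolated policy that anticipates where the current update trend is heading, as in optimistic/extragradient schemes. The bandit task and the extrapolation coefficient `BETA` are illustrative.

```python
import numpy as np

reward = np.array([0.2, 0.5, 1.0])           # one-state task, 3 actions
theta, theta_prev = np.zeros(3), np.zeros(3)
LR, BETA = 0.5, 0.9                          # assumed step size, extrapolation

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad(t):
    # Exact gradient of E_softmax(t)[reward] with respect to the logits.
    p = softmax(t)
    return p * (reward - p @ reward)

for _ in range(200):
    lookahead = theta + BETA * (theta - theta_prev)  # predicted next policy
    theta_prev, theta = theta, theta + LR * grad(lookahead)

print("final policy:", np.round(softmax(theta), 3))  # concentrates on arm 2
```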
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates; see the sketch after this entry.
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
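A hedged sketch of the LCB principle from the entry above, not the PPL algorithm itself: offline data scores each action by its mean minus a multiple of the standard error, so sparsely observed actions are judged pessimistically. The data and the coefficient `C` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
offline_rewards = {
    0: rng.normal(0.5, 0.1, size=400),       # well-covered action
    1: np.array([1.2, 0.2, 0.7]),            # barely-observed action (n = 3)
}
C = 2.0                                      # assumed pessimism coefficient

def lcb(r):
    # Lower confidence bound: sample mean minus C standard errors.
    return r.mean() - C * r.std(ddof=1) / np.sqrt(len(r))

for a, r in offline_rewards.items():
    print(f"action {a}: n={len(r):3d}  mean={r.mean():.3f}  LCB={lcb(r):.3f}")

best = max(offline_rewards, key=lambda a: lcb(offline_rewards[a]))
print("pessimistic choice: action", best)    # picks the well-supported arm
```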
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning.
In this paper, we propose closed-form policy improvement operators; a generic behavior-constrained sketch follows this entry.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
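The entry above gives no detail about its operators, so the sketch below shows a different but standard closed-form behavior-constrained improvement (the KL-regularized solution behind advantage-weighted methods), plainly not the paper's construction: maximizing E_pi[A] - (1/BETA) * KL(pi || behavior) yields pi(a|s) proportional to behavior(a|s) * exp(BETA * A(s, a)).

```python
import numpy as np

behavior = np.array([0.5, 0.3, 0.2])     # logged behavior policy (one state)
advantage = np.array([-0.2, 0.1, 0.6])   # assumed advantage estimates
BETA = 2.0                               # constraint strength (illustrative)

# Closed-form solution of max_pi E_pi[A] - (1/BETA) * KL(pi || behavior):
improved = behavior * np.exp(BETA * advantage)
improved /= improved.sum()

print("behavior:", behavior)
print("improved:", np.round(improved, 3))  # mass shifts to high-advantage
                                           # actions, anchored to behavior
```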
- Online Learning with Off-Policy Feedback [18.861989132159945]
We study the problem of online learning in adversarial bandit problems under a partial observability model called off-policy feedback.
We propose a set of algorithms whose regret bounds scale with a natural notion of mismatch between any comparator policy and the behavior policy; see the sketch after this entry.
arXiv Detail & Related papers (2022-07-18T21:57:16Z)
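A hedged sketch of the off-policy-feedback setting above, not the paper's algorithms: the learner never acts itself; it observes only actions and losses generated by a fixed behavior policy and maintains importance-weighted loss estimates for every arm inside an exponential-weights (EXP3-style) update. Constants are illustrative; the high variance of estimates for arms the behavior policy rarely plays is exactly the "mismatch" the regret bounds capture.

```python
import numpy as np

rng = np.random.default_rng(3)
mean_loss = np.array([0.7, 0.3, 0.5])     # unknown per-arm mean losses
behavior = np.array([0.6, 0.2, 0.2])      # fixed data-collection policy
ETA, K = 0.05, 3                          # assumed learning rate, arm count
weights = np.ones(K) / K

for _ in range(3000):
    a = rng.choice(K, p=behavior)          # only the behavior policy acts
    loss = rng.binomial(1, mean_loss[a])   # bandit feedback for its action
    est = np.zeros(K)
    est[a] = loss / behavior[a]            # unbiased importance-weighted loss
    weights *= np.exp(-ETA * est)          # exponential-weights update
    weights /= weights.sum()

print("learned policy:", np.round(weights, 3))  # concentrates on arm 1
```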
- Memory-Constrained Policy Optimization [59.63021433336966]
We introduce a new constrained optimization method for policy gradient reinforcement learning.
We form a second trust region through the construction of another virtual policy that represents a wide range of past policies.
We then enforce the new policy to stay closer to the virtual policy, which is beneficial when the old policy performs badly; see the sketch after this entry.
arXiv Detail & Related papers (2022-04-20T08:50:23Z)
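A hedged sketch of the virtual-policy idea above, not the paper's exact trust-region method: an exponentially averaged "virtual" policy summarizes past policies in constant memory, and the update is penalized for drifting away from it. The usual old-policy trust region is omitted here because its KL gradient vanishes at the current point; all coefficients are illustrative.

```python
import numpy as np

reward = np.array([0.1, 0.4, 1.0])
theta = np.zeros(3)
virtual = np.full(3, 1 / 3)               # running average of past policies
LR, C_VIRT, MIX = 0.5, 0.1, 0.05          # assumed step/penalty/memory rates

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(300):
    p = softmax(theta)
    grad_r = p * (reward - p @ reward)    # gradient of expected reward
    log_ratio = np.log(p / virtual)       # gradient of KL(pi || virtual):
    grad_kl = p * (log_ratio - p @ log_ratio)
    theta += LR * (grad_r - C_VIRT * grad_kl)
    virtual = (1 - MIX) * virtual + MIX * softmax(theta)  # constant memory

print("policy :", np.round(softmax(theta), 3))
print("virtual:", np.round(virtual, 3))
```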
- Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for the optimization of privacy-constrained policies; see the sketch after this entry.
arXiv Detail & Related papers (2020-12-30T03:22:35Z)
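A hedged sketch of the mutual-information objective above, not the paper's model-based estimator: a policy is scored by E[reward] - LAM * I(S; A) for a sensitive state S and action A, with the MI computed exactly by enumeration in this tiny setting. All numbers are illustrative.

```python
import numpy as np

p_s = np.array([0.5, 0.5])                # prior over the sensitive state S
reward = np.array([[1.0, 0.8],            # reward[s, a]
                   [0.8, 1.0]])
LAM = 0.5                                 # assumed privacy weight

def score(policy):                        # policy[s, a] = pi(a | s)
    joint = p_s[:, None] * policy         # P(s, a)
    indep = p_s[:, None] * joint.sum(axis=0)[None, :]
    m = joint > 0                         # skip log(0) terms (0 log 0 = 0)
    mi = float(np.sum(joint[m] * np.log(joint[m] / indep[m])))
    r = float((joint * reward).sum())
    return r, mi, r - LAM * mi

leaky = np.eye(2)                         # action deterministically reveals S
private = np.full((2, 2), 0.5)            # action independent of S
for name, pol in [("leaky", leaky), ("private", private)]:
    r, mi, obj = score(pol)
    print(f"{name:7s} reward={r:.2f}  I(S;A)={mi:.2f}  objective={obj:.2f}")
```

With LAM = 0.5, the fully informative policy's extra reward no longer covers its leakage, so the randomized policy scores higher.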
- Policy Supervectors: General Characterization of Agents by their Behaviour [18.488655590845163]
We propose policy supervectors for characterizing agents by the distribution of states they visit.
Policy supervectors can characterize policies regardless of their design philosophy and scale to thousands of policies on a single workstation machine.
We demonstrate the method's applicability by studying the evolution of policies during reinforcement learning, evolutionary training, and imitation learning; see the sketch after this entry.
arXiv Detail & Related papers (2020-12-02T14:43:16Z)
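A hedged sketch of the supervector idea above, not the paper's pipeline (which fits Gaussian mixture models over visited states): each policy is summarized by simple statistics of the states it visits, and policies are then compared in that summary space. The random-walk "environment" and the two-dimensional descriptor are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def rollout(drift, steps=2000):
    """Visit 1-D states under a policy with the given action drift."""
    s, visited = 0.0, []
    for _ in range(steps):
        s += drift + rng.normal(0.0, 0.5)
        visited.append(s)
    return np.array(visited)

def supervector(states):
    # Toy 2-D descriptor of the visited-state distribution.
    return np.array([states.mean(), states.std()])

a = supervector(rollout(+0.05))            # two runs of a "go right" policy
b = supervector(rollout(+0.05))
c = supervector(rollout(-0.05))            # one run of a "go left" policy
print("dist(a, b) =", round(float(np.linalg.norm(a - b)), 2))  # same policy
print("dist(a, c) =", round(float(np.linalg.norm(a - c)), 2))  # different
```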
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.