Preventing Imitation Learning with Adversarial Policy Ensembles
- URL: http://arxiv.org/abs/2002.01059v2
- Date: Sun, 2 Aug 2020 23:15:58 GMT
- Title: Preventing Imitation Learning with Adversarial Policy Ensembles
- Authors: Albert Zhan, Stas Tiomkin, Pieter Abbeel
- Abstract summary: Imitation learning can reproduce policies by observing experts, which poses a problem regarding policy privacy.
How can we protect against external observers cloning our proprietary policies?
We introduce a new reinforcement learning framework, where we train an ensemble of near-optimal policies.
- Score: 79.81807680370677
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imitation learning can reproduce policies by observing experts, which poses a
problem regarding policy privacy. Policies, whether human or running on
deployed robots, can be cloned without their owners' consent. How can we
protect against external observers cloning our proprietary policies? To answer
this question we introduce a new reinforcement learning framework, where we
train an ensemble of near-optimal policies, whose demonstrations are guaranteed
to be useless for an external observer. We formulate this idea by a constrained
optimization problem, where the objective is to improve proprietary policies,
and at the same time deteriorate the virtual policy of an eventual external
observer. We design a tractable algorithm to solve this new optimization
problem by modifying the standard policy gradient algorithm. Our formulation
can be interpreted through the lenses of confidentiality and adversarial behaviour,
which enables a broader perspective of this work. We demonstrate the existence
of "non-clonable" ensembles, providing a solution to the above optimization
problem, computed by our modified policy gradient algorithm. To our
knowledge, this is the first work regarding the protection of policies in
Reinforcement Learning.
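A minimal sketch of this idea in penalty form, not the authors' exact constrained algorithm: a two-member softmax ensemble is trained on a tiny two-step task in which reward requires the two actions to agree, so each member can be near-optimal while the pooled mixture (a stand-in for what an external observer would behavior-clone from joint demonstrations) stays mediocre. The task, the penalty weight `ALPHA`, and the finite-difference gradients are all illustrative assumptions.

```python
import numpy as np

# Illustrative two-step task: return is the probability that the actions
# taken at step 1 and step 2 agree, so "always-L" and "always-R" are both
# optimal, but a 50/50 mixture of them agrees only half the time.
rng = np.random.default_rng(0)
K, STEPS, A = 2, 2, 2                     # ensemble members, steps, actions
ALPHA, LR, EPS = 1.0, 0.1, 1e-5           # assumed penalty weight, step sizes
theta = rng.normal(scale=0.1, size=(K, STEPS, A))

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def agree_return(p):                      # p[t, a] = pi(a | step t)
    return float(sum(p[0, a] * p[1, a] for a in range(A)))

def objective(th, k):                     # own return minus observer's return
    probs = softmax(th)
    return agree_return(probs[k]) - ALPHA * agree_return(probs.mean(axis=0))

for _ in range(3000):                     # modified policy-gradient ascent
    for k in range(K):
        base = objective(theta, k)
        grad = np.zeros_like(theta[k])
        for idx in np.ndindex(*theta[k].shape):
            bumped = theta.copy()
            bumped[k][idx] += EPS         # finite differences for brevity
            grad[idx] = (objective(bumped, k) - base) / EPS
        theta[k] += LR * grad

probs = softmax(theta)
print("member returns :", [round(agree_return(p), 3) for p in probs])
print("observer return:", round(agree_return(probs.mean(axis=0)), 3))
```

When the members split into "always-L" and "always-R", each earns a return near 1 while the cloned mixture earns about 0.5, which is the "non-clonable ensemble" effect in miniature.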
Related papers
- Conservative Exploration for Policy Optimization via Off-Policy Policy Evaluation [4.837737516460689]
We study the problem of conservative exploration, where the learner must guarantee that its performance is at least as good as that of a baseline policy; see the sketch after this entry.
We propose the first conservative provably efficient model-free algorithm for policy optimization in continuous finite-horizon problems.
arXiv Detail & Related papers (2023-12-24T10:59:32Z)
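A hedged sketch of the conservative-exploration gate described in the entry above, not the paper's algorithm: a candidate policy is deployed only if an importance-sampling off-policy estimate of its value, built from data logged by the baseline, beats the baseline's value by a safety margin. The bandit, the margin, and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean = np.array([0.4, 0.6])               # unknown per-arm reward rates

baseline = np.array([0.8, 0.2])                # current safe behavior policy
logs = []                                      # (action, behavior prob, reward)
for _ in range(5000):
    a = rng.choice(2, p=baseline)
    logs.append((a, baseline[a], rng.binomial(1, true_mean[a])))

def ope(target, logs):
    # Importance-sampling estimate of E_target[r] from behavior-policy logs.
    return float(np.mean([target[a] / b * r for a, b, r in logs]))

candidate = np.array([0.2, 0.8])               # proposed exploratory policy
v_base, v_cand = ope(baseline, logs), ope(candidate, logs)
MARGIN = 0.02                                  # assumed safety margin
print(f"baseline={v_base:.3f}  candidate={v_cand:.3f}  "
      f"deploy={v_cand >= v_base + MARGIN}")   # conservative gate
```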
- Acceleration in Policy Optimization [50.323182853069184]
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight into the policy improvement step via optimistic and adaptive updates.
We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change.
We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration in an illustrative task; see the sketch after this entry.
arXiv Detail & Related papers (2023-06-18T15:50:57Z)
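A hedged sketch of one reading of "optimism as predictive modelling" from the entry above, not the paper's meta-gradient method: the gradient is evaluated at an extrapolated policy that anticipates where the current update trend is heading, as in optimistic/extragradient schemes. The bandit task and the extrapolation coefficient `BETA` are illustrative.

```python
import numpy as np

reward = np.array([0.2, 0.5, 1.0])           # one-state task, 3 actions
theta, theta_prev = np.zeros(3), np.zeros(3)
LR, BETA = 0.5, 0.9                          # assumed step size, extrapolation

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad(t):
    # Exact gradient of E_softmax(t)[reward] with respect to the logits.
    p = softmax(t)
    return p * (reward - p @ reward)

for _ in range(200):
    lookahead = theta + BETA * (theta - theta_prev)  # predicted next policy
    theta_prev, theta = theta, theta + LR * grad(lookahead)

print("final policy:", np.round(softmax(theta), 3))  # concentrates on arm 2
```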
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates; see the sketch after this entry.
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
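A hedged sketch of the LCB principle from the entry above, not the PPL algorithm itself: offline data scores each action by its mean minus a multiple of the standard error, so sparsely observed actions are judged pessimistically. The data and the coefficient `C` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
offline_rewards = {
    0: rng.normal(0.5, 0.1, size=400),       # well-covered action
    1: np.array([1.2, 0.2, 0.7]),            # barely-observed action (n = 3)
}
C = 2.0                                      # assumed pessimism coefficient

def lcb(r):
    # Lower confidence bound: sample mean minus C standard errors.
    return r.mean() - C * r.std(ddof=1) / np.sqrt(len(r))

for a, r in offline_rewards.items():
    print(f"action {a}: n={len(r):3d}  mean={r.mean():.3f}  LCB={lcb(r):.3f}")

best = max(offline_rewards, key=lambda a: lcb(offline_rewards[a]))
print("pessimistic choice: action", best)    # picks the well-supported arm
```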
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning.
In this paper, we propose closed-form policy improvement operators; a generic behavior-constrained sketch follows this entry.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
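The entry above gives no detail about its operators, so the sketch below shows a different but standard closed-form behavior-constrained improvement (the KL-regularized solution behind advantage-weighted methods), plainly not the paper's construction: maximizing E_pi[A] - (1/BETA) * KL(pi || behavior) yields pi(a|s) proportional to behavior(a|s) * exp(BETA * A(s, a)).

```python
import numpy as np

behavior = np.array([0.5, 0.3, 0.2])     # logged behavior policy (one state)
advantage = np.array([-0.2, 0.1, 0.6])   # assumed advantage estimates
BETA = 2.0                               # constraint strength (illustrative)

# Closed-form solution of max_pi E_pi[A] - (1/BETA) * KL(pi || behavior):
improved = behavior * np.exp(BETA * advantage)
improved /= improved.sum()

print("behavior:", behavior)
print("improved:", np.round(improved, 3))  # mass shifts to high-advantage
                                           # actions, anchored to behavior
```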
- Online Learning with Off-Policy Feedback [18.861989132159945]
We study the problem of online learning in adversarial bandit problems under a partial observability model called off-policy feedback.
We propose a set of algorithms whose regret bounds scale with a natural notion of mismatch between any comparator policy and the behavior policy; see the sketch after this entry.
arXiv Detail & Related papers (2022-07-18T21:57:16Z)
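A hedged sketch of the off-policy-feedback setting above, not the paper's algorithms: the learner never acts itself; it observes only actions and losses generated by a fixed behavior policy and maintains importance-weighted loss estimates for every arm inside an exponential-weights (EXP3-style) update. Constants are illustrative; the high variance of estimates for arms the behavior policy rarely plays is exactly the "mismatch" the regret bounds capture.

```python
import numpy as np

rng = np.random.default_rng(3)
mean_loss = np.array([0.7, 0.3, 0.5])     # unknown per-arm mean losses
behavior = np.array([0.6, 0.2, 0.2])      # fixed data-collection policy
ETA, K = 0.05, 3                          # assumed learning rate, arm count
weights = np.ones(K) / K

for _ in range(3000):
    a = rng.choice(K, p=behavior)          # only the behavior policy acts
    loss = rng.binomial(1, mean_loss[a])   # bandit feedback for its action
    est = np.zeros(K)
    est[a] = loss / behavior[a]            # unbiased importance-weighted loss
    weights *= np.exp(-ETA * est)          # exponential-weights update
    weights /= weights.sum()

print("learned policy:", np.round(weights, 3))  # concentrates on arm 1
```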
- Memory-Constrained Policy Optimization [59.63021433336966]
We introduce a new constrained optimization method for policy gradient reinforcement learning.
We form a second trust region through the construction of another virtual policy that represents a wide range of past policies.
We then enforce the new policy to stay closer to the virtual policy, which is beneficial when the old policy performs badly; see the sketch after this entry.
arXiv Detail & Related papers (2022-04-20T08:50:23Z)
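A hedged sketch of the virtual-policy idea above, not the paper's exact trust-region method: an exponentially averaged "virtual" policy summarizes past policies in constant memory, and the update is penalized for drifting away from it. The usual old-policy trust region is omitted here because its KL gradient vanishes at the current point; all coefficients are illustrative.

```python
import numpy as np

reward = np.array([0.1, 0.4, 1.0])
theta = np.zeros(3)
virtual = np.full(3, 1 / 3)               # running average of past policies
LR, C_VIRT, MIX = 0.5, 0.1, 0.05          # assumed step/penalty/memory rates

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(300):
    p = softmax(theta)
    grad_r = p * (reward - p @ reward)    # gradient of expected reward
    log_ratio = np.log(p / virtual)       # gradient of KL(pi || virtual):
    grad_kl = p * (log_ratio - p @ log_ratio)
    theta += LR * (grad_r - C_VIRT * grad_kl)
    virtual = (1 - MIX) * virtual + MIX * softmax(theta)  # constant memory

print("policy :", np.round(softmax(theta), 3))
print("virtual:", np.round(virtual, 3))
```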
- Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for the optimization of privacy-constrained policies; see the sketch after this entry.
arXiv Detail & Related papers (2020-12-30T03:22:35Z)
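A hedged sketch of the mutual-information objective above, not the paper's model-based estimator: a policy is scored by E[reward] - LAM * I(S; A) for a sensitive state S and action A, with the MI computed exactly by enumeration in this tiny setting. All numbers are illustrative.

```python
import numpy as np

p_s = np.array([0.5, 0.5])                # prior over the sensitive state S
reward = np.array([[1.0, 0.8],            # reward[s, a]
                   [0.8, 1.0]])
LAM = 0.5                                 # assumed privacy weight

def score(policy):                        # policy[s, a] = pi(a | s)
    joint = p_s[:, None] * policy         # P(s, a)
    indep = p_s[:, None] * joint.sum(axis=0)[None, :]
    m = joint > 0                         # skip log(0) terms (0 log 0 = 0)
    mi = float(np.sum(joint[m] * np.log(joint[m] / indep[m])))
    r = float((joint * reward).sum())
    return r, mi, r - LAM * mi

leaky = np.eye(2)                         # action deterministically reveals S
private = np.full((2, 2), 0.5)            # action independent of S
for name, pol in [("leaky", leaky), ("private", private)]:
    r, mi, obj = score(pol)
    print(f"{name:7s} reward={r:.2f}  I(S;A)={mi:.2f}  objective={obj:.2f}")
```

With LAM = 0.5, the fully informative policy's extra reward no longer covers its leakage, so the randomized policy scores higher.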
- Policy Supervectors: General Characterization of Agents by their Behaviour [18.488655590845163]
We propose policy supervectors for characterizing agents by the distribution of states they visit.
Policy supervectors can characterize policies regardless of their design philosophy and scale to thousands of policies on a single workstation machine.
We demonstrate the method's applicability by studying the evolution of policies during reinforcement learning, evolutionary training, and imitation learning; see the sketch after this entry.
arXiv Detail & Related papers (2020-12-02T14:43:16Z)
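A hedged sketch of the supervector idea above, not the paper's pipeline (which fits Gaussian mixture models over visited states): each policy is summarized by simple statistics of the states it visits, and policies are then compared in that summary space. The random-walk "environment" and the two-dimensional descriptor are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def rollout(drift, steps=2000):
    """Visit 1-D states under a policy with the given action drift."""
    s, visited = 0.0, []
    for _ in range(steps):
        s += drift + rng.normal(0.0, 0.5)
        visited.append(s)
    return np.array(visited)

def supervector(states):
    # Toy 2-D descriptor of the visited-state distribution.
    return np.array([states.mean(), states.std()])

a = supervector(rollout(+0.05))            # two runs of a "go right" policy
b = supervector(rollout(+0.05))
c = supervector(rollout(-0.05))            # one run of a "go left" policy
print("dist(a, b) =", round(float(np.linalg.norm(a - b)), 2))  # same policy
print("dist(a, c) =", round(float(np.linalg.norm(a - c)), 2))  # different
```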
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.