Adversarial Policy Optimization in Deep Reinforcement Learning
- URL: http://arxiv.org/abs/2304.14533v1
- Date: Thu, 27 Apr 2023 21:01:08 GMT
- Title: Adversarial Policy Optimization in Deep Reinforcement Learning
- Authors: Md Masudur Rahman and Yexiang Xue
- Abstract summary: The policy represented by a deep neural network can overfit, which hinders a reinforcement learning agent from learning an effective policy.
Data augmentation can provide a performance boost to RL agents by mitigating the effect of overfitting.
We propose a novel RL algorithm to mitigate the above issue and improve the efficiency of the learned policy.
- Score: 16.999444076456268
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The policy represented by a deep neural network can overfit to spurious
features in observations, which hinders a reinforcement learning agent from
learning an effective policy. This issue becomes severe in high-dimensional state
spaces, where the agent struggles to learn a useful policy. Data augmentation can
provide a performance boost to RL agents by mitigating the effect of
overfitting. However, such data augmentation is a form of prior knowledge, and
naively applying it in environments might worsen an agent's performance. In
this paper, we propose a novel RL algorithm to mitigate the above issue and
improve the efficiency of the learned policy. Our approach consists of a
max-min game theoretic objective where a perturber network modifies the state
to maximize the agent's probability of taking a different action while
minimizing the distortion in the state. In contrast, the policy network updates
its parameters to minimize the effect of perturbation while maximizing the
expected future reward. Based on this objective, we propose a practical deep
reinforcement learning algorithm, Adversarial Policy Optimization (APO). Our
method is agnostic to the type of policy optimization, and thus data
augmentation can be incorporated to harness the benefit. We evaluated our
approach on several DeepMind Control robotic environments with
high-dimensional and noisy state settings. Empirical results demonstrate that
our method APO consistently outperforms the state-of-the-art on-policy PPO
agent. We further compare our method with the state-of-the-art data augmentation
method RAD and the regularization-based approach DRAC. Our agent APO shows better
performance compared to these baselines.
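
The max-min objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state how "probability of taking a different action" or "distortion" are measured, so the sketch assumes a KL divergence between action distributions and a mean squared error on states. The linear softmax policy, the coefficients `distortion_coef` and `robustness_coef`, and the `rl_loss` placeholder (standing in for any policy-optimization loss, such as a PPO surrogate) are all hypothetical names introduced for illustration.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-8):
    # KL(p || q) per row; eps guards against log(0).
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def policy_probs(weights, states):
    # Hypothetical linear policy: action probabilities for each state.
    return softmax(states @ weights)

def perturber_objective(weights, states, perturbed_states, distortion_coef=1.0):
    # Quantity the perturber would maximize (assumed form):
    # large change in the action distribution, small change in the state.
    p_clean = policy_probs(weights, states)
    p_pert = policy_probs(weights, perturbed_states)
    action_shift = kl_divergence(p_clean, p_pert).mean()
    distortion = np.mean((perturbed_states - states) ** 2)
    return action_shift - distortion_coef * distortion

def policy_loss(weights, states, perturbed_states, rl_loss, robustness_coef=1.0):
    # Quantity the policy would minimize (assumed form):
    # the usual RL loss plus a penalty for reacting to the perturbation.
    p_clean = policy_probs(weights, states)
    p_pert = policy_probs(weights, perturbed_states)
    return rl_loss + robustness_coef * kl_divergence(p_clean, p_pert).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 3))            # 4 state features, 3 actions
    s = rng.normal(size=(5, 4))            # batch of 5 states
    s_adv = s + 0.1 * rng.normal(size=s.shape)  # stand-in for a perturber's output
    print("perturber objective:", perturber_objective(W, s, s_adv))
    print("policy loss:", policy_loss(W, s, s_adv, rl_loss=0.5))
```

In a full training loop the two quantities would be optimized in alternation, with the perturber ascending its objective and the policy descending its loss; here the perturbed states are random noise only to keep the example self-contained.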
Related papers
- Adaptive Opponent Policy Detection in Multi-Agent MDPs: Real-Time Strategy Switch Identification Using Running Error Estimation [1.079960007119637]
OPS-DeMo is an online algorithm that employs dynamic error decay to detect changes in opponents' policies.
Our approach outperforms PPO-trained models in dynamic scenarios like the Predator-Prey setting.
arXiv Detail & Related papers (2024-06-10T17:34:44Z)
- Reflective Policy Optimization [20.228281670899204]
Reflective Policy Optimization (RPO) amalgamates past and future state-action information for policy optimization.
RPO empowers the agent for introspection, allowing modifications to its actions within the current state.
Empirical results demonstrate RPO's feasibility and efficacy in two reinforcement learning benchmarks.
arXiv Detail & Related papers (2024-06-06T01:46:49Z)
- Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality.
We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO).
Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions.
We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z)
- Adversarial Style Transfer for Robust Policy Optimization in Deep Reinforcement Learning [13.652106087606471]
This paper proposes an algorithm that aims to improve generalization for reinforcement learning agents by removing overfitting to confounding features.
A policy network updates its parameters to minimize the effect of such perturbations, thus staying robust while maximizing the expected future reward.
We evaluate our approach on Procgen and Distracting Control Suite for generalization and sample efficiency.
arXiv Detail & Related papers (2023-08-29T18:17:35Z)
- Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z)
- Robust Policy Optimization in Deep Reinforcement Learning [16.999444076456268]
In continuous action domains, a parameterized action distribution allows easy control of exploration.
In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed distribution.
We evaluated our methods on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym.
arXiv Detail & Related papers (2022-12-14T22:43:56Z)
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
arXiv Detail & Related papers (2020-12-30T03:22:35Z)
- Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO [90.90009491366273]
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms.
Specifically, we investigate the consequences of "code-level optimizations:"
Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function.
arXiv Detail & Related papers (2020-05-25T16:24:59Z)
- Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations [88.94162416324505]
A deep reinforcement learning (DRL) agent observes its states through observations, which may contain natural measurement errors or adversarial noises.
Since the observations deviate from the true states, they can mislead the agent into making suboptimal actions.
We show that naively applying existing techniques on improving robustness for classification tasks, like adversarial training, is ineffective for many RL tasks.
arXiv Detail & Related papers (2020-03-19T17:59:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.