Action Robust Reinforcement Learning via Optimal Adversary Aware Policy Optimization
- URL: http://arxiv.org/abs/2507.03372v1
- Date: Fri, 04 Jul 2025 08:11:15 GMT
- Title: Action Robust Reinforcement Learning via Optimal Adversary Aware Policy Optimization
- Authors: Buqing Nie, Yangqing Fu, Jingtian Ji, Yue Gao
- Abstract summary: Reinforcement Learning (RL) has achieved remarkable success in sequential decision tasks. Recent studies have revealed the vulnerability of RL policies to different perturbations. We introduce a novel framework called Optimal Adversary-aware Policy Iteration (OA-PI).
- Score: 7.680706215758518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning (RL) has achieved remarkable success in sequential decision tasks. However, recent studies have revealed the vulnerability of RL policies to different perturbations, raising concerns about their effectiveness and safety in real-world applications. In this work, we focus on the robustness of RL policies against action perturbations and introduce a novel framework called Optimal Adversary-aware Policy Iteration (OA-PI). Our framework enhances action robustness under various perturbations by evaluating and improving policy performance against the corresponding optimal adversaries. Moreover, our approach can be integrated into mainstream DRL algorithms such as Twin Delayed DDPG (TD3) and Proximal Policy Optimization (PPO), improving action robustness effectively while maintaining nominal performance and sample efficiency. Experimental results across various environments demonstrate that our method effectively enhances the robustness of DRL policies against different action adversaries.
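To make the evaluate-and-improve-against-an-optimal-adversary loop concrete, below is a minimal, self-contained sketch on a random tabular MDP. It is not the paper's deep RL implementation: it assumes a probabilistic perturbation model in which, with probability `ALPHA`, an adversary overrides the agent's action with the value-minimizing one, and it alternates adversary-aware policy evaluation with greedy policy improvement. All names, constants, and the toy dynamics are illustrative assumptions.

```python
# Minimal sketch (assumption, not the authors' code) of adversary-aware policy iteration
# on a toy tabular MDP under a probabilistic action-perturbation model.
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS, GAMMA, ALPHA = 8, 4, 0.95, 0.1  # ALPHA: perturbation probability

# Toy random dynamics and rewards, purely for illustration.
P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))  # P[s, a] -> next-state dist
R = rng.normal(size=(N_STATES, N_ACTIONS))                        # R[s, a] -> reward

def robust_policy_evaluation(pi, n_iters=500):
    """Evaluate pi against the optimal adversary: whenever the perturbation triggers,
    the adversary replaces the agent's action with the value-minimizing action."""
    V = np.zeros(N_STATES)
    for _ in range(n_iters):
        Q = R + GAMMA * P @ V                    # Q[s, a] under the current value estimate
        v_pi = (pi * Q).sum(axis=1)              # value of the agent's own action choice
        v_adv = Q.min(axis=1)                    # value under the worst-case override
        V = (1 - ALPHA) * v_pi + ALPHA * v_adv   # mix according to the perturbation probability
    return V

def robust_policy_iteration(n_rounds=20):
    """Alternate adversary-aware evaluation with greedy improvement."""
    pi = np.full((N_STATES, N_ACTIONS), 1.0 / N_ACTIONS)
    for _ in range(n_rounds):
        V = robust_policy_evaluation(pi)
        Q = R + GAMMA * P @ V
        pi = np.eye(N_ACTIONS)[Q.argmax(axis=1)]  # deterministic greedy policy
    return pi, robust_policy_evaluation(pi)

if __name__ == "__main__":
    pi, V = robust_policy_iteration()
    print("robust value per state:", np.round(V, 3))
```

In the paper's setting, both the adversary and the policy updates are realized with deep RL machinery (e.g., TD3 or PPO) rather than exact minimization over a tabular Q as in this toy.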
Related papers
- Enhanced Penalty-based Bidirectional Reinforcement Learning Algorithms [4.197448156583907]
We introduce a bidirectional learning approach that enables agents to learn from both initial and terminal states. Our proposed Penalty-Based Bidirectional methodology is tested on the ManiSkill benchmark environments. The findings indicate that this integrated strategy enhances policy learning, adaptability, and overall performance in challenging scenarios.
arXiv Detail & Related papers (2025-04-04T04:43:07Z) - Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality. We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO). Specifically, we introduce the Q-weighted variational loss, which is provably a tight lower bound of the policy objective in online RL under certain conditions (a toy sketch of this weighting idea appears after this list). We also develop an efficient behavior policy that enhances sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z) - Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO, derived from the optimal solution of the problem, leads to a compromised mean-seeking approximation of the optimal solution in practice.
We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z) - Absolute Policy Optimization [10.503953488401926]
We introduce a novel objective function whose optimization leads to guaranteed monotonic improvement in the lower probability bound of performance with high confidence.
Our experiments demonstrate the effectiveness of our approach across challenging continuous control benchmark tasks and extend its applicability to mastering Atari games.
arXiv Detail & Related papers (2023-10-20T02:40:05Z) - Efficient Action Robust Reinforcement Learning with Probabilistic Policy Execution Uncertainty [43.55450683502937]
In this paper, we focus on action robust RL with probabilistic policy execution uncertainty.
We establish the existence of an optimal policy for action robust MDPs with probabilistic policy execution uncertainty.
We also develop the Action Robust Reinforcement Learning with Certificates (ARRLC) algorithm, which achieves minimax-optimal regret and sample complexity.
arXiv Detail & Related papers (2023-07-15T00:26:51Z) - Diverse Policy Optimization for Structured Action Space [59.361076277997704]
We propose Diverse Policy Optimization (DPO) to model policies in a structured action space as energy-based models (EBMs).
A novel and powerful generative model, GFlowNet, is introduced as an efficient, diverse EBM-based policy sampler.
Experiments on ATSC and Battle benchmarks demonstrate that DPO can efficiently discover surprisingly diverse policies.
arXiv Detail & Related papers (2023-02-23T10:48:09Z) - Evolutionary Action Selection for Gradient-based Policy Learning [6.282299638495976]
Evolutionary algorithms (EAs) and Deep Reinforcement Learning (DRL) have recently been combined to integrate the advantages of the two solutions for better policy learning.
We propose Evolutionary Action Selection-Twin Delayed Deep Deterministic Policy Gradient (EAS-TD3), a novel combination of EA and DRL.
arXiv Detail & Related papers (2022-01-12T03:31:21Z) - Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z) - Optimistic Distributionally Robust Policy Optimization [2.345728642535161]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are prone to converge to a sub-optimal solution as they limit policy representation to a particular parametric distribution class.
We develop an innovative Optimistic Distributionally Robust Policy Optimization (ODRO) algorithm to solve the trust region constrained optimization problem without parameterizing the policies.
Our algorithm improves on TRPO and PPO with higher sample efficiency and better final-policy performance while maintaining learning stability.
arXiv Detail & Related papers (2020-06-14T06:36:18Z) - Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations [88.94162416324505]
A deep reinforcement learning (DRL) agent observes its states through observations, which may contain natural measurement errors or adversarial noises.
Since the observations deviate from the true states, they can mislead the agent into making suboptimal actions.
We show that naively applying existing techniques on improving robustness for classification tasks, like adversarial training, is ineffective for many RL tasks.
arXiv Detail & Related papers (2020-03-19T17:59:59Z) - Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z) - Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it to correlated actions, and combine these critic-estimated action values to control the variance of gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
arXiv Detail & Related papers (2020-02-10T04:23:09Z)
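As referenced in the QVPO entry above, here is a toy, self-contained sketch of the Q-weighted loss idea: sampled actions are re-weighted by a positive transform of their critic values so that high-value samples dominate a maximum-likelihood policy update. A diagonal Gaussian stands in for the diffusion policy, and the clipping-based weight transform, shapes, and names are illustrative assumptions rather than the paper's implementation.

```python
# Toy sketch (assumption, not the QVPO paper's code) of a Q-weighted policy loss.
import torch

def q_weighted_policy_loss(policy_dist, actions, q_values):
    weights = torch.clamp(q_values, min=0.0)           # keep only positively valued samples
    weights = weights / (weights.sum() + 1e-8)         # normalize into a sample weighting
    log_prob = policy_dist.log_prob(actions).sum(-1)   # log-likelihood of each sampled action
    return -(weights.detach() * log_prob).mean()       # weighted maximum-likelihood objective

if __name__ == "__main__":
    batch, act_dim = 64, 6
    mean = torch.zeros(batch, act_dim, requires_grad=True)
    dist = torch.distributions.Normal(mean, torch.ones(batch, act_dim))  # Gaussian stand-in
    actions = dist.sample()
    q_values = torch.randn(batch)                       # stand-in critic estimates
    loss = q_weighted_policy_loss(dist, actions, q_values)
    loss.backward()
    print(float(loss))
```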