Robust Policy Optimization in Deep Reinforcement Learning
- URL: http://arxiv.org/abs/2212.07536v1
- Date: Wed, 14 Dec 2022 22:43:56 GMT
- Title: Robust Policy Optimization in Deep Reinforcement Learning
- Authors: Md Masudur Rahman and Yexiang Xue
- Abstract summary: In continuous action domains, a parameterized action distribution allows easy control of exploration.
In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed distribution.
We evaluated our methods on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym.
- Score: 16.999444076456268
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The policy gradient method enjoys the simplicity of the objective where the
agent optimizes the cumulative reward directly. Moreover, in the continuous
action domain, a parameterized action distribution allows easy control of
exploration through the variance of that distribution. Entropy can play an
essential role in policy optimization by
selecting the stochastic policy, which eventually helps better explore the
environment in reinforcement learning (RL). However, the stochasticity often
reduces as the training progresses; thus, the policy becomes less exploratory.
Additionally, certain parametric distributions might only work for some
environments and require extensive hyperparameter tuning. This paper aims to
mitigate these issues. In particular, we propose an algorithm called Robust
Policy Optimization (RPO), which leverages a perturbed distribution. We
hypothesize that our method encourages high-entropy actions and provides a way
to represent the action space better. We further provide empirical evidence to
verify our hypothesis. We evaluated our methods on various continuous control
tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym. We observed
that in many settings, RPO increases the policy entropy early in training and
then maintains a certain level of entropy throughout the training period.
Eventually, our agent RPO shows consistently improved performance compared to
PPO and other techniques: entropy regularization, different distributions, and
data augmentation. Furthermore, in several settings, our method stays robust in
performance, while other baseline mechanisms fail to improve and even worsen
the performance.
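The abstract says only that RPO "leverages a perturbed distribution" and does not give its form. The sketch below is one illustrative reading, not the authors' implementation: it assumes a Gaussian policy whose mean is shifted by uniform noise in [-alpha, alpha] before sampling. The helper name `sample_perturbed_gaussian` and the values of `std` and `alpha` are hypothetical. Under this assumption the marginal action variance becomes std^2 + alpha^2 / 3, so the spread of actions has a floor even when the policy's own `std` shrinks during training.

```python
import numpy as np


def sample_perturbed_gaussian(mean, std, alpha, rng, n_samples):
    """Draw actions from a Gaussian whose mean is shifted by uniform noise.

    Hypothetical helper for illustration only. The uniform perturbation
    is an assumption; the abstract does not specify the form.
    """
    mean = np.asarray(mean, dtype=float)
    shape = (n_samples,) + mean.shape
    # Uniform noise in [-alpha, alpha]; alpha = 0 recovers the plain Gaussian.
    noise = alpha * rng.uniform(-1.0, 1.0, size=shape)
    return rng.normal(loc=mean + noise, scale=std)


rng = np.random.default_rng(0)
std, alpha = 0.1, 0.5  # illustrative values, not taken from the paper

plain = sample_perturbed_gaussian([0.0], std, 0.0, rng, 100_000)
perturbed = sample_perturbed_gaussian([0.0], std, alpha, rng, 100_000)

# Analytic marginal standard deviation under the stated assumption.
expected_std = np.sqrt(std**2 + alpha**2 / 3.0)
print("plain std:    ", plain.std())
print("perturbed std:", perturbed.std())
print("analytic std: ", expected_std)
```

Comparing the two empirical standard deviations shows how a perturbation of this kind keeps actions spread out independently of the learned variance, which is consistent with the abstract's observation that entropy is maintained throughout training.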
Related papers
- ExO-PPO: an Extended Off-policy Proximal Policy Optimization Algorithm [2.6813717321945103]
We propose a new PPO variant based on the stability guarantee from conservative on-policy iteration with a more efficient off-policy data utilization.
Compared with PPO and some other state-of-the-art variants, we demonstrate an improved performance of ExO-PPO with balanced sample efficiency and stability on varied tasks.
arXiv Detail & Related papers (2026-02-10T12:29:57Z) - SEMDICE: Off-policy State Entropy Maximization via Stationary Distribution Correction Estimation [54.537828696303286]
In unsupervised pre-training for reinforcement learning, the agent aims to learn a prior policy for downstream tasks without relying on task-specific reward functions.
We focus on state entropy maximization (SEM), where the goal is to learn a policy that maximizes the entropy of the state stationary distribution.
We introduce SEMDICE, a principled off-policy algorithm that computes an SEM policy from an arbitrary off-policy dataset.
arXiv Detail & Related papers (2025-12-10T19:50:21Z) - Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning [49.92803982100042]
We propose using the entropy ratio between the current and previous policies as a new global metric.
We introduce an Entropy Ratio Clipping (ERC) mechanism that imposes bidirectional constraints on the entropy ratio.
This stabilizes policy updates at the global distribution level and compensates for the inability of PPO-clip to regulate probability shifts of un-sampled actions.
arXiv Detail & Related papers (2025-12-05T10:26:32Z) - Polychromic Objectives for Reinforcement Learning [63.37185057794815]
Reinforcement learning fine-tuning (RLFT) is a dominant paradigm for improving pretrained policies for downstream tasks.
We introduce an objective for policy methods that explicitly enforces the exploration and refinement of diverse generations.
We show how proximal policy optimization (PPO) can be adapted to optimize this objective.
arXiv Detail & Related papers (2025-09-29T19:32:11Z) - Survival of the Fittest: Evolutionary Adaptation of Policies for Environmental Shifts [0.15889427269227555]
We develop an adaptive re-training algorithm inspired by evolutionary game theory (EGT).
ERPO shows faster policy adaptation, higher average rewards, and reduced computational costs in policy adaptation.
arXiv Detail & Related papers (2024-10-22T09:29:53Z) - Diffusion Policy Policy Optimization [37.04382170999901]
Diffusion Policy Policy Optimization (DPPO) is an algorithmic framework for fine-tuning diffusion-based policies.
DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks.
We show that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization.
arXiv Detail & Related papers (2024-09-01T02:47:50Z) - Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality.
We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO).
Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions.
We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z) - Adversarial Policy Optimization in Deep Reinforcement Learning [16.999444076456268]
A policy represented by a deep neural network can overfit, which hampers a reinforcement learning agent from learning an effective policy.
Data augmentation can provide a performance boost to RL agents by mitigating the effect of overfitting.
We propose a novel RL algorithm to mitigate the above issue and improve the efficiency of the learned policy.
arXiv Detail & Related papers (2023-04-27T21:01:08Z) - Entropy Augmented Reinforcement Learning [0.0]
We propose a shifted Markov decision process (MDP) to encourage exploration and reinforce the ability to escape from suboptima.
Our experiments test augmented TRPO and PPO on MuJoCo benchmark tasks, giving an indication that the agent is encouraged towards higher-reward regions.
arXiv Detail & Related papers (2022-08-19T13:09:32Z) - Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z) - Iterative Amortized Policy Optimization [147.63129234446197]
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control.
From the variational inference perspective, policy networks are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly.
We demonstrate that iterative amortized policy optimization yields performance improvements over direct amortization on benchmark continuous control tasks.
arXiv Detail & Related papers (2020-10-20T23:25:42Z) - Implicit Distributional Reinforcement Learning [61.166030238490634]
The implicit distributional actor-critic (IDAC) consists of a distributional critic built on two deep generator networks (DGNs)
and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.