Diverse Policy Optimization for Structured Action Space
- URL: http://arxiv.org/abs/2302.11917v1
- Date: Thu, 23 Feb 2023 10:48:09 GMT
- Title: Diverse Policy Optimization for Structured Action Space
- Authors: Wenhao Li, Baoxiang Wang, Shanchao Yang and Hongyuan Zha
- Abstract summary: We propose Diverse Policy Optimization (DPO) to model policies in structured action spaces as energy-based models (EBMs).
GFlowNet, a recently proposed and powerful generative model, is introduced as an efficient and diverse sampler for the EBM-based policies.
Experiments on ATSC and Battle benchmarks demonstrate that DPO can efficiently discover surprisingly diverse policies.
- Score: 59.361076277997704
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Enhancing the diversity of policies is beneficial for robustness,
exploration, and transfer in reinforcement learning (RL). In this paper, we aim
to seek diverse policies in an under-explored setting, namely RL tasks with
structured action spaces with the two properties of composability and local
dependencies. The complex action structure, non-uniform reward landscape, and
subtle hyperparameter tuning due to the properties of structured actions
prevent existing approaches from scaling well. We propose a simple and
effective RL method, Diverse Policy Optimization (DPO), which models the
policies in structured action spaces as energy-based models (EBMs), following
the probabilistic RL framework. GFlowNet, a recently proposed and powerful
generative model, is introduced as an efficient and diverse sampler for the
EBM-based policies. DPO follows a joint optimization framework: the outer
layer uses the diverse policies sampled by the GFlowNet to update the
EBM-based policies, which in turn supports the GFlowNet training in the inner
layer. Experiments on ATSC
and Battle benchmarks demonstrate that DPO can efficiently discover
surprisingly diverse policies in challenging scenarios and substantially
outperform existing state-of-the-art methods.
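The joint optimization can be pictured with a short sketch. Everything below is a hypothetical reconstruction from the abstract, not the authors' code: the trajectory-balance form of the inner GFlowNet loss, the one-sample soft-Bellman target in the outer loss, and all names (`SoftQNet`, `alpha`, `log_Z`) are assumptions.
```python
import torch
import torch.nn as nn

class SoftQNet(nn.Module):
    """Energy of the EBM policy: pi(a|s) proportional to exp(Q(s, a) / alpha)."""
    def __init__(self, s_dim, a_dim, hidden=64):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(s_dim + a_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))

    def forward(self, s, a):
        return self.f(torch.cat([s, a], dim=-1)).squeeze(-1)

def gflownet_inner_loss(log_Z, log_pf, log_pb, q_value, alpha=1.0):
    # Inner layer: a trajectory-balance-style objective pulls the GFlowNet
    # sampler toward the EBM density exp(Q(s, a) / alpha).
    return (log_Z + log_pf - q_value / alpha - log_pb).pow(2).mean()

def ebm_outer_loss(q_net, s, a, r, s_next, a_next, gamma=0.99):
    # Outer layer: a soft-Bellman-style regression updates the EBM energy,
    # using next actions a_next drawn by the (frozen) GFlowNet sampler.
    with torch.no_grad():
        target = r + gamma * q_net(s_next, a_next)
    return (q_net(s, a) - target).pow(2).mean()
```
Here `log_pf` and `log_pb` would be the log-probabilities of the GFlowNet's forward and backward policies along the trajectory that composes a structured action, and `log_Z` its learned log-partition estimate; `a_next` is an action drawn by the current sampler.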
Related papers
- Diffusion Policy Policy Optimization [37.04382170999901]
Diffusion Policy Policy Optimization (DPPO) is an algorithmic framework for fine-tuning diffusion-based policies.
DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks.
We show that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization.
arXiv Detail & Related papers (2024-09-01T02:47:50Z)
- Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality.
We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO)
Specifically, we introduce the Q-weighted variational loss, which is provably a tight lower bound of the policy objective in online RL under certain conditions.
We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
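A minimal sketch of the Q-weighted idea, under the assumption that the policy exposes an explicit log-likelihood; the actual QVPO method instead weights a diffusion policy's training loss, and the softmax weighting below is an illustrative choice, not the paper's.
```python
import torch

def q_weighted_variational_loss(log_prob, q_values):
    """Weight the policy's log-likelihoods of sampled actions by their Q-values.

    log_prob, q_values: tensors of shape [batch]. The softmax keeps the
    weights non-negative and normalized, so the loss remains a weighted
    likelihood regression toward high-value actions.
    """
    weights = torch.softmax(q_values.detach(), dim=0)
    return -(weights * log_prob).sum()
```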
arXiv Detail & Related papers (2024-05-25T10:45:46Z)
- DPO: Differential reinforcement learning with application to optimal configuration search [3.2857981869020327]
Reinforcement learning with continuous state and action spaces remains one of the most challenging problems within the field.
We propose the first differential RL framework that can handle settings with limited training samples and short-length episodes.
arXiv Detail & Related papers (2024-04-24T03:11:12Z)
- Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning [64.10794426777493]
Model-based reinforcement learning (RL) has demonstrated remarkable successes on a range of continuous control tasks.
Recent practices tend to distill optimized action sequences into an RL policy during the training phase.
We develop an approach to distill from model-based planning to the policy.
arXiv Detail & Related papers (2023-07-24T16:52:31Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- Robust Policy Optimization in Deep Reinforcement Learning [16.999444076456268]
In continuous action domains, a parameterized action distribution allows easy control of exploration.
In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed distribution.
We evaluated our methods on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym.
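A minimal sketch of such a perturbed distribution, assuming (as the paper describes at a high level) that uniform noise is added to the Gaussian mean; the scale `alpha` and the exact placement of the noise are assumptions here.
```python
import torch
from torch.distributions import Normal, Uniform

def rpo_action_dist(mean, std, alpha=0.5):
    # Shift the Gaussian mean by uniform noise before sampling, keeping a
    # floor on the action distribution's randomness throughout training.
    z = Uniform(-alpha, alpha).sample(mean.shape)
    return Normal(mean + z, std)

# Usage: dist = rpo_action_dist(mu, sigma); action = dist.sample()
```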
arXiv Detail & Related papers (2022-12-14T22:43:56Z)
- Towards Applicable Reinforcement Learning: Improving the Generalization and Sample Efficiency with Policy Ensemble [43.95417785185457]
It is challenging for reinforcement learning algorithms to succeed in real-world applications like financial trading and logistics systems.
We propose Ensemble Proximal Policy Optimization (EPPO), which learns ensemble policies in an end-to-end manner.
EPPO achieves higher efficiency and is robust for real-world applications compared with vanilla policy optimization algorithms and other ensemble methods.
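As a generic illustration of acting with an ensemble policy (not EPPO's exact scheme, which also regularizes diversity among sub-policies), the sub-policies' action distributions can be averaged for interaction while every head stays differentiable for end-to-end training:
```python
import torch

def ensemble_policy_logits(logits_list):
    # Average the sub-policies' action distributions for acting; each head
    # remains part of the graph, so the ensemble trains jointly end-to-end.
    probs = torch.stack([torch.softmax(l, dim=-1) for l in logits_list]).mean(0)
    return torch.log(probs + 1e-8)
```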
arXiv Detail & Related papers (2022-05-19T02:25:32Z)
- Semi-On-Policy Training for Sample Efficient Multi-Agent Policy Gradients [51.749831824106046]
We introduce semi-on-policy (SOP) training as an effective and computationally efficient way to address the sample inefficiency of on-policy policy gradient methods.
We show that our methods perform as well or better than state-of-the-art value-based methods on a variety of SMAC tasks.
arXiv Detail & Related papers (2021-04-27T19:37:01Z)
- Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it to correlated actions, and combine these critic-estimated action values to control the variance of gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
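As a simple illustration of how a discrete-action critic can control gradient variance (a generic closed-form estimator, not the paper's specific correlated-action construction): when Q(s, ·) is available for every action, the policy-gradient expectation can be computed exactly instead of sampled.
```python
import torch

def expected_policy_gradient_loss(logits, q_values):
    # logits, q_values: [batch, n_actions]. Taking the exact expectation
    # E_pi[Q(s, a)] over the discrete action set yields a zero-variance
    # gradient estimate for each state in the batch.
    probs = torch.softmax(logits, dim=-1)
    return -(probs * q_values.detach()).sum(dim=-1).mean()
```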
arXiv Detail & Related papers (2020-02-10T04:23:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.