Probabilistic Mixture-of-Experts for Efficient Deep Reinforcement
Learning
- URL: http://arxiv.org/abs/2104.09122v1
- Date: Mon, 19 Apr 2021 08:21:56 GMT
- Title: Probabilistic Mixture-of-Experts for Efficient Deep Reinforcement
Learning
- Authors: Jie Ren, Yewen Li, Zihan Ding, Wei Pan and Hao Dong
- Abstract summary: We show that, for tasks with non-unique optima, grasping distinguishable skills can be essential for further improving learning efficiency and performance.
We propose a probabilistic mixture-of-experts (PMOE) for multimodal policies, together with a novel gradient estimator for the non-differentiability problem.
- Score: 7.020079427649125
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep reinforcement learning (DRL) has successfully solved various problems
recently, typically with a unimodal policy representation. However, for tasks
with non-unique optima, grasping distinguishable skills can be essential for
further improving learning efficiency and performance, which may call for a
multimodal policy represented as a mixture-of-experts (MOE). To the best of our
knowledge, existing general-purpose DRL algorithms do not deploy this method as
the policy function approximator, owing to the challenge its differentiability
poses for policy learning. In this work, we propose a probabilistic
mixture-of-experts (PMOE), implemented with a Gaussian mixture model (GMM), for
multimodal policies, together with a novel gradient estimator for the
non-differentiability problem, which can be applied in generic off-policy and
on-policy DRL algorithms using stochastic policies, e.g., Soft Actor-Critic
(SAC) and Proximal Policy Optimisation (PPO). Experimental results demonstrate
the advantage of our method over unimodal policies, two different MOE methods,
and an option-framework method, with both types of DRL algorithms above, on six
MuJoCo tasks. Different gradient estimators for GMMs, such as
the reparameterisation trick (Gumbel-Softmax) and the score-ratio trick are
also compared with our method. We further empirically demonstrate the
distinguishable primitives learned with PMOE and show the benefits of our
method in terms of exploration.
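As a rough illustration of the kind of policy the abstract describes, the sketch below implements a Gaussian-mixture policy head in PyTorch and uses the Gumbel-Softmax relaxation, one of the gradient estimators the paper compares against, to make the expert selection differentiable. The paper's own PMOE gradient estimator is different and its details are not given in the abstract; all names here (GMMPolicy, n_components, etc.) are illustrative assumptions, not the authors' code.

```python
# Minimal sketch, not the authors' implementation: a Gaussian mixture model
# (GMM) policy head for continuous control, with a Gumbel-Softmax relaxation
# over the mixture weights as one possible gradient estimator for the discrete
# expert choice. The PMOE paper proposes a different, novel estimator.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GMMPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, n_components: int = 4, hidden: int = 256):
        super().__init__()
        self.n_components = n_components
        self.act_dim = act_dim
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Per-expert Gaussian parameters and the mixing-weight logits.
        self.mu = nn.Linear(hidden, n_components * act_dim)
        self.log_std = nn.Linear(hidden, n_components * act_dim)
        self.mix_logits = nn.Linear(hidden, n_components)

    def forward(self, obs: torch.Tensor, tau: float = 1.0):
        h = self.backbone(obs)
        mu = self.mu(h).view(-1, self.n_components, self.act_dim)
        log_std = self.log_std(h).view(-1, self.n_components, self.act_dim).clamp(-5.0, 2.0)
        logits = self.mix_logits(h)

        # Expert selection: Gumbel-Softmax yields a (relaxed) one-hot over experts
        # through which gradients can flow; a score-function (likelihood-ratio)
        # estimator is the usual alternative for this discrete choice.
        one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)        # (B, K)

        # Reparameterised sample from every expert, then keep the selected one.
        eps = torch.randn_like(mu)
        per_expert_action = mu + log_std.exp() * eps                  # (B, K, A)
        action = torch.einsum("bk,bka->ba", one_hot, per_expert_action)
        return torch.tanh(action), logits, mu, log_std


# Usage with hypothetical MuJoCo-like dimensions (e.g. obs_dim=17, act_dim=6):
policy = GMMPolicy(obs_dim=17, act_dim=6)
action, logits, mu, log_std = policy(torch.randn(32, 17))
print(action.shape)  # torch.Size([32, 6])
```

In an SAC- or PPO-style update, the gradient through the discrete expert choice is the crux; the abstract reports that PMOE's novel estimator is compared against both the Gumbel-Softmax and score-ratio alternatives sketched above.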
Related papers
- Equivariant Diffusion Policy [16.52810213171303]
We propose a novel diffusion policy learning method that leverages domain symmetries to obtain better sample efficiency and generalization in the denoising function.
We evaluate the method empirically on a set of 12 simulation tasks in MimicGen, and show that it obtains a success rate that is, on average, 21.9% higher than the baseline Diffusion Policy.
arXiv Detail & Related papers (2024-07-01T21:23:26Z) - Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient [26.675822002049372]
Deep Diffusion Policy Gradient (DDiffPG) is a novel actor-critic algorithm that learns multimodal policies from scratch.
DDiffPG forms a multimodal training batch and utilizes mode-specific Q-learning to mitigate the inherent greediness of the RL objective.
Our approach further allows the policy to be conditioned on mode-specific embeddings to explicitly control the learned modes.
arXiv Detail & Related papers (2024-06-02T09:32:28Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, the (hyper)policies learned at convergence are deployed only in their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Statistically Efficient Variance Reduction with Double Policy Estimation
for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z) - Towards Applicable Reinforcement Learning: Improving the Generalization
and Sample Efficiency with Policy Ensemble [43.95417785185457]
It is challenging for reinforcement learning algorithms to succeed in real-world applications such as financial trading and logistics systems.
We propose Ensemble Proximal Policy Optimization (EPPO), which learns ensemble policies in an end-to-end manner.
EPPO achieves higher efficiency and is robust for real-world applications compared with vanilla policy optimization algorithms and other ensemble methods.
arXiv Detail & Related papers (2022-05-19T02:25:32Z) - Direct Random Search for Fine Tuning of Deep Reinforcement Learning
Policies [5.543220407902113]
We show that a direct random search is very effective at fine-tuning DRL policies by directly optimizing them using deterministic rollouts.
Our results show that this method yields more consistent and higher performing agents on the environments we tested.
arXiv Detail & Related papers (2021-09-12T20:12:46Z) - Semi-On-Policy Training for Sample Efficient Multi-Agent Policy
Gradients [51.749831824106046]
We introduce semi-on-policy (SOP) training as an effective and computationally efficient way to address the sample inefficiency of on-policy policy gradient methods.
We show that our methods perform as well or better than state-of-the-art value-based methods on a variety of SMAC tasks.
arXiv Detail & Related papers (2021-04-27T19:37:01Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based ZO algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z) - Imitation Learning from MPC for Quadrupedal Multi-Gait Control [63.617157490920505]
We present a learning algorithm for training a single policy that imitates multiple gaits of a walking robot.
We use and extend MPC-Net, which is an Imitation Learning approach guided by Model Predictive Control.
We validate our approach on hardware and show that a single learned policy can replace its teacher to control multiple gaits.
arXiv Detail & Related papers (2021-03-26T08:48:53Z) - Variational Policy Propagation for Multi-agent Reinforcement Learning [68.26579560607597]
We propose a collaborative multi-agent reinforcement learning algorithm named variational policy propagation (VPP) to learn a joint policy through the interactions over agents.
We prove that the joint policy is a Markov Random Field under some mild conditions, which in turn reduces the policy space effectively.
We integrate variational inference as special differentiable layers in the policy, such that actions can be efficiently sampled from the Markov Random Field and the overall policy is differentiable.
arXiv Detail & Related papers (2020-04-19T15:42:55Z)