Cyclic Policy Distillation: Sample-Efficient Sim-to-Real Reinforcement
Learning with Domain Randomization
- URL: http://arxiv.org/abs/2207.14561v2
- Date: Mon, 10 Apr 2023 07:02:05 GMT
- Title: Cyclic Policy Distillation: Sample-Efficient Sim-to-Real Reinforcement
Learning with Domain Randomization
- Authors: Yuki Kadokawa, Lingwei Zhu, Yoshihisa Tsurumine, Takamitsu Matsubara
- Abstract summary: We propose a sample-efficient method named cyclic policy distillation (CPD)
CPD divides the range of randomized parameters into several small sub-domains and assigns a local policy to each one.
All of the learned local policies are distilled into a global policy for sim-to-real transfers.
- Score: 10.789649934346004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep reinforcement learning with domain randomization learns a control policy
in various simulations with randomized physical and sensor model parameters to
become transferable to the real world in a zero-shot setting. However, a huge
number of samples are often required to learn an effective policy when the
range of randomized parameters is extensive due to the instability of policy
updates. To alleviate this problem, we propose a sample-efficient method named
cyclic policy distillation (CPD). CPD divides the range of randomized
parameters into several small sub-domains and assigns a local policy to each
one. Local policies are then learned while cyclically transitioning between
sub-domains. CPD accelerates learning through knowledge transfer based on
expected performance improvements. Finally, all of the learned local policies
are distilled into a global policy for sim-to-real transfers. CPD's
effectiveness and sample efficiency are demonstrated through simulations with
four tasks (Pendulum from OpenAI Gym and Pusher, Swimmer, and HalfCheetah from
MuJoCo), and a real-robot ball-dispersal task. We published code and videos
from our experiments at
https://github.com/yuki-kadokawa/cyclic-policy-distillation.
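The cyclic structure described in the abstract can be sketched in a few lines. This is a toy skeleton of CPD's control flow under simplifying assumptions, not the authors' implementation: here a "policy" is just a scalar, `train_local` stands in for local RL training within one sub-domain, the neighbor-seeding step stands in for CPD's knowledge transfer, and `distill` stands in for policy distillation.

```python
def cyclic_policy_distillation(param_low, param_high, n_subdomains, n_cycles,
                               train_local, distill):
    """Toy skeleton of CPD's control flow (illustrative only).

    Splits the randomized-parameter range into sub-domains, trains one
    local policy per sub-domain while cycling through them, and finally
    distills all local policies into a single global policy.
    """
    width = (param_high - param_low) / n_subdomains
    subdomains = [(param_low + i * width, param_low + (i + 1) * width)
                  for i in range(n_subdomains)]
    local_policies = [None] * n_subdomains
    for _ in range(n_cycles):
        for i, (lo, hi) in enumerate(subdomains):
            # A neighboring local policy seeds learning in this sub-domain,
            # standing in for CPD's knowledge-transfer step.
            neighbor = local_policies[i - 1]
            local_policies[i] = train_local(lo, hi, local_policies[i], neighbor)
    return distill(local_policies)

# Toy instantiation: "training" moves a scalar halfway toward the
# sub-domain midpoint; "distillation" averages the local scalars.
def train_local(lo, hi, current, neighbor):
    if current is None:
        current = neighbor if neighbor is not None else 0.0
    target = (lo + hi) / 2.0
    return current + 0.5 * (target - current)

def distill(local_policies):
    return sum(local_policies) / len(local_policies)

global_policy = cyclic_policy_distillation(
    0.0, 1.0, n_subdomains=4, n_cycles=10,
    train_local=train_local, distill=distill)
```

After enough cycles each local scalar converges to its sub-domain midpoint, so the distilled global value approaches the center of the full parameter range; the real method would instead train neural policies and distill them via supervised regression on their action distributions.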
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is typically employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- BayRnTune: Adaptive Bayesian Domain Randomization via Strategic Fine-tuning [30.753772054098526]
Domain randomization (DR) entails training a policy with randomized dynamics.
BayRnTune aims to significantly accelerate the learning process by fine-tuning from a previously learned policy.
arXiv Detail & Related papers (2023-10-16T17:32:23Z)
- Robust Visual Sim-to-Real Transfer for Robotic Manipulation [79.66851068682779]
Learning visuomotor policies in simulation is much safer and cheaper than in the real world.
However, due to discrepancies between the simulated and real data, simulator-trained policies often fail when transferred to real robots.
One common approach to bridging the visual sim-to-real domain gap is domain randomization (DR).
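The domain-randomization idea recurring in these papers can be illustrated with a minimal sketch. The parameter names and ranges below are illustrative assumptions, not taken from any of the listed papers: each episode, one set of simulator parameters is drawn so the policy trains on a distribution of dynamics rather than a single nominal model.

```python
import random

def sample_randomized_params(rng, ranges):
    """Draw one set of simulator parameters for a training episode.

    `ranges` maps each parameter name to a (low, high) interval;
    sampling a fresh set per episode is the core of domain randomization.
    """
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

rng = random.Random(0)
ranges = {"mass": (0.8, 1.2), "friction": (0.5, 1.5)}
params = sample_randomized_params(rng, ranges)
```

CPD's sub-domain construction amounts to restricting these intervals to one slice of the full range per local policy.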
arXiv Detail & Related papers (2023-07-28T05:47:24Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- Latent-Variable Advantage-Weighted Policy Optimization for Offline RL [70.01851346635637]
Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions.
In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios.
We propose to leverage latent-variable policies that can represent a broader class of policy distributions.
Our method improves average performance over the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets.
arXiv Detail & Related papers (2022-03-16T21:17:03Z)
- Dimensionality Reduction and Prioritized Exploration for Policy Search [29.310742141970394]
Black-box policy optimization is a class of reinforcement learning algorithms that explores and updates the policies at the parameter level.
We present a novel method to prioritize the exploration of effective parameters and cope with full covariance matrix updates.
Our algorithm learns faster than recent approaches and requires fewer samples to achieve state-of-the-art results.
arXiv Detail & Related papers (2022-03-09T15:17:09Z)
- Uncertainty Aware System Identification with Universal Policies [45.44896435487879]
Sim2real transfer is concerned with transferring policies trained in simulation to potentially noisy real world environments.
We propose Uncertainty-aware policy search (UncAPS), where we use Universal Policy Network (UPN) to store simulation-trained task-specific policies.
We then employ robust Bayesian optimisation to craft robust policies for the given environment by combining relevant UPN policies in a DR-like fashion.
arXiv Detail & Related papers (2022-02-11T18:27:23Z)
- Bingham Policy Parameterization for 3D Rotations in Reinforcement Learning [95.00518278458908]
We propose a new policy parameterization for representing 3D rotations during reinforcement learning.
Our proposed Bingham Policy Parameterization (BPP) models the Bingham distribution and allows for better rotation prediction.
We evaluate BPP on the rotation Wahba problem task, as well as a set of vision-based next-best pose robot manipulation tasks from RLBench.
arXiv Detail & Related papers (2022-02-08T16:09:02Z)
- Data-efficient Domain Randomization with Bayesian Optimization [34.854609756970305]
When learning policies for robot control, the required real-world data is typically prohibitively expensive to acquire.
BayRn is a black-box sim-to-real algorithm that solves tasks efficiently by adapting the domain parameter distribution.
Our results show that BayRn is able to perform sim-to-real transfer, while significantly reducing the required prior knowledge.
arXiv Detail & Related papers (2020-03-05T07:48:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.