Cyclic Policy Distillation: Sample-Efficient Sim-to-Real Reinforcement Learning with Domain Randomization
- URL: http://arxiv.org/abs/2207.14561v2
- Date: Mon, 10 Apr 2023 07:02:05 GMT
- Title: Cyclic Policy Distillation: Sample-Efficient Sim-to-Real Reinforcement Learning with Domain Randomization
- Authors: Yuki Kadokawa, Lingwei Zhu, Yoshihisa Tsurumine, Takamitsu Matsubara
- Abstract summary: We propose a sample-efficient method named cyclic policy distillation (CPD).
CPD divides the range of randomized parameters into several small sub-domains and assigns a local policy to each one.
All of the learned local policies are distilled into a global policy for sim-to-real transfers.
- Score: 10.789649934346004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep reinforcement learning with domain randomization learns a control policy
in various simulations with randomized physical and sensor model parameters to
become transferable to the real world in a zero-shot setting. However, a huge
number of samples are often required to learn an effective policy when the
range of randomized parameters is extensive due to the instability of policy
updates. To alleviate this problem, we propose a sample-efficient method named
cyclic policy distillation (CPD). CPD divides the range of randomized
parameters into several small sub-domains and assigns a local policy to each
one. Then local policies are learned while cyclically transitioning to
sub-domains. CPD accelerates learning through knowledge transfer based on
expected performance improvements. Finally, all of the learned local policies
are distilled into a global policy for sim-to-real transfers. CPD's
effectiveness and sample efficiency are demonstrated through simulations with
four tasks (Pendulum from OpenAI Gym and Pusher, Swimmer, and HalfCheetah from
MuJoCo), and a real-robot ball-dispersal task. We published code and videos
from our experiments at
https://github.com/yuki-kadokawa/cyclic-policy-distillation.
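The loop below is a minimal structural sketch of CPD as described in the abstract: the randomization range is split into sub-domains, a local policy is learned per sub-domain while cycling through them with knowledge transfer from the neighboring policy, and all local policies are finally distilled into one global policy. It is an illustration only, not the authors' implementation (see the linked repository for that); the environment, policy objects, and update routines are hypothetical stubs.

```python
"""Structural sketch of cyclic policy distillation (CPD) -- illustrative only."""
import numpy as np

N_SUBDOMAINS = 4          # number of sub-domains the randomization range is split into (assumed)
CYCLES = 10               # how many passes over all sub-domains (assumed)
PARAM_RANGE = (0.5, 1.5)  # full range of the randomized physical parameter (assumed)


def make_subdomains(low, high, n):
    """Divide the full randomization range into n contiguous sub-domains."""
    edges = np.linspace(low, high, n + 1)
    return [(edges[i], edges[i + 1]) for i in range(n)]


def learn_local_policy(policy, subdomain, neighbor_policy=None):
    """Placeholder for RL updates inside one sub-domain.

    CPD additionally transfers knowledge from the neighboring local policy
    when the transfer is expected to improve performance; here that is only
    indicated by the `neighbor_policy` argument.
    """
    low, high = subdomain
    param = np.random.uniform(low, high)  # domain randomization within this sub-domain
    # ... run episodes with `param`, update `policy`, optionally distill
    # from `neighbor_policy` when the expected improvement is positive ...
    return policy


def distill(local_policies):
    """Placeholder: merge all local policies into one global policy."""
    # In practice this is supervised distillation of the local policies'
    # action distributions into a single network.
    return {"global": local_policies}


subdomains = make_subdomains(*PARAM_RANGE, N_SUBDOMAINS)
local_policies = [object() for _ in subdomains]  # hypothetical policy objects

for cycle in range(CYCLES):
    for i, sub in enumerate(subdomains):      # cyclic transition over sub-domains
        neighbor = local_policies[i - 1]      # cyclic neighbor updated on the previous step
        local_policies[i] = learn_local_policy(local_policies[i], sub, neighbor)

global_policy = distill(local_policies)       # used for zero-shot sim-to-real transfer
```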
Related papers
- Autonomous Vehicle Controllers From End-to-End Differentiable Simulation [60.05963742334746]
We propose a differentiable simulator and design an analytic policy gradients (APG) approach to training AV controllers.
Our proposed framework brings the differentiable simulator into an end-to-end training loop, where gradients of environment dynamics serve as a useful prior to help the agent learn a more grounded policy.
We find significant improvements in performance and robustness to noise in the dynamics, as well as overall more intuitive human-like handling.
arXiv Detail & Related papers (2024-09-12T11:50:06Z)
- Diffusion Policy Policy Optimization [37.04382170999901]
Diffusion Policy Policy Optimization (DPPO) is an algorithmic framework for fine-tuning diffusion-based policies.
DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks.
We show that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization.
arXiv Detail & Related papers (2024-09-01T02:47:50Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- BayRnTune: Adaptive Bayesian Domain Randomization via Strategic Fine-tuning [30.753772054098526]
Domain randomization (DR) entails training a policy with randomized dynamics (a minimal DR sketch follows this list).
BayRnTune aims to significantly accelerate learning by fine-tuning from a previously learned policy.
arXiv Detail & Related papers (2023-10-16T17:32:23Z)
- Robust Visual Sim-to-Real Transfer for Robotic Manipulation [79.66851068682779]
Learning visuomotor policies in simulation is much safer and cheaper than in the real world.
However, due to discrepancies between the simulated and real data, simulator-trained policies often fail when transferred to real robots.
One common approach to bridge the visual sim-to-real domain gap is domain randomization (DR).
arXiv Detail & Related papers (2023-07-28T05:47:24Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- Uncertainty Aware System Identification with Universal Policies [45.44896435487879]
Sim2real transfer is concerned with transferring policies trained in simulation to potentially noisy real world environments.
We propose Uncertainty-aware policy search (UncAPS), where we use a Universal Policy Network (UPN) to store simulation-trained, task-specific policies.
We then employ robust Bayesian optimisation to craft robust policies for the given environment by combining relevant UPN policies in a DR-like fashion.
arXiv Detail & Related papers (2022-02-11T18:27:23Z)
- Bingham Policy Parameterization for 3D Rotations in Reinforcement Learning [95.00518278458908]
We propose a new policy parameterization for representing 3D rotations during reinforcement learning.
Our proposed Bingham policy parameterization (BPP) models the Bingham distribution and allows for better rotation prediction.
We evaluate BPP on the rotation Wahba problem task, as well as a set of vision-based next-best pose robot manipulation tasks from RLBench.
arXiv Detail & Related papers (2022-02-08T16:09:02Z)
- Data-efficient Domain Randomization with Bayesian Optimization [34.854609756970305]
When learning policies for robot control, the required real-world data is typically prohibitively expensive to acquire.
BayRn is a black-box sim-to-real algorithm that solves tasks efficiently by adapting the domain parameter distribution.
Our results show that BayRn is able to perform sim-to-real transfer, while significantly reducing the required prior knowledge.
arXiv Detail & Related papers (2020-03-05T07:48:31Z)
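Several of the papers above (BayRnTune, BayRn, Robust Visual Sim-to-Real Transfer), as well as CPD itself, build on domain randomization. As referenced in the BayRnTune entry, the sketch below shows plain uniform domain randomization for context; the parameter ranges, simulator, and update routine are hypothetical placeholders, and adaptive variants such as BayRn/BayRnTune replace the fixed uniform sampling with a distribution tuned from real-world feedback.

```python
"""Minimal sketch of uniform domain randomization (DR) -- illustrative only."""
import random

# Assumed randomization ranges for a few physical/sensor parameters.
PARAM_RANGES = {
    "mass": (0.5, 1.5),
    "friction": (0.2, 1.0),
    "sensor_noise": (0.0, 0.05),
}


def sample_domain_params():
    """Draw one set of physical/sensor parameters uniformly at random."""
    return {name: random.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}


def train_one_episode(policy, domain_params):
    """Placeholder: build a simulator with `domain_params` and update `policy`."""
    # env = make_sim_env(**domain_params)   # hypothetical constructor
    # ... roll out the policy, compute returns, apply a policy-gradient step ...
    return policy


policy = {}  # hypothetical policy parameters
for episode in range(1000):
    params = sample_domain_params()          # new physics every episode
    policy = train_one_episode(policy, params)
# The resulting policy is deployed zero-shot on the real system.
```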
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.