3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
- URL: http://arxiv.org/abs/2403.03954v7
- Date: Fri, 27 Sep 2024 02:43:48 GMT
- Title: 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
- Authors: Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, Huazhe Xu
- Abstract summary: 3D Diffusion Policy (DP3) is a novel visual imitation learning approach.
In experiments across 72 simulation tasks, DP3 handles most tasks with just 10 demonstrations and surpasses baselines with a 24.2% relative improvement.
In real robot experiments, DP3 rarely violates safety requirements, in contrast to baseline methods which frequently do.
- Score: 19.41216557646392
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Imitation learning provides an efficient way to teach robots dexterous skills; however, learning complex skills robustly and generalizably usually consumes large amounts of human demonstrations. To tackle this challenging problem, we present 3D Diffusion Policy (DP3), a novel visual imitation learning approach that incorporates the power of 3D visual representations into diffusion policies, a class of conditional action generative models. The core design of DP3 is the utilization of a compact 3D visual representation, extracted from sparse point clouds with an efficient point encoder. In our experiments involving 72 simulation tasks, DP3 successfully handles most tasks with just 10 demonstrations and surpasses baselines with a 24.2% relative improvement. In 4 real robot tasks, DP3 demonstrates precise control with a high success rate of 85%, given only 40 demonstrations of each task, and shows excellent generalization abilities in diverse aspects, including space, viewpoint, appearance, and instance. Interestingly, in real robot experiments, DP3 rarely violates safety requirements, in contrast to baseline methods, which frequently do and necessitate human intervention. Our extensive evaluation highlights the critical importance of 3D representations in real-world robot learning. Videos, code, and data are available at https://3d-diffusion-policy.github.io.
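The abstract describes a simple pipeline: a sparse point cloud is compressed by a lightweight point encoder into one compact feature vector, which conditions a diffusion model that denoises action sequences. The PyTorch sketch below illustrates that idea only; the module names, network sizes, action horizon, and toy noise schedule are illustrative assumptions, not the official DP3 implementation (which is released via the project page above).

```python
# Minimal sketch of the idea in the DP3 abstract (not the official code):
# a per-point MLP + max-pooling encoder produces a compact 3D feature,
# which conditions a DDPM-style denoiser over action chunks.
import torch
import torch.nn as nn


class SimplePointEncoder(nn.Module):
    """Per-point MLP followed by max-pooling -> one compact scene feature."""

    def __init__(self, out_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) sparse point cloud
        feats = self.mlp(points)          # (B, N, out_dim)
        return feats.max(dim=1).values    # (B, out_dim), order-invariant pooling


class ConditionalDenoiser(nn.Module):
    """Predicts the noise added to an action chunk, conditioned on the 3D feature."""

    def __init__(self, action_dim: int, horizon: int, cond_dim: int = 64):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(horizon * action_dim + cond_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, horizon * action_dim),
        )

    def forward(self, noisy_actions, timestep, cond):
        # noisy_actions: (B, horizon, action_dim); timestep: (B, 1); cond: (B, cond_dim)
        x = torch.cat([noisy_actions.flatten(1), timestep, cond], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)


# One imitation-learning training step: add noise to demonstrated actions
# and regress the noise, conditioned on the point-cloud feature.
encoder, denoiser = SimplePointEncoder(), ConditionalDenoiser(action_dim=7, horizon=4)
points = torch.randn(8, 512, 3)           # batch of sparse point clouds
actions = torch.randn(8, 4, 7)            # demonstrated action chunks
t = torch.randint(0, 100, (8, 1)).float()
noise = torch.randn_like(actions)
alpha_bar = torch.exp(-0.02 * t).view(-1, 1, 1)   # toy noise schedule (assumption)
noisy = alpha_bar.sqrt() * actions + (1 - alpha_bar).sqrt() * noise
pred = denoiser(noisy, t / 100.0, encoder(points))
loss = nn.functional.mse_loss(pred, noise)
loss.backward()
```

At test time the same denoiser would be run in reverse: starting from Gaussian noise, the action chunk is iteratively refined, conditioned on the feature of the current point-cloud observation.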
Related papers
- Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction [51.49400490437258]
This work develops a method for imitating articulated object manipulation from a single monocular RGB human demonstration.
We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video.
Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion.
We evaluate 4D-DPM's 3D tracking accuracy on ground-truth-annotated 3D part trajectories and RSRD's (Robot See Robot Do) physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot.
arXiv Detail & Related papers (2024-09-26T17:57:16Z)
- Part-Guided 3D RL for Sim2Real Articulated Object Manipulation [27.422878372169805]
We propose a part-guided 3D RL framework, which can learn to manipulate articulated objects without demonstrations.
We combine the strengths of 2D segmentation and 3D RL to improve the efficiency of RL policy training.
A single versatile RL policy can be trained on multiple articulated object manipulation tasks simultaneously in simulation.
arXiv Detail & Related papers (2024-04-26T10:18:17Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations [19.914227905704102]
3D robot policies use 3D scene feature representations aggregated from a single or multiple camera views.
We present 3D diffuser actor, a neural policy equipped with a novel 3D denoising transformer.
It sets a new state-of-the-art on RLBench with an absolute performance gain of 18.1% over the current SOTA.
It also learns to control a robot manipulator in the real world from a handful of demonstrations.
arXiv Detail & Related papers (2024-02-16T18:43:02Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks for the first time, demonstrating its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- UniPAD: A Universal Pre-training Paradigm for Autonomous Driving [74.34701012543968]
We present UniPAD, a novel self-supervised learning paradigm applying 3D differentiable rendering.
UniPAD implicitly encodes 3D space, facilitating the reconstruction of continuous 3D shape structures.
Our method significantly improves lidar-, camera-, and lidar-camera-based baselines by 9.1, 7.7, and 6.9 NDS, respectively.
arXiv Detail & Related papers (2023-10-12T14:39:58Z)
- Visual Reinforcement Learning with Self-Supervised 3D Representations [15.991546692872841]
We present a unified framework for self-supervised learning of 3D representations for motor control.
Our method enjoys improved sample efficiency in simulated manipulation tasks compared to 2D representation learning methods.
arXiv Detail & Related papers (2022-10-13T17:59:55Z)
- R3M: A Universal Visual Representation for Robot Manipulation [91.55543664116209]
We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of robotic manipulation tasks.
We find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo.
arXiv Detail & Related papers (2022-03-23T17:55:09Z)
- Unsupervised Learning of Visual 3D Keypoints for Control [104.92063943162896]
Learning sensorimotor control policies from high-dimensional images crucially relies on the quality of the underlying visual representations.
We propose a framework to learn such a 3D geometric structure directly from images in an end-to-end unsupervised manner.
These discovered 3D keypoints tend to meaningfully capture robot joints as well as object movements in a consistent manner across both time and 3D space.
arXiv Detail & Related papers (2021-06-14T17:59:59Z)