3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
- URL: http://arxiv.org/abs/2403.03954v6
- Date: Sat, 8 Jun 2024 06:17:48 GMT
- Title: 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
- Authors: Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, Huazhe Xu
- Abstract summary: 3D Diffusion Policy (DP3) is a novel visual imitation learning approach.
In experiments, DP3 handles most tasks with just 10 demonstrations and surpasses baselines with a 24.2% relative improvement.
In real robot experiments, DP3 rarely violates safety requirements, in contrast to baseline methods which frequently do.
- Score: 19.41216557646392
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Imitation learning provides an efficient way to teach robots dexterous skills; however, learning complex skills robustly and generalizably usually consumes large amounts of human demonstrations. To tackle this challenging problem, we present 3D Diffusion Policy (DP3), a novel visual imitation learning approach that incorporates the power of 3D visual representations into diffusion policies, a class of conditional action generative models. The core design of DP3 is the utilization of a compact 3D visual representation, extracted from sparse point clouds with an efficient point encoder. In our experiments involving 72 simulation tasks, DP3 successfully handles most tasks with just 10 demonstrations and surpasses baselines with a 24.2% relative improvement. In 4 real robot tasks, DP3 demonstrates precise control with a high success rate of 85%, given only 40 demonstrations of each task, and shows excellent generalization abilities in diverse aspects, including space, viewpoint, appearance, and instance. Interestingly, in real robot experiments, DP3 rarely violates safety requirements, in contrast to baseline methods which frequently do, necessitating human intervention. Our extensive evaluation highlights the critical importance of 3D representations in real-world robot learning. Videos, code, and data are available at https://3d-diffusion-policy.github.io.
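The abstract describes a compact 3D representation extracted from sparse point clouds by an efficient point encoder, but gives no architectural details. The following is only a minimal sketch of one common way such an encoder can be built: a small per-point MLP followed by max-pooling over points. This design is an assumption chosen for illustration, not the authors' implementation; the function names, layer sizes, and random weights are all hypothetical.

```python
import random

def linear(x, weights, bias):
    """Apply y = W x + b to one vector, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def encode_point_cloud(points, w1, b1, w2, b2):
    """Map a list of (x, y, z) points to one fixed-size feature vector."""
    per_point = []
    for p in points:
        hidden = [max(h, 0.0) for h in linear(p, w1, b1)]  # per-point layer + ReLU
        per_point.append(linear(hidden, w2, b2))           # per-point features
    # Max-pool each feature dimension over all points, so the result
    # does not depend on the order in which points are listed.
    return [max(column) for column in zip(*per_point)]

def random_matrix(rng, rows, cols):
    """Hypothetical stand-in for learned weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
HIDDEN, FEATURE = 16, 8  # hypothetical sizes, not taken from the paper
w1, b1 = random_matrix(rng, HIDDEN, 3), [0.0] * HIDDEN
w2, b2 = random_matrix(rng, FEATURE, HIDDEN), [0.0] * FEATURE

# A sparse cloud of 256 random points stands in for real sensor data.
cloud = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(256)]
feature = encode_point_cloud(cloud, w1, b1, w2, b2)
```

In a setup like the one the abstract outlines, a vector such as `feature` would serve as the conditioning input to the action-generating diffusion model. Max-pooling is used here because it yields the same output for any ordering of the input points.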
Related papers
- Part-Guided 3D RL for Sim2Real Articulated Object Manipulation [27.422878372169805]
We propose a part-guided 3D RL framework, which can learn to manipulate articulated objects without demonstrations.
We combine the strengths of 2D segmentation and 3D RL to improve the efficiency of RL policy training.
A single versatile RL policy can be trained on multiple articulated object manipulation tasks simultaneously in simulation.
arXiv Detail & Related papers (2024-04-26T10:18:17Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations [19.914227905704102]
3D robot policies use 3D scene feature representations aggregated from a single or multiple camera views.
We present 3D diffuser actor, a neural policy equipped with a novel 3D denoising transformer.
It sets a new state-of-the-art on RLBench with an absolute performance gain of 18.1% over the current SOTA.
It also learns to control a robot manipulator in the real world from a handful of demonstrations.
arXiv Detail & Related papers (2024-02-16T18:43:02Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- UniPAD: A Universal Pre-training Paradigm for Autonomous Driving [74.34701012543968]
We present UniPAD, a novel self-supervised learning paradigm applying 3D differentiable rendering.
UniPAD implicitly encodes 3D space, facilitating the reconstruction of continuous 3D shape structures.
Our method significantly improves lidar-, camera-, and lidar-camera-based baselines by 9.1, 7.7, and 6.9 NDS, respectively.
arXiv Detail & Related papers (2023-10-12T14:39:58Z)
- Visual Reinforcement Learning with Self-Supervised 3D Representations [15.991546692872841]
We present a unified framework for self-supervised learning of 3D representations for motor control.
Our method enjoys improved sample efficiency in simulated manipulation tasks compared to 2D representation learning methods.
arXiv Detail & Related papers (2022-10-13T17:59:55Z)
- R3M: A Universal Visual Representation for Robot Manipulation [91.55543664116209]
We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of robotic manipulation tasks.
We find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo.
arXiv Detail & Related papers (2022-03-23T17:55:09Z)
- Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds [96.9027094562957]
We introduce a spatio-temporal representation learning (STRL) framework, capable of learning from unlabeled tasks.
Inspired by how infants learn from visual data in the wild, we explore rich cues derived from the 3D data.
STRL takes two temporally related frames from a 3D point cloud sequence as input, transforms them with spatial data augmentation, and learns an invariant representation in a self-supervised manner.
arXiv Detail & Related papers (2021-09-01T04:17:11Z)
- Unsupervised Learning of Visual 3D Keypoints for Control [104.92063943162896]
Learning sensorimotor control policies from high-dimensional images crucially relies on the quality of the underlying visual representations.
We propose a framework to learn such a 3D geometric structure directly from images in an end-to-end unsupervised manner.
These discovered 3D keypoints tend to meaningfully capture robot joints as well as object movements in a consistent manner across both time and 3D space.
arXiv Detail & Related papers (2021-06-14T17:59:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.