3D Diffuser Actor: Policy Diffusion with 3D Scene Representations
- URL: http://arxiv.org/abs/2402.10885v3
- Date: Thu, 25 Jul 2024 14:30:22 GMT
- Title: 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations
- Authors: Tsung-Wei Ke, Nikolaos Gkanatsios, Katerina Fragkiadaki
- Abstract summary: 3D robot policies use 3D scene feature representations aggregated from a single or multiple camera views.
We present 3D Diffuser Actor, a neural policy equipped with a novel 3D denoising transformer.
It sets a new state-of-the-art on RLBench with an absolute performance gain of 18.1% over the current SOTA.
It also learns to control a robot manipulator in the real world from a handful of demonstrations.
- Score: 19.914227905704102
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Diffusion policies are conditional diffusion models that learn robot action distributions conditioned on the robot and environment state. They have recently been shown to outperform both deterministic and alternative action distribution learning formulations. 3D robot policies use 3D scene feature representations aggregated from a single or multiple camera views using sensed depth. They have been shown to generalize better than their 2D counterparts across camera viewpoints. We unify these two lines of work and present 3D Diffuser Actor, a neural policy equipped with a novel 3D denoising transformer that fuses information from the 3D visual scene, a language instruction and proprioception to predict the noise in noised 3D robot pose trajectories. 3D Diffuser Actor sets a new state-of-the-art on RLBench with an absolute performance gain of 18.1% over the current SOTA on a multi-view setup and an absolute gain of 13.1% on a single-view setup. On the CALVIN benchmark, it improves over the current SOTA by a 9% relative increase. It also learns to control a robot manipulator in the real world from a handful of demonstrations. Through thorough comparisons with the current SOTA policies and ablations of our model, we show that 3D Diffuser Actor's design choices dramatically outperform 2D representations, regression and classification objectives, absolute attentions, and holistic non-tokenized 3D scene embeddings.
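To make the noise-prediction objective in the abstract concrete, the following is a minimal hypothetical sketch of a DDPM-style training step over robot pose trajectories, written in PyTorch. It is not the authors' implementation: the `DenoisingTransformer` layout, the trajectory horizon, the pose dimensionality, and the number of diffusion steps are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of diffusion-policy training:
# noise a clean end-effector trajectory, then predict the injected noise while
# conditioning on 3D scene tokens, language tokens, and proprioception tokens.
import torch
import torch.nn as nn

T_DIFF = 100                 # number of diffusion steps (assumed)
HORIZON, POSE_DIM = 8, 9     # trajectory length and pose dims (3D position + 6D rotation), assumed

class DenoisingTransformer(nn.Module):
    """Placeholder denoiser: attends jointly over scene, language, proprioception,
    and noised-trajectory tokens, and predicts the noise added to the trajectory."""
    def __init__(self, dim=256):
        super().__init__()
        self.traj_in = nn.Linear(POSE_DIM, dim)
        self.step_in = nn.Embedding(T_DIFF, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.noise_out = nn.Linear(dim, POSE_DIM)

    def forward(self, noisy_traj, t, scene_tok, lang_tok, proprio_tok):
        traj_tok = self.traj_in(noisy_traj) + self.step_in(t)[:, None, :]
        tokens = torch.cat([scene_tok, lang_tok, proprio_tok, traj_tok], dim=1)
        out = self.blocks(tokens)
        # Read the noise prediction off the trajectory tokens (placed last in the sequence).
        return self.noise_out(out[:, -noisy_traj.shape[1]:])

def diffusion_training_step(model, betas, traj, scene_tok, lang_tok, proprio_tok):
    """One DDPM-style step: noise the clean trajectory at a random timestep,
    then regress the injected noise with an L2 loss."""
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)          # (T_DIFF,)
    b = traj.shape[0]
    t = torch.randint(0, T_DIFF, (b,))
    eps = torch.randn_like(traj)
    a_bar = alphas_bar[t].view(b, 1, 1)
    noisy_traj = a_bar.sqrt() * traj + (1.0 - a_bar).sqrt() * eps
    return nn.functional.mse_loss(model(noisy_traj, t, scene_tok, lang_tok, proprio_tok), eps)
```

At inference time, the same noise predictor would be applied iteratively to denoise a randomly initialized trajectory into an executable trajectory, conditioned on the current 3D observation, instruction, and proprioception.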
Related papers
- Next Best Sense: Guiding Vision and Touch with FisherRF for 3D Gaussian Splatting [27.45827655042124]
We propose a framework for active next-best-view and touch selection for robotic manipulators using 3D Gaussian Splatting (3DGS).
We first elevate the performance of few-shot 3DGS with a novel semantic depth alignment method.
We then extend FisherRF, a next-best-view selection method for 3DGS, to select views and touch poses based on depth uncertainty.
arXiv Detail & Related papers (2024-10-07T01:24:39Z)
- 3D-MVP: 3D Multiview Pretraining for Robotic Manipulation [53.45111493465405]
We propose 3D-MVP, a novel approach for 3D multi-view pretraining using masked autoencoders.
We leverage Robotic View Transformer (RVT), which uses a multi-view transformer to understand the 3D scene and predict pose actions.
We show promising results on a real robot platform with minimal finetuning.
arXiv Detail & Related papers (2024-06-26T08:17:59Z)
- Director3D: Real-world Camera Trajectory and 3D Scene Generation from Text [61.9973218744157]
We introduce Director3D, a robust open-world text-to-3D generation framework, designed to generate both real-world 3D scenes and adaptive camera trajectories.
Experiments demonstrate that Director3D outperforms existing methods, offering superior performance in real-world 3D generation.
arXiv Detail & Related papers (2024-06-25T14:42:51Z)
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields [57.617972778377215]
We show how to generate effective 3D representations from posed RGB images.
We pretrain this representation at scale on our proposed curated posed-RGB data, totaling over 1.8 million images.
Our novel self-supervised pretraining for NeRFs, NeRF-MAE, scales remarkably well and improves performance on various challenging 3D tasks.
arXiv Detail & Related papers (2024-04-01T17:59:55Z)
- 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations [19.41216557646392]
3D Diffusion Policy (DP3) is a novel visual imitation learning approach.
In experiments, DP3 handles most tasks with just 10 demonstrations and surpasses baselines with a 24.2% relative improvement.
In real robot experiments, DP3 rarely violates safety requirements, in contrast to baseline methods which frequently do.
arXiv Detail & Related papers (2024-03-06T18:58:49Z)
- GOEmbed: Gradient Origin Embeddings for Representation Agnostic 3D Feature Learning [67.61509647032862]
We propose GOEmbed (Gradient Origin Embeddings) that encodes input 2D images into any 3D representation.
This contrasts with typical prior approaches, in which input images are encoded using 2D features extracted from large pre-trained models, or customized features are designed to handle different 3D representations.
arXiv Detail & Related papers (2023-12-14T08:39:39Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation [18.964403296437027]
Act3D represents the robot's workspace using a 3D feature field with adaptive resolutions dependent on the task at hand.
It samples 3D point grids in a coarse-to-fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling; a hedged sketch of this coarse-to-fine loop appears after this list.
arXiv Detail & Related papers (2023-06-30T17:34:06Z)
- Indoor Semantic Scene Understanding using Multi-modality Fusion [0.0]
We present a semantic scene understanding pipeline that fuses 2D and 3D detection branches to generate a semantic map of the environment.
Unlike previous works that were evaluated on collected datasets, we test our pipeline on an active photo-realistic robotic environment.
Our novelty includes rectification of 3D proposals using projected 2D detections and modality fusion based on object size.
arXiv Detail & Related papers (2021-08-17T13:30:02Z)
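For the Act3D entry above, the coarse-to-fine point sampling it describes can be pictured with the following hypothetical sketch. The `score_fn` callable stands in for featurizing candidate points with relative-position attention against scene features, which is not reproduced here; grid size, number of levels, and the halving schedule are assumptions for illustration.

```python
# Hypothetical coarse-to-fine 3D point selection (not the Act3D code): at each level,
# sample a grid of candidate points, score them, and recenter a smaller grid on the best one.
import torch

def sample_grid(center, half_extent, n=8):
    """Regular n^3 grid of candidate 3D points centered at `center` (tensor of shape (3,))."""
    lin = torch.linspace(-half_extent, half_extent, n)
    offsets = torch.stack(torch.meshgrid(lin, lin, lin, indexing="ij"), dim=-1).reshape(-1, 3)
    return center + offsets

def coarse_to_fine_select(score_fn, center, half_extent, levels=3):
    """Zoom in over several levels: score candidates, keep the best, halve the grid extent."""
    for _ in range(levels):
        pts = sample_grid(center, half_extent)   # (n^3, 3) candidate points
        scores = score_fn(pts)                   # stand-in for relative-position attention scores
        center = pts[scores.argmax()]            # recenter on the highest-scoring candidate
        half_extent = half_extent / 2.0          # refine resolution for the next round
    return center
```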