Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation
- URL: http://arxiv.org/abs/2306.17817v2
- Date: Thu, 19 Oct 2023 19:36:31 GMT
- Title: Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation
- Authors: Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, Katerina Fragkiadaki
- Abstract summary: Act3D represents the robot's workspace using a 3D feature field with adaptive resolutions dependent on the task at hand.
It samples 3D point grids in a coarse to fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling.
- Score: 18.964403296437027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D perceptual representations are well suited for robot manipulation as they
easily encode occlusions and simplify spatial reasoning. Many manipulation
tasks require high spatial precision in end-effector pose prediction, which
typically demands high-resolution 3D feature grids that are computationally
expensive to process. As a result, most manipulation policies operate directly
in 2D, foregoing 3D inductive biases. In this paper, we introduce Act3D, a
manipulation policy transformer that represents the robot's workspace using a
3D feature field with adaptive resolutions dependent on the task at hand. The
model lifts 2D pre-trained features to 3D using sensed depth, and attends to
them to compute features for sampled 3D points. It samples 3D point grids in a
coarse to fine manner, featurizes them using relative-position attention, and
selects where to focus the next round of point sampling. In this way, it
efficiently computes 3D action maps of high spatial resolution. Act3D sets a
new state-of-the-art on RLBench, an established manipulation benchmark, where
it achieves a 10% absolute improvement over the previous SOTA 2D multi-view
policy across 74 RLBench tasks and a 22% absolute improvement, with 3x less compute,
over the previous SOTA 3D policy. We quantify the importance of relative
spatial attention, large-scale vision-language pre-trained 2D backbones, and
weight tying across coarse-to-fine attentions in ablative experiments. Code and
videos are available on our project website: https://act3d.github.io/.
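The abstract above outlines a concrete loop: lift 2D backbone features to 3D with sensed depth, then repeatedly sample a 3D point grid, featurize the points with relative-position cross-attention against the lifted scene tokens, and recenter the next, finer grid on the highest-scoring point. The sketch below illustrates that loop in PyTorch; the layer sizes, the MLP attention bias, the zero-initialized query features, and the grid/shrink schedule are illustrative assumptions, not the released architecture (see https://act3d.github.io/ for the actual implementation).

```python
# A minimal sketch, assuming a single camera view and a single attention head.
# All hyperparameters below are placeholders, not Act3D's published settings.
import torch
import torch.nn as nn


def lift_features_to_3d(feat_2d, depth, intrinsics):
    """Back-project per-pixel 2D features into 3D using sensed depth.

    feat_2d: (C, H, W) backbone features, depth: (H, W), intrinsics: (3, 3).
    Returns point coordinates (H*W, 3) and their features (H*W, C).
    """
    C, H, W = feat_2d.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth.flatten()
    x = (u.flatten() - intrinsics[0, 2]) * z / intrinsics[0, 0]
    y = (v.flatten() - intrinsics[1, 2]) * z / intrinsics[1, 1]
    return torch.stack([x, y, z], dim=-1), feat_2d.reshape(C, -1).t()


class RelativeCrossAttention(nn.Module):
    """Cross-attention whose logits are biased by relative 3D offsets, followed by
    a per-query score ('how good is this point as the action location?')."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.rel_bias = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))
        self.score = nn.Linear(dim, 1)

    def forward(self, query_xyz, query_feat, scene_xyz, scene_feat):
        q, k, v = self.q(query_feat), self.k(scene_feat), self.v(scene_feat)
        rel = query_xyz[:, None, :] - scene_xyz[None, :, :]        # (Q, N, 3) offsets
        logits = q @ k.t() / q.shape[-1] ** 0.5 + self.rel_bias(rel).squeeze(-1)
        return self.score(logits.softmax(dim=-1) @ v).squeeze(-1)  # (Q,) scores


def coarse_to_fine_point(layer, scene_xyz, scene_feat, center, radius,
                         grid=8, rounds=3, shrink=0.25):
    """Score a grid of candidate 3D points around `center`, recenter on the best
    one, shrink the search radius, and repeat (the coarse-to-fine sampling loop)."""
    dim = scene_feat.shape[-1]
    for _ in range(rounds):
        lin = torch.linspace(-radius, radius, grid)
        offsets = torch.stack(torch.meshgrid(lin, lin, lin, indexing="ij"), -1).reshape(-1, 3)
        candidates = center + offsets                              # (grid**3, 3)
        # Learned query embeddings in the real model; zeros keep the sketch short.
        scores = layer(candidates, torch.zeros(len(candidates), dim), scene_xyz, scene_feat)
        center = candidates[scores.argmax()]
        radius *= shrink
    return center
```

The weight-tying ablation mentioned in the abstract corresponds to reusing one such attention layer across every round of this coarse-to-fine loop rather than instantiating a new layer per resolution.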
Related papers
- ConDense: Consistent 2D/3D Pre-training for Dense and Sparse Features from Multi-View Images [47.682942867405224]
ConDense is a framework for 3D pre-training utilizing existing 2D networks and large-scale multi-view datasets.
We propose a novel 2D-3D joint training scheme to extract co-embedded 2D and 3D features in an end-to-end pipeline.
arXiv Detail & Related papers (2024-08-30T05:57:01Z)
- Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding [83.63231467746598]
We introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding.
We propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality.
arXiv Detail & Related papers (2024-04-11T17:59:45Z)
- OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data [15.53270401654078]
OVIR-3D is a method for open-vocabulary 3D object instance retrieval without using any 3D data for training.
This is achieved by multi-view fusion of text-aligned 2D region proposals into 3D space.
Experiments on public datasets and a real robot show the effectiveness of the method and its potential for applications in robot navigation and manipulation.
arXiv Detail & Related papers (2023-11-06T05:00:00Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space [77.6067460464962]
Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs.
We identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of projected 2D features in the ray to the 3D space, the Pose Ambiguity of the 3D convolution, and the Imbalance in the 3D convolution across different depth levels.
We devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2D feature map to the normalized device coordinates space rather than to the world space.
arXiv Detail & Related papers (2023-09-26T02:09:52Z)
- Multi-View Representation is What You Need for Point-Cloud Pre-Training [22.55455166875263]
This paper proposes a novel approach to point-cloud pre-training that learns 3D representations by leveraging pre-trained 2D networks.
We train the 3D feature extraction network with the help of a novel 2D knowledge transfer loss.
Experimental results demonstrate that our pre-trained model can be successfully transferred to various downstream tasks.
arXiv Detail & Related papers (2023-06-05T03:14:54Z)
- FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection [78.00922683083776]
It is non-trivial to adapt a general 2D detector to this 3D task.
In this technical report, we study this problem with a practice built on a fully convolutional single-stage detector.
Our solution achieves 1st place among all vision-only methods in the nuScenes 3D detection challenge held at NeurIPS 2020.
arXiv Detail & Related papers (2021-04-22T09:35:35Z)
- Learning from 2D: Pixel-to-Point Knowledge Transfer for 3D Pretraining [21.878815180924832]
We present a novel 3D pretraining method by leveraging 2D networks learned from rich 2D datasets.
Our experiments show that the 3D models pretrained with 2D knowledge boost the performances across various real-world 3D downstream tasks.
arXiv Detail & Related papers (2021-04-10T05:40:42Z)
- 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that enables us to leverage 3D features extracted from large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z)
- PLUME: Efficient 3D Object Detection from Stereo Images [95.31278688164646]
Existing methods tackle the problem in two steps: first, depth estimation is performed and a pseudo-LiDAR point cloud representation is computed from the depth estimates; then object detection is performed in 3D space (this intermediate step is sketched below).
We propose a model that unifies these two tasks in the same metric space.
Our approach achieves state-of-the-art performance on the challenging KITTI benchmark, with significantly reduced inference time compared with existing methods.
arXiv Detail & Related papers (2021-01-17T05:11:38Z)
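As context for the two-step pipeline described in the PLUME entry above, the snippet below sketches the intermediate pseudo-LiDAR step that such pipelines rely on and that PLUME is designed to avoid: back-projecting an estimated depth map into a 3D point cloud with the camera intrinsics and moving it into the detector's reference frame. The intrinsics, extrinsics, and array shapes are placeholder assumptions for illustration only.

```python
# A minimal NumPy sketch of the pseudo-LiDAR intermediate representation used by
# two-step stereo pipelines; PLUME itself skips this step and works in a single
# metric space. All shapes and camera parameters below are illustrative.
import numpy as np


def depth_to_pseudo_lidar(depth, K, cam_to_ego=np.eye(4)):
    """Back-project a depth map (H, W) into an (H*W, 3) point cloud.

    K is the 3x3 camera intrinsic matrix; cam_to_ego optionally moves the points
    from the camera frame into the frame the 3D detector expects.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    z = depth.ravel()
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z], axis=-1)                          # camera frame
    pts_h = np.concatenate([pts_cam, np.ones((len(z), 1))], axis=-1)
    return (pts_h @ cam_to_ego.T)[:, :3]                            # target frame


# Example with a dummy depth map and made-up intrinsics (illustrative values).
K = np.array([[721.5, 0.0, 609.6], [0.0, 721.5, 172.9], [0.0, 0.0, 1.0]])
cloud = depth_to_pseudo_lidar(np.full((375, 1242), 10.0), K)
print(cloud.shape)  # (465750, 3)
```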
This list is automatically generated from the titles and abstracts of the papers in this site.