RVT: Robotic View Transformer for 3D Object Manipulation
- URL: http://arxiv.org/abs/2306.14896v1
- Date: Mon, 26 Jun 2023 17:59:31 GMT
- Title: RVT: Robotic View Transformer for 3D Object Manipulation
- Authors: Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, Dieter Fox
- Abstract summary: We propose RVT, a multi-view transformer for 3D manipulation that is both scalable and accurate.
A single RVT model works well across 18 RLBench tasks with 249 task variations, achieving 26% higher relative success than the existing state-of-the-art method (PerAct).
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For 3D object manipulation, methods that build an explicit 3D representation
perform better than those relying only on camera images. But using explicit 3D
representations like voxels comes at large computing cost, adversely affecting
scalability. In this work, we propose RVT, a multi-view transformer for 3D
manipulation that is both scalable and accurate. Some key features of RVT are
an attention mechanism to aggregate information across views and re-rendering
of the camera input from virtual views around the robot workspace. In
simulations, we find that a single RVT model works well across 18 RLBench tasks
with 249 task variations, achieving 26% higher relative success than the
existing state-of-the-art method (PerAct). It also trains 36X faster than
PerAct for achieving the same performance and achieves 2.3X the inference speed
of PerAct. Further, RVT can perform a variety of manipulation tasks in the real
world with just a few ($\sim$10) demonstrations per task. Visual results, code,
and trained model are provided at https://robotic-view-transformer.github.io/.
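The re-rendering idea in the abstract can be illustrated with a minimal sketch: splat a colored point cloud onto the image planes of a few virtual cameras placed around the workspace. The function names, image size, and the orthographic-projection simplification below are illustrative assumptions, not RVT's actual renderer, which the abstract describes only at a high level.

```python
import numpy as np

def render_virtual_view(points, colors, rotation, img_size=32, scale=16.0):
    """Orthographically splat a colored point cloud onto one virtual view.

    points:   (N, 3) workspace points, assumed centered at the origin
    colors:   (N, 3) per-point RGB
    rotation: (3, 3) world-to-view rotation of the virtual camera
    """
    cam_pts = points @ rotation.T                     # rotate into the view frame
    px = (cam_pts[:, 0] * scale + img_size / 2).astype(int)
    py = (cam_pts[:, 1] * scale + img_size / 2).astype(int)
    img = np.zeros((img_size, img_size, 3))
    depth = np.full((img_size, img_size), -np.inf)
    for i in range(len(points)):
        x, y = px[i], py[i]
        if 0 <= x < img_size and 0 <= y < img_size and cam_pts[i, 2] > depth[y, x]:
            depth[y, x] = cam_pts[i, 2]               # keep the closest point per pixel
            img[y, x] = colors[i]
    return img

def axis_views():
    """Three axis-aligned virtual views (top, front, side) around the workspace."""
    top = np.eye(3)
    front = np.array([[1.0, 0, 0], [0, 0, -1.0], [0, 1.0, 0]])
    side = np.array([[0, 0, -1.0], [0, 1.0, 0], [1.0, 0, 0]])
    return [top, front, side]
```

A multi-view transformer would then attend across the per-view images produced this way.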
Related papers
- 3D-MVP: 3D Multiview Pretraining for Robotic Manipulation [53.45111493465405]
We propose 3D-MVP, a novel approach for 3D multi-view pretraining using masked autoencoders.
We leverage Robotic View Transformer (RVT), which uses a multi-view transformer to understand the 3D scene and predict pose actions.
We show promising results on a real robot platform with minimal finetuning.
arXiv Detail & Related papers (2024-06-26T08:17:59Z)
- RVT-2: Learning Precise Manipulation from Few Demonstrations [43.48649783097065]
RVT-2 is a multitask 3D manipulation model that is 6X faster in training and 2X faster in inference than its predecessor RVT.
It achieves a new state-of-the-art on RLBench, improving the success rate from 65% to 82%.
RVT-2 is also effective in the real world, where it can learn tasks requiring high precision, like picking up and inserting plugs, with just 10 demonstrations.
arXiv Detail & Related papers (2024-06-12T18:00:01Z)
- WidthFormer: Toward Efficient Transformer-based BEV View Transformation [23.055953867959744]
WidthFormer is computationally efficient, robust and does not require any special engineering effort to deploy.
We propose a novel 3D positional encoding mechanism capable of accurately encapsulating 3D geometric information.
Our model is highly efficient. For example, with $256\times 704$ input images, it achieves 1.5 ms and 2.8 ms latency on an NVIDIA 3090 GPU and a Horizon Journey-5 computing solution, respectively.
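As a rough illustration of injecting 3D geometric information into a transformer, the sketch below builds standard sinusoidal features over x, y, z. This is a generic technique, not necessarily WidthFormer's actual mechanism; the function name, dimension split, and frequency schedule are assumptions.

```python
import numpy as np

def sinusoidal_3d_encoding(xyz, dim=24, max_freq=8.0):
    """Encode 3D reference positions with sin/cos features.

    xyz: (N, 3) coordinates; dim must divide by 6 so x, y, z each get an
    equal share of sine and cosine bands.
    """
    assert dim % 6 == 0
    bands = dim // 6
    freqs = max_freq ** (np.arange(bands) / bands)   # geometric frequency ladder
    angles = xyz[:, :, None] * freqs[None, None, :]  # (N, 3, bands)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(len(xyz), dim)                # (N, dim)
```

Such encodings are typically added to, or concatenated with, the visual tokens before attention.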
arXiv Detail & Related papers (2024-01-08T11:50:23Z)
- Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation [18.964403296437027]
Act3D represents the robot's workspace using a 3D feature field with adaptive resolutions dependent on the task at hand.
It samples 3D point grids in a coarse to fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling.
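The coarse-to-fine sampling loop can be sketched as follows: score a grid of 3D points, re-center on the best sample, and shrink the search window each round. This is a minimal stand-in with a plain score function; Act3D's featurization with relative-position attention is not modeled here, and all names and defaults are illustrative.

```python
import numpy as np

def coarse_to_fine_argmax(score_fn, center, extent, grid=5, rounds=4):
    """Locate the highest-scoring 3D point by repeatedly sampling a point
    grid, scoring it, and zooming in around the best sample."""
    center = np.asarray(center, dtype=float)
    for _ in range(rounds):
        axes = [np.linspace(c - extent / 2, c + extent / 2, grid) for c in center]
        pts = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
        center = pts[np.argmax(score_fn(pts))]       # best sample this round
        extent *= 2.0 / (grid - 1)                   # shrink window to grid spacing
    return center
```

Each round scores grid**3 points, so resolution grows geometrically with rounds while the per-round cost stays constant.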
arXiv Detail & Related papers (2023-06-30T17:34:06Z)
- Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation [52.94101901600948]
We develop PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation.
PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by "detecting the next best voxel action".
Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.
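Detecting "the next best voxel action" implies discretizing the continuous workspace into a voxel grid that a classification head can target. A minimal sketch of that round trip, with assumed workspace bounds and bin count (not PerAct's actual resolution):

```python
import numpy as np

def position_to_voxel(pos, lo, hi, bins=100):
    """Discretize a continuous 3D gripper position into voxel indices."""
    frac = (np.asarray(pos, dtype=float) - lo) / (hi - lo)
    return np.clip((frac * bins).astype(int), 0, bins - 1)

def voxel_to_position(idx, lo, hi, bins=100):
    """Map voxel indices back to the voxel-center position."""
    return lo + (idx + 0.5) / bins * (hi - lo)
```

The round-trip error is bounded by half a voxel, which is why voxel resolution directly limits manipulation precision.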
arXiv Detail & Related papers (2022-09-12T17:51:05Z)
- Multi-View Transformer for 3D Visual Grounding [64.30493173825234]
We propose a Multi-View Transformer (MVT) for 3D visual grounding.
We project the 3D scene into a multi-view space, in which the position information of the 3D scene under different views is modeled simultaneously and aggregated.
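Projecting a scene into a multi-view space can be sketched by rotating the scene about the vertical axis once per view and exposing the per-view coordinates together. The yaw-only rotations and plain concatenation below are simplifying assumptions, not MVT's exact design.

```python
import numpy as np

def yaw_rotations(num_views):
    """One rotation about the vertical (z) axis per view, evenly spaced."""
    mats = []
    for k in range(num_views):
        a = 2 * np.pi * k / num_views
        c, s = np.cos(a), np.sin(a)
        mats.append(np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]))
    return mats

def multiview_coordinates(points, rotations):
    """Express each 3D point's position under every view and concatenate,
    so a transformer can attend over view-consistent geometry."""
    return np.concatenate([points @ R.T for R in rotations], axis=-1)  # (N, 3 * views)
```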
arXiv Detail & Related papers (2022-04-05T12:59:43Z)
- R3M: A Universal Visual Representation for Robot Manipulation [91.55543664116209]
We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of robotic manipulation tasks.
We find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo.
arXiv Detail & Related papers (2022-03-23T17:55:09Z)
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
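Conditioning a view-invariant 3D pose on a camera's projection operator reduces, in the simplest case, to a pinhole projection $K[R|t]$ followed by a perspective divide. The sketch below shows only that geometric step, with assumed intrinsics; the paper's learned conditioning is not reproduced here.

```python
import numpy as np

def project_to_view(joints_3d, K, R, t):
    """Project camera-invariant 3D joints into one view's 2D keypoints.

    joints_3d: (J, 3) world-frame joints
    K: (3, 3) intrinsics; R: (3, 3) rotation; t: (3,) translation
    """
    cam = joints_3d @ R.T + t          # world -> camera frame
    uv = cam @ K.T                     # apply pinhole intrinsics
    return uv[:, :2] / uv[:, 2:3]      # perspective divide -> pixel coords
```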
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.