RVT: Robotic View Transformer for 3D Object Manipulation
- URL: http://arxiv.org/abs/2306.14896v1
- Date: Mon, 26 Jun 2023 17:59:31 GMT
- Title: RVT: Robotic View Transformer for 3D Object Manipulation
- Authors: Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, Dieter Fox
- Abstract summary: We propose RVT, a multi-view transformer for 3D manipulation that is both scalable and accurate.
A single RVT model works well across 18 RLBench tasks with 249 task variations, achieving 26% higher relative success than the existing state-of-the-art method (PerAct).
- Score: 46.25268237442356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For 3D object manipulation, methods that build an explicit 3D representation
perform better than those relying only on camera images. But using explicit 3D
representations like voxels comes at large computing cost, adversely affecting
scalability. In this work, we propose RVT, a multi-view transformer for 3D
manipulation that is both scalable and accurate. Some key features of RVT are
an attention mechanism to aggregate information across views and re-rendering
of the camera input from virtual views around the robot workspace. In
simulations, we find that a single RVT model works well across 18 RLBench tasks
with 249 task variations, achieving 26% higher relative success than the
existing state-of-the-art method (PerAct). It also trains 36X faster than
PerAct to reach the same performance and runs at 2.3X the inference speed
of PerAct. Further, RVT can perform a variety of manipulation tasks in the real
world with just a few ($\sim$10) demonstrations per task. Visual results, code,
and trained model are provided at https://robotic-view-transformer.github.io/.
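The abstract's two key ingredients, re-rendering the scene from virtual views around the workspace and attending across those views with a single transformer, can be illustrated with a minimal sketch. The view layout, patch size, and heatmap head below are illustrative assumptions, not the released RVT architecture:

```python
# Minimal sketch, assuming an orthographic re-rendering into three virtual views
# and a single transformer whose attention spans the patch tokens of all views.
import torch
import torch.nn as nn

def render_orthographic(points, colors, img_size=64):
    """Splat a point cloud (N,3 in [-1,1]) into top/front/side RGB images."""
    views = []
    for axes in [(0, 1), (0, 2), (1, 2)]:           # drop z, y, x respectively
        img = torch.zeros(3, img_size, img_size)
        uv = ((points[:, axes] * 0.5 + 0.5) * (img_size - 1)).long().clamp(0, img_size - 1)
        img[:, uv[:, 1], uv[:, 0]] = colors.t()      # simple splatting (last write wins)
        views.append(img)
    return torch.stack(views)                        # (V, 3, H, W)

class MultiViewTransformer(nn.Module):
    def __init__(self, img_size=64, patch=8, dim=128, n_views=3):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_tokens = n_views * (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.heatmap = nn.Linear(dim, patch * patch)  # per-patch pixel logits for action heatmaps

    def forward(self, views):                         # views: (B, V, 3, H, W)
        B = views.shape[0]
        tok = self.patchify(views.flatten(0, 1))      # (B*V, dim, H/p, W/p)
        tok = tok.flatten(2).transpose(1, 2)          # (B*V, patches, dim)
        tok = tok.reshape(B, -1, tok.shape[-1]) + self.pos
        tok = self.encoder(tok)                       # attention spans all views jointly
        return self.heatmap(tok)

points, colors = torch.rand(2048, 3) * 2 - 1, torch.rand(2048, 3)
views = render_orthographic(points, colors).unsqueeze(0)
print(MultiViewTransformer()(views).shape)            # (1, 192, 64): 3 views x 64 patches
```

Because all view tokens share one attention stack, information observed in one virtual view can directly inform the heatmap predicted in another.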
Related papers
- Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction [51.49400490437258]
This work develops a method for imitating articulated object manipulation from a single monocular RGB human demonstration.
We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video.
Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion.
We evaluate 4D-DPM's 3D tracking accuracy on ground truth annotated 3D part trajectories and RSRD's physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot.
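The summary describes recovering per-frame 3D part motion from monocular video with a differentiable part model. A generic differentiable-fitting loop of that flavor is sketched below; the pinhole projection, loss, and synthetic 2D track are placeholder assumptions rather than the authors' 4D-DPM pipeline:

```python
# Illustrative only: optimize a per-frame part pose so that projected part points
# match observed 2D tracks, using plain autograd. All quantities are synthetic.
import torch

def project(points_3d, focal=300.0, cx=160.0, cy=120.0):
    """Pinhole projection of (N,3) camera-frame points to (N,2) pixels."""
    z = points_3d[:, 2:].clamp(min=1e-3)
    return points_3d[:, :2] / z * focal + torch.tensor([cx, cy])

part_points = torch.rand(100, 3) + torch.tensor([0.0, 0.0, 1.0])         # canonical part geometry
observed_2d = project(part_points + torch.tensor([0.05, -0.02, 0.0]))     # fake "video" track

translation = torch.zeros(3, requires_grad=True)        # per-frame pose (translation only here)
opt = torch.optim.Adam([translation], lr=1e-2)
for step in range(200):
    opt.zero_grad()
    loss = (project(part_points + translation) - observed_2d).pow(2).mean()
    loss.backward()
    opt.step()
print(translation.detach())   # approximately recovers the offset used to synthesize the track
```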
arXiv Detail & Related papers (2024-09-26T17:57:16Z)
- EmbodiedSAM: Online Segment Any 3D Thing in Real Time [61.2321497708998]
Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration.
An online, real-time, fine-grained, and highly generalizable 3D perception model is urgently needed.
arXiv Detail & Related papers (2024-08-21T17:57:06Z)
- RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation [10.54770475137596]
We propose RoboUniView, an innovative approach that decouples visual feature extraction from action learning.
We first learn a unified view representation from multi-perspective views by pre-training on readily accessible data, and then derive actions from this unified view representation to control robotic manipulation.
We achieve state-of-the-art performance on the demanding CALVIN benchmark, enhancing the success rate in the $D \to D$ setting from 93.0% to 96.2%, and in the $ABC \to D$ setting from 92.2% to 94.2%.
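The decoupling described here, a view-unifying encoder trained separately and then frozen while only an action head learns from demonstrations, can be sketched as follows. The module names, pooling-based fusion, and 7-dimensional action head are assumptions, not the RoboUniView architecture:

```python
# Sketch of decoupled representation and action learning: a pretrained multi-view
# encoder is frozen, and only a small action head is trained on top of it.
import torch
import torch.nn as nn

class UnifiedViewEncoder(nn.Module):
    """Stand-in for a pretrained encoder that fuses multi-perspective views."""
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(32, dim))
    def forward(self, views):                  # (B, V, 3, H, W)
        B, V = views.shape[:2]
        feats = self.backbone(views.flatten(0, 1)).reshape(B, V, -1)
        return feats.mean(dim=1)               # fuse perspectives into one representation

encoder = UnifiedViewEncoder()
encoder.requires_grad_(False)                  # visual representation fixed after pretraining
action_head = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 7))  # e.g. 6-DoF + gripper

views = torch.rand(4, 3, 3, 128, 128)          # batch of 4, three camera perspectives
print(action_head(encoder(views)).shape)       # torch.Size([4, 7])
```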
arXiv Detail & Related papers (2024-06-27T08:13:33Z)
- 3D-MVP: 3D Multiview Pretraining for Robotic Manipulation [53.45111493465405]
We propose 3D-MVP, a novel approach for 3D multi-view pretraining using masked autoencoders.
We leverage Robotic View Transformer (RVT), which uses a multi-view transformer to understand the 3D scene and predict pose actions.
We show promising results on a real robot platform with minimal finetuning.
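A minimal masked-autoencoder pretraining step of the kind the summary names looks like the sketch below; the masking ratio, token dimensions, and tiny encoder/decoder are generic MAE choices, not necessarily those of 3D-MVP:

```python
# Generic masked-autoencoder step: encode only the visible patch tokens and
# reconstruct the hidden ones from mask tokens placed at their positions.
import torch
import torch.nn as nn

dim, n_tok, mask_ratio = 128, 192, 0.75         # e.g. 3 views x 64 patches per view
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), 2)
decoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), 1)
predict = nn.Linear(dim, dim)                    # predict the original patch embedding
mask_token = nn.Parameter(torch.zeros(1, 1, dim))
pos = nn.Parameter(torch.zeros(1, n_tok, dim))   # learned positional embedding

tokens = torch.rand(8, n_tok, dim)               # patch embeddings from the multi-view renders
order = torch.rand(8, n_tok).argsort(dim=1)      # random per-sample patch permutation
n_keep = int(n_tok * (1 - mask_ratio))
keep, masked = order[:, :n_keep], order[:, n_keep:]
take = lambda t, idx: torch.gather(t.expand(8, -1, -1), 1, idx.unsqueeze(-1).expand(-1, -1, dim))

latent = encoder(take(tokens + pos, keep))       # encode only the visible 25% of patches
queries = mask_token + take(pos, masked)         # mask tokens carry the hidden patches' positions
pred = predict(decoder(torch.cat([latent, queries], dim=1)))[:, n_keep:]
loss = (pred - take(tokens, masked)).pow(2).mean()   # loss only on the hidden patches
loss.backward()
```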
arXiv Detail & Related papers (2024-06-26T08:17:59Z)
- RVT-2: Learning Precise Manipulation from Few Demonstrations [43.48649783097065]
RVT-2 is a multitask 3D manipulation model that is 6X faster in training and 2X faster in inference than its predecessor RVT.
It achieves a new state-of-the-art on RLBench, improving the success rate from 65% to 82%.
RVT-2 is also effective in the real world, where it can learn tasks requiring high precision, like picking up and inserting plugs, with just 10 demonstrations.
arXiv Detail & Related papers (2024-06-12T18:00:01Z)
- Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation [18.964403296437027]
Act3D represents the robot's workspace using a 3D feature field with adaptive resolutions dependent on the task at hand.
It samples 3D point grids in a coarse to fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling.
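The coarse-to-fine sampling loop can be sketched as follows; the stand-in scoring MLP, grid sizes, and shrink factor are placeholder assumptions (Act3D scores candidates by attending to scene features with relative-position attention, which is omitted here):

```python
# Sketch of coarse-to-fine sampling: score a grid of candidate 3D points, then
# resample a finer grid around the best candidate at each level.
import torch
import torch.nn as nn

score_net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))  # stand-in scorer

def grid(center, half_extent, n=8):
    """An n^3 grid of 3D points centered at `center`."""
    axis = torch.linspace(-half_extent, half_extent, n)
    offsets = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1).reshape(-1, 3)
    return center + offsets

center, half_extent = torch.zeros(3), 0.5           # start from the whole workspace
for level in range(3):                               # coarse -> fine
    candidates = grid(center, half_extent)           # (512, 3) candidate gripper positions
    scores = score_net(candidates).squeeze(-1)       # score each candidate
    center = candidates[scores.argmax()]             # focus the next round around the best one
    half_extent *= 0.25                              # shrink the sampling window
print(center)                                        # predicted 3D action location
```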
arXiv Detail & Related papers (2023-06-30T17:34:06Z)
- Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation [52.94101901600948]
We develop PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation.
PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by "detecting the next best voxel action".
Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.
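The phrase "detecting the next best voxel action" amounts to classifying over a discretized workspace. A minimal sketch of that discretization, with an assumed grid size and workspace bounds, is:

```python
# Map a continuous gripper position to an index in a voxel grid (the classification
# target for the policy) and back to a 3D position. Bounds and grid size are assumed.
import torch

GRID = 100                                             # voxels per axis
LO = torch.tensor([-0.5, -0.5, 0.0])                   # workspace bounds in meters (assumed)
HI = torch.tensor([0.5, 0.5, 1.0])

def position_to_voxel(pos):
    return ((pos - LO) / (HI - LO) * GRID).long().clamp(0, GRID - 1)   # (3,) voxel coordinates

def voxel_to_position(idx):
    return LO + (idx.float() + 0.5) / GRID * (HI - LO)  # voxel center in meters

target = torch.tensor([0.12, -0.30, 0.45])             # demonstrated gripper position
vox = position_to_voxel(target)
print(vox, voxel_to_position(vox))                     # round-trips to within half a voxel
```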
arXiv Detail & Related papers (2022-09-12T17:51:05Z)
- R3M: A Universal Visual Representation for Robot Manipulation [91.55543664116209]
We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of robotic manipulation tasks.
We find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo.
arXiv Detail & Related papers (2022-03-23T17:55:09Z)