3D-MVP: 3D Multiview Pretraining for Robotic Manipulation
- URL: http://arxiv.org/abs/2406.18158v1
- Date: Wed, 26 Jun 2024 08:17:59 GMT
- Title: 3D-MVP: 3D Multiview Pretraining for Robotic Manipulation
- Authors: Shengyi Qian, Kaichun Mo, Valts Blukis, David F. Fouhey, Dieter Fox, Ankit Goyal
- Abstract summary: We propose 3D-MVP, a novel approach for 3D multi-view pretraining using masked autoencoders.
We leverage Robotic View Transformer (RVT), which uses a multi-view transformer to understand the 3D scene and predict gripper pose actions.
We show promising results on a real robot platform with minimal finetuning.
- Score: 53.45111493465405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works have shown that visual pretraining on egocentric datasets using masked autoencoders (MAE) can improve generalization for downstream robotics tasks. However, these approaches pretrain only on 2D images, while many robotics applications require 3D scene understanding. In this work, we propose 3D-MVP, a novel approach for 3D multi-view pretraining using masked autoencoders. We leverage Robotic View Transformer (RVT), which uses a multi-view transformer to understand the 3D scene and predict gripper pose actions. We split RVT's multi-view transformer into visual encoder and action decoder, and pretrain its visual encoder using masked autoencoding on large-scale 3D datasets such as Objaverse. We evaluate 3D-MVP on a suite of virtual robot manipulation tasks and demonstrate improved performance over baselines. We also show promising results on a real robot platform with minimal finetuning. Our results suggest that 3D-aware pretraining is a promising approach to improve sample efficiency and generalization of vision-based robotic manipulation policies. We will release code and pretrained models for 3D-MVP to facilitate future research. Project site: https://jasonqsy.github.io/3DMVP
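The pretraining recipe described in the abstract (render multi-view images of 3D assets, mask most patches, reconstruct them with a masked autoencoder, and later reuse the encoder as RVT's visual encoder) can be illustrated with a minimal sketch. The code below is not the released 3D-MVP implementation; the module names, patch size, masking ratio, and PyTorch usage are assumptions made for illustration, and positional embeddings are omitted for brevity.

```python
# Minimal multi-view masked-autoencoder sketch (illustrative only, not the 3D-MVP release).
import torch
import torch.nn as nn


def patchify(views, patch=16):
    """Split (B, V, 3, H, W) multi-view renderings into flat patch tokens."""
    b, v, c, h, w = views.shape
    x = views.reshape(b * v, c, h, w)
    x = x.unfold(2, patch, patch).unfold(3, patch, patch)       # (B*V, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)
    return x                                                     # (B, V*N, C*p*p)


class MultiViewMAE(nn.Module):
    def __init__(self, dim=256, patch=16, mask_ratio=0.75):
        super().__init__()
        self.patch, self.mask_ratio = patch, mask_ratio
        self.embed = nn.Linear(3 * patch * patch, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)   # reused as the visual encoder
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)   # discarded after pretraining
        self.head = nn.Linear(dim, 3 * patch * patch)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, views):
        tokens = patchify(views, self.patch)
        target = tokens.clone()
        x = self.embed(tokens)
        b, t, d = x.shape
        keep = int(t * (1 - self.mask_ratio))
        idx = torch.rand(b, t, device=x.device).argsort(dim=1)      # random mask per sample
        visible = torch.gather(x, 1, idx[:, :keep, None].expand(-1, -1, d))
        latent = self.encoder(visible)
        # Scatter encoded visible tokens back; fill masked slots with a learned token.
        full = self.mask_token.expand(b, t, d).clone()
        full.scatter_(1, idx[:, :keep, None].expand(-1, -1, d), latent)
        pred = self.head(self.decoder(full))
        # For brevity the loss covers all patches; a faithful MAE scores masked patches only.
        return nn.functional.mse_loss(pred, target)


# Example: a batch of 2 scenes, each rendered from 5 virtual views at 64x64.
loss = MultiViewMAE()(torch.randn(2, 5, 3, 64, 64))
loss.backward()
```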
Related papers
- Transcrib3D: 3D Referring Expression Resolution through Large Language Models [28.121606686759225]
We introduce Transcrib3D, an approach that brings together 3D detection methods and the emergent reasoning capabilities of large language models.
Transcrib3D achieves state-of-the-art results on 3D reference resolution benchmarks.
We show that our method enables a real robot to perform pick-and-place tasks given queries that contain challenging referring expressions.
arXiv Detail & Related papers (2024-04-30T02:48:20Z)
- NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields [57.617972778377215]
We show how to generate effective 3D representations from posed RGB images.
We pretrain this representation at scale on our proposed curated posed-RGB data, totaling over 1.8 million images.
Our novel self-supervised pretraining for NeRFs, NeRF-MAE, scales remarkably well and improves performance on various challenging 3D tasks.
arXiv Detail & Related papers (2024-04-01T17:59:55Z)
- 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations [19.914227905704102]
3D robot policies use 3D scene feature representations aggregated from one or more camera views.
We present 3D diffuser actor, a neural policy equipped with a novel 3D denoising transformer.
It sets a new state-of-the-art on RLBench with an absolute performance gain of 18.1% over the current SOTA.
It also learns to control a robot manipulator in the real world from a handful of demonstrations.
arXiv Detail & Related papers (2024-02-16T18:43:02Z)
- SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction [77.15924044466976]
We propose SelfOcc to explore a self-supervised way to learn 3D occupancy using only video sequences.
We first transform the images into the 3D space (e.g., bird's eye view) to obtain a 3D representation of the scene.
We can then render 2D images of previous and future frames as self-supervision signals to learn the 3D representations.
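The self-supervision idea summarized above (lift images into a 3D representation, then render nearby frames from it and compare against the real images) can be illustrated with a toy volume-rendering step. This sketch is not the SelfOcc implementation; the occupancy-grid representation, nearest-voxel lookup, and simplified camera model are assumptions made for illustration.

```python
# Toy per-ray rendering from a 3D occupancy grid (illustrative only).
import torch


def render_ray(occupancy, colors, origin, direction, n_samples=64, t_max=3.0):
    """Alpha-composite voxel colors along one camera ray through a grid spanning [-1, 1]^3.

    occupancy: (D, D, D) per-voxel occupancy in [0, 1].
    colors:    (D, D, D, 3) per-voxel RGB.
    """
    d = occupancy.shape[0]
    ts = torch.linspace(0.0, t_max, n_samples)
    points = origin + ts[:, None] * direction                  # sample points along the ray
    idx = ((points + 1) / 2 * (d - 1)).round().long().clamp(0, d - 1)
    alpha = occupancy[idx[:, 0], idx[:, 1], idx[:, 2]]         # (n_samples,)
    rgb = colors[idx[:, 0], idx[:, 1], idx[:, 2]]              # (n_samples, 3)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1 - alpha[:-1]]), dim=0)
    weights = trans * alpha                                     # probability the ray stops here
    return (weights[:, None] * rgb).sum(dim=0)                  # rendered pixel color


# A photometric loss between pixels rendered this way and the actual previous/future
# frames would provide the self-supervision signal for the 3D representation.
occ = torch.rand(32, 32, 32, requires_grad=True)
col = torch.rand(32, 32, 32, 3)
pixel = render_ray(occ, col, origin=torch.tensor([0., 0., -1.]), direction=torch.tensor([0., 0., 1.]))
```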
arXiv Detail & Related papers (2023-11-21T17:59:14Z)
- PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation [93.46306666726969]
PolarNet is a 3D point cloud based policy for language-guided manipulation.
It learns 3D point cloud representations and integrates them with language instructions for action prediction.
It outperforms state-of-the-art 2D and 3D approaches in both single-task and multi-task learning.
arXiv Detail & Related papers (2023-09-27T11:50:43Z)
- Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation [18.964403296437027]
Act3D represents the robot's workspace using a 3D feature field with adaptive resolutions dependent on the task at hand.
It samples 3D point grids in a coarse-to-fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling.
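The coarse-to-fine sampling loop described above can be sketched roughly as follows. The grid size, number of rounds, and the placeholder scoring function (which stands in for Act3D's learned, attention-based featurizer) are assumptions, not the authors' code.

```python
# Illustrative coarse-to-fine 3D point sampling (not the Act3D implementation).
import torch


def sample_grid(center, extent, n=8):
    """Regular n^3 grid of 3D points centered at `center` with half-width `extent`."""
    axis = torch.linspace(-extent, extent, n)
    offsets = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
    return center + offsets.reshape(-1, 3)                     # (n^3, 3)


def coarse_to_fine(score_fn, workspace_center, workspace_extent, rounds=3):
    """Each round: score candidate points, keep the best, sample a finer grid around it."""
    center, extent = workspace_center, workspace_extent
    for _ in range(rounds):
        points = sample_grid(center, extent)                   # candidate 3D locations
        scores = score_fn(points)                              # (n^3,) task-conditioned scores
        center = points[scores.argmax()]                       # focus the next round here
        extent = extent / 2                                    # shrink the search region
    return center                                              # predicted 3D position


# Example with a dummy scorer that prefers points near a hidden goal.
goal = torch.tensor([0.1, -0.2, 0.3])
score_fn = lambda pts: -(pts - goal).norm(dim=-1)
print(coarse_to_fine(score_fn, torch.zeros(3), workspace_extent=0.5))
```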
arXiv Detail & Related papers (2023-06-30T17:34:06Z)
- RVT: Robotic View Transformer for 3D Object Manipulation [46.25268237442356]
We propose RVT, a multi-view transformer for 3D manipulation that is both scalable and accurate.
A single RVT model works well across 18 RLBench tasks with 249 task variations, achieving 26% higher relative success than the existing state-of-the-art method (PerAct).
arXiv Detail & Related papers (2023-06-26T17:59:31Z)
- Robot Learning with Sensorimotor Pre-training [98.7755895548928]
We present a self-supervised sensorimotor pre-training approach for robotics.
Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens.
We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots.
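The sensorimotor-token interface can be illustrated by embedding each modality into a shared token space and interleaving the tokens per timestep before feeding them to a Transformer. The modality set, dimensions, and module names below are illustrative assumptions rather than the RPT release.

```python
# Rough sketch of interleaving sensorimotor tokens for a Transformer (illustrative only).
import torch
import torch.nn as nn


class SensorimotorTokenizer(nn.Module):
    def __init__(self, dim=256, img_feat=512, proprio=7, action=7):
        super().__init__()
        self.img = nn.Linear(img_feat, dim)       # per-frame visual feature -> token
        self.proprio = nn.Linear(proprio, dim)    # joint angles / gripper state -> token
        self.action = nn.Linear(action, dim)      # executed action -> token

    def forward(self, img_feats, proprios, actions):
        # Each input: (B, T, feature_dim). Interleave as (img, proprio, action) per timestep.
        tokens = torch.stack(
            [self.img(img_feats), self.proprio(proprios), self.action(actions)], dim=2
        )                                          # (B, T, 3, dim)
        return tokens.flatten(1, 2)                # (B, 3*T, dim) sequence for the Transformer


tok = SensorimotorTokenizer()
seq = tok(torch.randn(2, 10, 512), torch.randn(2, 10, 7), torch.randn(2, 10, 7))
layer = nn.TransformerEncoderLayer(256, nhead=8, batch_first=True)
out = nn.TransformerEncoder(layer, num_layers=2)(seq)   # (2, 30, 256)
```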
arXiv Detail & Related papers (2023-06-16T17:58:10Z)
- Multi-View Transformer for 3D Visual Grounding [64.30493173825234]
We propose a Multi-View Transformer (MVT) for 3D visual grounding.
We project the 3D scene to a multi-view space, in which the position information of the 3D scene under different views is modeled simultaneously and aggregated.
arXiv Detail & Related papers (2022-04-05T12:59:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.