Multi-View Masked World Models for Visual Robotic Manipulation
- URL: http://arxiv.org/abs/2302.02408v2
- Date: Wed, 31 May 2023 08:13:44 GMT
- Title: Multi-View Masked World Models for Visual Robotic Manipulation
- Authors: Younggyo Seo, Junsu Kim, Stephen James, Kimin Lee, Jinwoo Shin, Pieter Abbeel
- Abstract summary: We train a multi-view masked autoencoder which reconstructs pixels of randomly masked viewpoints.
We demonstrate the effectiveness of our method in a range of scenarios.
We also show that the multi-view masked autoencoder trained with multiple randomized viewpoints enables training a policy with strong viewpoint randomization.
- Score: 132.97980128530017
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual robotic manipulation research and applications often use multiple
cameras, or views, to better perceive the world. How else can we utilize the
richness of multi-view data? In this paper, we investigate how to learn good
representations with multi-view data and utilize them for visual robotic
manipulation. Specifically, we train a multi-view masked autoencoder which
reconstructs pixels of randomly masked viewpoints and then learn a world model
operating on the representations from the autoencoder. We demonstrate the
effectiveness of our method in a range of scenarios, including multi-view
control and single-view control with auxiliary cameras for representation
learning. We also show that the multi-view masked autoencoder trained with
multiple randomized viewpoints enables training a policy with strong viewpoint
randomization and transferring the policy to solve real-robot tasks without
camera calibration and an adaptation procedure. Video demonstrations are
available at: https://sites.google.com/view/mv-mwm.
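
The core recipe in the abstract (mask a randomly chosen viewpoint, reconstruct its pixels, then learn a world model on the resulting representations) can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch version of the viewpoint-masking objective only; the class name, layer sizes, and masking scheme are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of the viewpoint-masking objective described above: all patches
# of one randomly chosen camera view are replaced with a mask token, and the
# autoencoder reconstructs that view's pixels from the remaining views.
# Class/variable names and sizes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class MultiViewMAE(nn.Module):
    def __init__(self, num_views=2, img_size=64, patch=8, dim=256):
        super().__init__()
        n = (img_size // patch) ** 2                      # patches per view
        self.patch_dim = 3 * patch * patch
        self.embed = nn.Linear(self.patch_dim, dim)       # patchify -> token
        self.pos_emb = nn.Parameter(torch.zeros(1, num_views, n, dim))  # view + position
        self.mask_token = nn.Parameter(torch.zeros(dim))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.to_pixels = nn.Linear(dim, self.patch_dim)

    def forward(self, patches):
        # patches: (B, V, N, patch_dim) -- patchified images from V cameras
        B, V, N, _ = patches.shape
        tokens = self.embed(patches) + self.pos_emb

        # View-level masking: pick one viewpoint per sample and replace all of
        # its tokens with the mask token (positional embeddings are kept so the
        # decoder knows which view it is asked to reconstruct).
        masked_view = torch.randint(V, (B,), device=patches.device)
        is_masked = torch.arange(V, device=patches.device)[None, :] == masked_view[:, None]
        tokens = torch.where(is_masked[:, :, None, None],
                             self.mask_token + self.pos_emb, tokens)

        latent = self.encoder(tokens.reshape(B, V * N, -1))
        recon = self.to_pixels(self.decoder(latent)).reshape(B, V, N, -1)

        # Pixel reconstruction loss, counted only on the masked viewpoint.
        per_view_loss = ((recon - patches) ** 2).mean(dim=(2, 3))   # (B, V)
        return (per_view_loss * is_masked).sum() / is_masked.sum()
```

The world-model and behavior-learning stages that operate on these learned representations, as described in the abstract, are omitted from this sketch.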
Related papers
- Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Videos [66.1935609072708]
The key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is.
We propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels.
During inference, our model takes as input only a multi-view video -- no language or camera poses -- and returns the best viewpoint to watch at each timestep.
arXiv Detail & Related papers (2024-11-13T16:31:08Z)
- Vision-based Manipulation from Single Human Video with Open-World Object Graphs [58.23098483464538]
We present an object-centric approach to empower robots to learn vision-based manipulation skills from human videos.
We introduce ORION, an algorithm that tackles the problem by extracting an object-centric manipulation plan from a single RGB-D video.
arXiv Detail & Related papers (2024-05-30T17:56:54Z)
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with their environment in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- Linking vision and motion for self-supervised object-centric perception [16.821130222597155]
Object-centric representations enable autonomous driving algorithms to reason about interactions between many independent agents and scene features.
Traditionally these representations have been obtained via supervised learning, but this decouples perception from the downstream driving task and could harm generalization.
We adapt a self-supervised object-centric vision model to perform object decomposition using only RGB video and the pose of the vehicle as inputs.
arXiv Detail & Related papers (2023-07-14T04:21:05Z)
- Masked Visual Pre-training for Motor Control [118.18189211080225]
Self-supervised visual pre-training from real-world images is effective for learning motor control tasks from pixels.
We freeze the visual encoder and train neural network controllers on top with reinforcement learning.
This is the first self-supervised model to exploit real-world images at scale for motor control.
arXiv Detail & Related papers (2022-03-11T18:58:10Z)
- Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation [15.632809977544907]
Learning to solve precision-based manipulation tasks from visual feedback could drastically reduce the engineering efforts required by traditional robot systems.
We propose a setting for robotic manipulation in which the agent receives visual feedback from both a third-person camera and an egocentric camera mounted on the robot's wrist.
To fuse visual information from both cameras effectively, we additionally propose to use Transformers with a cross-view attention mechanism (a rough sketch of such cross-view fusion appears after this list).
arXiv Detail & Related papers (2022-01-19T18:39:03Z)
- Seeing All the Angles: Learning Multiview Manipulation Policies for Contact-Rich Tasks from Demonstrations [7.51557557629519]
A successful multiview policy could be deployed on a mobile manipulation platform.
We demonstrate that a multiview policy can be found through imitation learning by collecting data from a variety of viewpoints.
We show that learning from multiview data has little, if any, penalty to performance for a fixed-view task compared to learning with an equivalent amount of fixed-view data.
arXiv Detail & Related papers (2021-04-28T17:43:29Z)
- Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos [59.58105314783289]
Domain-agnostic Video Discriminator (DVD) learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task.
DVD can generalize by virtue of learning from a small amount of robot data with a broad dataset of human videos.
DVD can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.
arXiv Detail & Related papers (2021-03-31T05:25:05Z)
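
As a companion to the "Look Closer" entry above, here is a rough, hypothetical sketch of fusing an egocentric (wrist) camera with a third-person camera via cross-view attention. The module name, dimensions, and the symmetric two-way attention are assumptions chosen for illustration, not the paper's actual architecture.

```python
# Hypothetical cross-view fusion: tokens from the wrist (egocentric) camera
# attend to tokens from the third-person camera and vice versa, and the two
# fused streams are concatenated for a downstream policy head.
import torch
import torch.nn as nn


class CrossViewFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.ego_to_third = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.third_to_ego = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_ego = nn.LayerNorm(dim)
        self.norm_third = nn.LayerNorm(dim)

    def forward(self, ego_tokens, third_tokens):
        # ego_tokens, third_tokens: (B, N, dim) feature tokens from each camera.
        ego_fused, _ = self.ego_to_third(query=ego_tokens,
                                         key=third_tokens, value=third_tokens)
        third_fused, _ = self.third_to_ego(query=third_tokens,
                                           key=ego_tokens, value=ego_tokens)
        # Residual connections, then concatenate the two fused token streams.
        return torch.cat([self.norm_ego(ego_tokens + ego_fused),
                          self.norm_third(third_tokens + third_fused)], dim=1)
```

A policy head would typically pool the concatenated tokens (e.g., mean pooling) before predicting actions; that part is left out here.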