Multi-View Masked World Models for Visual Robotic Manipulation
- URL: http://arxiv.org/abs/2302.02408v2
- Date: Wed, 31 May 2023 08:13:44 GMT
- Title: Multi-View Masked World Models for Visual Robotic Manipulation
- Authors: Younggyo Seo, Junsu Kim, Stephen James, Kimin Lee, Jinwoo Shin, Pieter Abbeel
- Abstract summary: We train a multi-view masked autoencoder which reconstructs pixels of randomly masked viewpoints.
We demonstrate the effectiveness of our method in a range of scenarios.
We also show that the multi-view masked autoencoder trained with multiple randomized viewpoints enables training a policy with strong viewpoint randomization.
- Score: 132.97980128530017
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual robotic manipulation research and applications often use multiple
cameras, or views, to better perceive the world. How else can we utilize the
richness of multi-view data? In this paper, we investigate how to learn good
representations with multi-view data and utilize them for visual robotic
manipulation. Specifically, we train a multi-view masked autoencoder which
reconstructs pixels of randomly masked viewpoints and then learn a world model
operating on the representations from the autoencoder. We demonstrate the
effectiveness of our method in a range of scenarios, including multi-view
control and single-view control with auxiliary cameras for representation
learning. We also show that the multi-view masked autoencoder trained with
multiple randomized viewpoints enables training a policy with strong viewpoint
randomization and transferring the policy to solve real-robot tasks without
camera calibration and an adaptation procedure. Video demonstrations are
available at: https://sites.google.com/view/mv-mwm.
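
The core recipe in the abstract (mask a randomly chosen viewpoint, reconstruct its pixels, then learn a world model on the resulting representations) can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch version of the viewpoint-masking objective only; the class name, layer sizes, and masking scheme are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of the viewpoint-masking objective described above: all patches
# of one randomly chosen camera view are replaced with a mask token, and the
# autoencoder reconstructs that view's pixels from the remaining views.
# Class/variable names and sizes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class MultiViewMAE(nn.Module):
    def __init__(self, num_views=2, img_size=64, patch=8, dim=256):
        super().__init__()
        n = (img_size // patch) ** 2                      # patches per view
        self.patch_dim = 3 * patch * patch
        self.embed = nn.Linear(self.patch_dim, dim)       # patchify -> token
        self.pos_emb = nn.Parameter(torch.zeros(1, num_views, n, dim))  # view + position
        self.mask_token = nn.Parameter(torch.zeros(dim))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.to_pixels = nn.Linear(dim, self.patch_dim)

    def forward(self, patches):
        # patches: (B, V, N, patch_dim) -- patchified images from V cameras
        B, V, N, _ = patches.shape
        tokens = self.embed(patches) + self.pos_emb

        # View-level masking: pick one viewpoint per sample and replace all of
        # its tokens with the mask token (positional embeddings are kept so the
        # decoder knows which view it is asked to reconstruct).
        masked_view = torch.randint(V, (B,), device=patches.device)
        is_masked = torch.arange(V, device=patches.device)[None, :] == masked_view[:, None]
        tokens = torch.where(is_masked[:, :, None, None],
                             self.mask_token + self.pos_emb, tokens)

        latent = self.encoder(tokens.reshape(B, V * N, -1))
        recon = self.to_pixels(self.decoder(latent)).reshape(B, V, N, -1)

        # Pixel reconstruction loss, counted only on the masked viewpoint.
        per_view_loss = ((recon - patches) ** 2).mean(dim=(2, 3))   # (B, V)
        return (per_view_loss * is_masked).sum() / is_masked.sum()
```

The world-model and behavior-learning stages that operate on these learned representations, as described in the abstract, are omitted from this sketch.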
Related papers
- Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Videos [66.1935609072708]
The key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is.
We propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels.
During inference, our model takes as input only a multi-view video -- no language or camera poses -- and returns the best viewpoint to watch at each timestep.
arXiv Detail & Related papers (2024-11-13T16:31:08Z)
- Vision-based Manipulation from Single Human Video with Open-World Object Graphs [58.23098483464538]
We present an object-centric approach to empower robots to learn vision-based manipulation skills from human videos.
We introduce ORION, an algorithm that tackles the problem by extracting an object-centric manipulation plan from a single RGB-D video.
arXiv Detail & Related papers (2024-05-30T17:56:54Z)
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with their environment in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- Linking vision and motion for self-supervised object-centric perception [16.821130222597155]
Object-centric representations enable autonomous driving algorithms to reason about interactions between many independent agents and scene features.
Traditionally these representations have been obtained via supervised learning, but this decouples perception from the downstream driving task and could harm generalization.
We adapt a self-supervised object-centric vision model to perform object decomposition using only RGB video and the pose of the vehicle as inputs.
arXiv Detail & Related papers (2023-07-14T04:21:05Z)
- Masked Visual Pre-training for Motor Control [118.18189211080225]
Self-supervised visual pre-training from real-world images is effective for learning motor control tasks from pixels.
We freeze the visual encoder and train neural network controllers on top with reinforcement learning.
This is the first self-supervised model to exploit real-world images at scale for motor control.
arXiv Detail & Related papers (2022-03-11T18:58:10Z)
- Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation [15.632809977544907]
Learning to solve precision-based manipulation tasks from visual feedback could drastically reduce the engineering efforts required by traditional robot systems.
We propose a setting for robotic manipulation in which the agent receives visual feedback from both a third-person camera and an egocentric camera mounted on the robot's wrist.
To fuse visual information from both cameras effectively, we additionally propose to use Transformers with a cross-view attention mechanism (a rough sketch of such cross-view fusion appears after this list).
arXiv Detail & Related papers (2022-01-19T18:39:03Z)
- Seeing All the Angles: Learning Multiview Manipulation Policies for Contact-Rich Tasks from Demonstrations [7.51557557629519]
A successful multiview policy could be deployed on a mobile manipulation platform.
We demonstrate that a multiview policy can be found through imitation learning by collecting data from a variety of viewpoints.
We show that learning from multiview data has little, if any, penalty to performance for a fixed-view task compared to learning with an equivalent amount of fixed-view data.
arXiv Detail & Related papers (2021-04-28T17:43:29Z)
- Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos [59.58105314783289]
Domain-agnostic Video Discriminator (DVD) learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task.
DVD can generalize by virtue of learning from a small amount of robot data with a broad dataset of human videos.
DVD can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.
arXiv Detail & Related papers (2021-03-31T05:25:05Z)
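
As a companion to the "Look Closer" entry above, here is a rough, hypothetical sketch of fusing an egocentric (wrist) camera with a third-person camera via cross-view attention. The module name, dimensions, and the symmetric two-way attention are assumptions chosen for illustration, not the paper's actual architecture.

```python
# Hypothetical cross-view fusion: tokens from the wrist (egocentric) camera
# attend to tokens from the third-person camera and vice versa, and the two
# fused streams are concatenated for a downstream policy head.
import torch
import torch.nn as nn


class CrossViewFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.ego_to_third = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.third_to_ego = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_ego = nn.LayerNorm(dim)
        self.norm_third = nn.LayerNorm(dim)

    def forward(self, ego_tokens, third_tokens):
        # ego_tokens, third_tokens: (B, N, dim) feature tokens from each camera.
        ego_fused, _ = self.ego_to_third(query=ego_tokens,
                                         key=third_tokens, value=third_tokens)
        third_fused, _ = self.third_to_ego(query=third_tokens,
                                           key=ego_tokens, value=ego_tokens)
        # Residual connections, then concatenate the two fused token streams.
        return torch.cat([self.norm_ego(ego_tokens + ego_fused),
                          self.norm_third(third_tokens + third_fused)], dim=1)
```

A policy head would typically pool the concatenated tokens (e.g., mean pooling) before predicting actions; that part is left out here.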