Masked World Models for Visual Control
- URL: http://arxiv.org/abs/2206.14244v3
- Date: Sat, 27 May 2023 09:29:48 GMT
- Title: Masked World Models for Visual Control
- Authors: Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James,
Kimin Lee, Pieter Abbeel
- Abstract summary: We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning.
We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
- Score: 90.13638482124567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual model-based reinforcement learning (RL) has the potential to enable
sample-efficient robot learning from visual observations. Yet the current
approaches typically train a single model end-to-end for learning both visual
representations and dynamics, making it difficult to accurately model the
interaction between robots and small objects. In this work, we introduce a
visual model-based RL framework that decouples visual representation learning
and dynamics learning. Specifically, we train an autoencoder with convolutional
layers and vision transformers (ViT) to reconstruct pixels given masked
convolutional features, and learn a latent dynamics model that operates on the
representations from the autoencoder. Moreover, to encode task-relevant
information, we introduce an auxiliary reward prediction objective for the
autoencoder. We continually update both autoencoder and dynamics model using
online samples collected from environment interaction. We demonstrate that our
decoupling approach achieves state-of-the-art performance on a variety of
visual robotic tasks from Meta-world and RLBench, e.g., we achieve 81.7%
success rate on 50 visual robotic manipulation tasks from Meta-world, while the
baseline achieves 67.9%. Code is available on the project website:
https://sites.google.com/view/mwm-rl.
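To make the decoupling concrete, below is a minimal PyTorch sketch (not the authors' released code) of the representation-learning half described in the abstract: a convolutional stem produces feature tokens, a ViT autoencoder reconstructs pixels from a randomly masked subset of those tokens, and an auxiliary head predicts the reward. All module names, sizes, the masking ratio, and the `patchify` helper are illustrative assumptions; the separate latent dynamics model is omitted and would consume the detached autoencoder features.

```python
# Minimal sketch, assuming illustrative hyperparameters (64x64 images, 8x8 patches,
# 75% masking). Not the authors' implementation.
import torch
import torch.nn as nn


def patchify(img, patch):
    """Split (B, C, H, W) images into per-patch pixel targets of shape (B, N, C*patch*patch)."""
    B, C, H, W = img.shape
    x = img.unfold(2, patch, patch).unfold(3, patch, patch)          # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)


class MaskedConvViTAutoencoder(nn.Module):
    def __init__(self, img_size=64, patch=8, dim=256, depth=4, heads=4, mask_ratio=0.75):
        super().__init__()
        self.patch, self.mask_ratio = patch, mask_ratio
        # Convolutional stem: image -> grid of feature tokens.
        self.conv_stem = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 4, stride=4),                          # (B, dim, img/8, img/8)
        )
        n_tokens = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True), depth)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True), depth)
        self.to_pixels = nn.Linear(dim, 3 * patch * patch)            # pixel reconstruction head
        self.reward_head = nn.Linear(dim, 1)                          # auxiliary reward prediction

    def forward(self, img, reward):
        B = img.shape[0]
        tokens = self.conv_stem(img).flatten(2).transpose(1, 2) + self.pos   # (B, N, dim)
        N, D = tokens.shape[1], tokens.shape[2]
        # Mask a random subset of the convolutional feature tokens.
        keep = int(N * (1 - self.mask_ratio))
        idx = torch.rand(B, N, device=img.device).argsort(dim=1)
        vis_idx = idx[:, :keep, None].expand(-1, -1, D)
        visible = torch.gather(tokens, 1, vis_idx)
        encoded = self.encoder(visible)
        # Decoder sees the encoded visible tokens scattered back among mask tokens.
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, vis_idx, encoded)
        decoded = self.decoder(full + self.pos)
        recon = self.to_pixels(decoded)                               # (B, N, 3*p*p)
        pred_reward = self.reward_head(decoded.mean(dim=1)).squeeze(-1)
        target = patchify(img, self.patch)
        loss = ((recon - target) ** 2).mean() + ((pred_reward - reward) ** 2).mean()
        # The detached features would feed the separately trained latent dynamics model.
        return loss, decoded.detach()


# Example usage on a dummy batch; in the full pipeline both the autoencoder and the
# dynamics model would be updated continually from online environment samples.
model = MaskedConvViTAutoencoder()
loss, features = model(torch.randn(4, 3, 64, 64), torch.randn(4))
loss.backward()
```

The key design choice sketched here is that pixel reconstruction and reward prediction shape the representation, while the dynamics model is trained on top of those representations rather than backpropagating through the encoder end-to-end.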
Related papers
- Theia: Distilling Diverse Vision Foundation Models for Robot Learning [6.709078873834651]
Theia is a vision foundation model for robot learning that distills multiple off-the-shelf vision foundation models trained on varied vision tasks.
Theia's rich visual representations encode diverse visual knowledge, enhancing downstream robot learning.
arXiv Detail & Related papers (2024-07-29T17:08:21Z)
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments yield strong performance, demonstrating that LLARVA performs well compared to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z)
- Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z)
- 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
- Model-Based Visual Planning with Self-Supervised Functional Distances [104.83979811803466]
We present a self-supervised method for model-based visual goal reaching.
Our approach learns entirely using offline, unlabeled data.
We find that this approach substantially outperforms both model-free and model-based prior methods.
arXiv Detail & Related papers (2020-12-30T23:59:09Z)
- MELD: Meta-Reinforcement Learning from Images via Latent State Models [109.1664295663325]
We develop an algorithm for meta-RL from images that performs inference in a latent state model to quickly acquire new skills.
MELD is the first meta-RL algorithm trained in a real-world robotic control setting from images.
arXiv Detail & Related papers (2020-10-26T23:50:30Z)
- Model-Based Inverse Reinforcement Learning from Visual Demonstrations [20.23223474119314]
We present a gradient-based inverse reinforcement learning framework that learns cost functions when given only visual human demonstrations.
The learned cost functions are then used to reproduce the demonstrated behavior via visual model predictive control.
We evaluate our framework on hardware on two basic object manipulation tasks.
arXiv Detail & Related papers (2020-10-18T17:07:53Z)