Pre-training Contextualized World Models with In-the-wild Videos for
Reinforcement Learning
- URL: http://arxiv.org/abs/2305.18499v2
- Date: Fri, 27 Oct 2023 03:28:48 GMT
- Title: Pre-training Contextualized World Models with In-the-wild Videos for
Reinforcement Learning
- Authors: Jialong Wu, Haoyu Ma, Chaoyi Deng, Mingsheng Long
- Abstract summary: In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
- Score: 54.67880602409801
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised pre-training methods utilizing large and diverse datasets have
achieved tremendous success across a range of domains. Recent work has
investigated such unsupervised pre-training methods for model-based
reinforcement learning (MBRL) but is limited to domain-specific or simulated
data. In this paper, we study the problem of pre-training world models with
abundant in-the-wild videos for efficient learning of downstream visual control
tasks. However, in-the-wild videos are complicated with various contextual
factors, such as intricate backgrounds and textured appearance, which precludes
a world model from extracting shared world knowledge to generalize better. To
tackle this issue, we introduce Contextualized World Models (ContextWM) that
explicitly separate context and dynamics modeling to overcome the complexity
and diversity of in-the-wild videos and facilitate knowledge transfer between
distinct scenes. Specifically, a contextualized extension of the latent
dynamics model is elaborately realized by incorporating a context encoder to
retain contextual information and empower the image decoder, which encourages
the latent dynamics model to concentrate on essential temporal variations. Our
experiments show that in-the-wild video pre-training equipped with ContextWM
can significantly improve the sample efficiency of MBRL in various domains,
including robotic manipulation, locomotion, and autonomous driving. Code is
available at this repository: https://github.com/thuml/ContextWM.
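The separation of context and dynamics described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the repository above for that): the function names `encode_context`, `encode_dynamics`, `decode`, and `reconstruct_video` are hypothetical, the choice of the first frame as context is an assumption, and hand-written arithmetic stands in for the learned neural networks. It only shows the idea that a context branch carries static appearance to the decoder, so the per-frame latent only has to encode what changes over time.

```python
from typing import List, Tuple

Frame = List[float]  # a flattened image; real models operate on image tensors


def encode_context(frames: List[Frame]) -> Frame:
    """Context branch: keep one frame as the time-invariant appearance.

    Assumption: the first frame is used. A learned context encoder would
    produce feature maps rather than return raw pixels.
    """
    return list(frames[0])


def encode_dynamics(frame: Frame, context: Frame, latent_dim: int) -> List[float]:
    """Dynamics branch: summarize how `frame` differs from the context.

    Toy stand-in for a learned latent: the mean change within each of
    `latent_dim` equal-sized chunks of the frame.
    """
    if len(frame) != len(context) or len(frame) % latent_dim != 0:
        raise ValueError("sizes must match and be divisible by latent_dim")
    chunk = len(frame) // latent_dim
    return [
        sum(frame[i] - context[i] for i in range(k * chunk, (k + 1) * chunk)) / chunk
        for k in range(latent_dim)
    ]


def decode(latent: List[float], context: Frame) -> Frame:
    """Decoder: rebuild a frame from the small latent plus the context."""
    chunk = len(context) // len(latent)
    return [context[i] + latent[i // chunk] for i in range(len(context))]


def reconstruct_video(
    frames: List[Frame], latent_dim: int
) -> Tuple[List[List[float]], List[Frame]]:
    """Encode every frame against a shared context, then decode it back."""
    context = encode_context(frames)
    latents = [encode_dynamics(f, context, latent_dim) for f in frames]
    recons = [decode(z, context) for z in latents]
    return latents, recons


if __name__ == "__main__":
    background = [0.1, 0.9, 0.3, 0.7]   # textured, static appearance
    moved = [0.6, 1.4, 0.1, 0.5]        # same background plus a per-chunk change
    latents, recons = reconstruct_video([background, moved], latent_dim=2)
    print("latents:", latents)
    print("reconstructions:", recons)
```

In this toy setting the background texture never enters the latent: the two-number latent captures only the per-chunk change, and the decoder recovers the full frame because the context supplies the rest.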
Related papers
- Towards Multi-Task Multi-Modal Models: A Video Generative Perspective [5.495245220300184]
This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions.
We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms.
Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
arXiv Detail & Related papers (2024-05-26T23:56:45Z)
- Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794]
Video and text multimodal alignment remains challenging, primarily due to the insufficient volume and quality of multimodal instruction-tuning data.
We present a novel alignment strategy, called Reinforcement Learning from AI Feedback (RLAIF), that employs a multimodal AI system to oversee itself.
Specifically, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback.
arXiv Detail & Related papers (2024-02-06T06:27:40Z)
- Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives [56.2139730920855]
We present a systematic analysis of multi-modal multi-task visual understanding foundation models (MM-VUFMs) specifically designed for road scenes.
Our objective is to provide a comprehensive overview of common practices, referring to task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques.
We provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
arXiv Detail & Related papers (2024-02-05T12:47:09Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized
Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - Style-Hallucinated Dual Consistency Learning: A Unified Framework for
Visual Domain Generalization [113.03189252044773]
We propose a unified framework, Style-HAllucinated Dual consistEncy learning (SHADE), to handle domain shift in various visual tasks.
Our versatile SHADE can significantly enhance the generalization in various visual recognition tasks, including image classification, semantic segmentation and object detection.
arXiv Detail & Related papers (2022-12-18T11:42:51Z) - Trajectory-wise Multiple Choice Learning for Dynamics Generalization in
Reinforcement Learning [137.39196753245105]
We present a new model-based reinforcement learning algorithm that learns a multi-headed dynamics model for dynamics generalization.
We incorporate context learning, which encodes dynamics-specific information from past experiences into the context latent vector.
Our method exhibits superior zero-shot generalization performance across a variety of control tasks, compared to state-of-the-art RL methods.
arXiv Detail & Related papers (2020-10-26T03:20:42Z) - Representation Learning with Video Deep InfoMax [26.692717942430185]
We extend DeepInfoMax to the video domain by leveraging similar structure in spatio-temporal networks.
We find that drawing views from both natural-rate sequences and temporally-downsampled sequences yields results on Kinetics-pretrained action recognition tasks that match or outperform prior state-of-the-art methods.
arXiv Detail & Related papers (2020-07-27T02:28:47Z) - Context-aware Dynamics Model for Generalization in Model-Based
Reinforcement Learning [124.9856253431878]
We decompose the task of learning a global dynamics model into two stages: (a) learning a context latent vector that captures the local dynamics, then (b) predicting the next state conditioned on it.
In order to encode dynamics-specific information into the context latent vector, we introduce a novel loss function that encourages the context latent vector to be useful for predicting both forward and backward dynamics.
The proposed method achieves superior generalization ability across various simulated robotics and control tasks, compared to existing RL schemes.
arXiv Detail & Related papers (2020-05-14T08:10:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.