Learning and Leveraging World Models in Visual Representation Learning
- URL: http://arxiv.org/abs/2403.00504v1
- Date: Fri, 1 Mar 2024 13:05:38 GMT
- Title: Learning and Leveraging World Models in Visual Representation Learning
- Authors: Quentin Garrido, Mahmoud Assran, Nicolas Ballas, Adrien Bardes, Laurent Najman, Yann LeCun
- Abstract summary: Joint-Embedding Predictive Architecture (JEPA) has emerged as a promising self-supervised approach that learns by leveraging a world model.
We introduce Image World Models, an approach that goes beyond masked image modeling and learns to predict the effect of global photometric transformations in latent space.
- Score: 34.81177885432796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Joint-Embedding Predictive Architecture (JEPA) has emerged as a promising
self-supervised approach that learns by leveraging a world model. While
previously limited to predicting missing parts of an input, we explore how to
generalize the JEPA prediction task to a broader set of corruptions. We
introduce Image World Models, an approach that goes beyond masked image
modeling and learns to predict the effect of global photometric transformations
in latent space. We study the recipe of learning performant IWMs and show that
it relies on three key aspects: conditioning, prediction difficulty, and
capacity. Additionally, we show that the predictive world model learned by IWM
can be adapted through finetuning to solve diverse tasks; a fine-tuned IWM
world model matches or surpasses the performance of previous self-supervised
methods. Finally, we show that learning with an IWM allows one to control the
abstraction level of the learned representations, learning invariant
representations such as contrastive methods, or equivariant representations
such as masked image modelling.
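The abstract's core idea, predicting the effect of a global photometric transformation in latent space with a predictor conditioned on the transformation, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the linear `encode`, the single brightness shift in `adjust_brightness`, the concatenation-based conditioning in `predict`, the mean squared error objective, and all names and shapes are hypothetical.

```python
import numpy as np

def adjust_brightness(image, delta):
    """Global photometric transformation: shift every pixel by delta, clipped to [0, 1]."""
    return np.clip(image + delta, 0.0, 1.0)

def encode(image, weights):
    """Toy linear encoder standing in for a vision backbone (hypothetical)."""
    return image.reshape(-1) @ weights

def predict(z_source, params, predictor_weights):
    """Toy predictor conditioned on the transformation parameters (hypothetical)."""
    return np.concatenate([z_source, params]) @ predictor_weights

def latent_prediction_loss(image, delta, enc_w, target_enc_w, pred_w):
    """Mean squared error between the predicted and the actual target latent."""
    z_source = encode(adjust_brightness(image, delta), enc_w)  # transformed view
    z_target = encode(image, target_enc_w)                     # untransformed view
    # Conditioning: parameters describing how to go from the source to the target view.
    params = np.array([-delta])
    z_pred = predict(z_source, params, pred_w)
    return float(np.mean((z_pred - z_target) ** 2))

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(8, 8))
latent_dim = 4
enc_w = rng.normal(size=(image.size, latent_dim))
pred_w = rng.normal(size=(latent_dim + 1, latent_dim))

# The same weights are reused for the target encoder here purely for simplicity.
loss = latent_prediction_loss(image, 0.3, enc_w, enc_w, pred_w)
print(f"loss = {loss:.4f}")
```

In this sketch the loss is computed between latents rather than pixels, and the predictor sees the transformation parameters, which is what lets it model the transformation instead of forcing the encoder to discard it.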
Related papers
- From Prototypes to General Distributions: An Efficient Curriculum for Masked Image Modeling [11.634154932876719]
Masked Image Modeling has emerged as a powerful self-supervised learning paradigm for visual representation learning.
We propose a prototype-driven curriculum learning framework that structures the learning process to progress from prototypical examples to more complex variations in the dataset.
Our findings suggest that carefully controlling the order of training examples plays a crucial role in self-supervised visual learning.
arXiv Detail & Related papers (2024-11-16T03:21:06Z)
- Masked Generative Priors Improve World Models Sequence Modelling Capabilities [19.700020499490137]
Masked Generative Modelling has emerged as a more efficient and superior inductive bias for modelling.
GIT-STORM demonstrates substantial performance gains in RL tasks on the Atari 100k benchmark.
We apply Transformer-based World Models to continuous action environments for the first time, addressing a significant gap in prior research.
arXiv Detail & Related papers (2024-10-10T11:52:07Z)
- A Probabilistic Model Behind Self-Supervised Learning [53.64989127914936]
In self-supervised learning (SSL), representations are learned via an auxiliary task without annotated labels.
We present a generative latent variable model for self-supervised learning.
We show that several families of discriminative SSL, including contrastive methods, induce a comparable distribution over representations.
arXiv Detail & Related papers (2024-02-02T13:31:17Z)
- Masked Modeling for Self-supervised Representation Learning on Vision and Beyond [69.64364187449773]
Masked modeling has emerged as a distinctive approach that involves predicting parts of the original data that are proportionally masked during training.
We elaborate on the details of techniques within masked modeling, including diverse masking strategies, recovering targets, network architectures, and more.
We conclude by discussing the limitations of current techniques and point out several potential avenues for advancing masked modeling research.
arXiv Detail & Related papers (2023-12-31T12:03:21Z)
- ReCoRe: Regularized Contrastive Representation Learning of World Model [21.29132219042405]
We present a world model that learns invariant features using contrastive unsupervised learning and an intervention-invariant regularizer.
Our method outperforms current state-of-the-art model-based and model-free RL methods and significantly improves on out-of-distribution point navigation tasks evaluated on the iGibson benchmark.
arXiv Detail & Related papers (2023-12-14T15:53:07Z)
- Sequential Modeling Enables Scalable Learning for Large Vision Models [120.91839619284431]
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data.
We define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources.
arXiv Detail & Related papers (2023-12-01T18:59:57Z)
- Generalizable Imitation Learning Through Pre-Trained Representations [19.98418419179064]
We introduce BC-ViT, an imitation learning algorithm that leverages rich DINO pre-trained Visual Transformer (ViT) patch-level embeddings to obtain better generalization when learning through demonstrations.
Our learner sees the world by clustering appearance features into semantic concepts, forming stable keypoints that generalize across a wide range of appearance variations and object types.
arXiv Detail & Related papers (2023-11-15T20:15:51Z)
- Unifying (Machine) Vision via Counterfactual World Modeling [5.001446411351483]
We introduce Counterfactual World Modeling (CWM), a framework for constructing a visual foundation model.
CWM has two key components, which resolve the core issues that have hindered application of the foundation model concept to vision.
We show that CWM generates high-quality readouts on real-world images and videos for a diversity of tasks.
arXiv Detail & Related papers (2023-06-02T17:45:44Z)
- Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
- Predictive Experience Replay for Continual Visual Control and Forecasting [62.06183102362871]
We present a new continual learning approach for visual dynamics modeling and explore its efficacy in visual control and forecasting.
We first propose the mixture world model that learns task-specific dynamics priors with a mixture of Gaussians, and then introduce a new training strategy to overcome catastrophic forgetting.
Our model remarkably outperforms the naive combinations of existing continual learning and visual RL algorithms on DeepMind Control and Meta-World benchmarks with continual visual control tasks.
arXiv Detail & Related papers (2023-03-12T05:08:03Z)
- Learning by Distillation: A Self-Supervised Learning Framework for Optical Flow Estimation [71.76008290101214]
DistillFlow is a knowledge distillation approach to learning optical flow.
It achieves state-of-the-art unsupervised learning performance on both KITTI and Sintel datasets.
Our models ranked 1st among all monocular methods on the KITTI 2015 benchmark, and outperform all published methods on the Sintel Final benchmark.
arXiv Detail & Related papers (2021-06-08T09:13:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.