Learning and Leveraging World Models in Visual Representation Learning
- URL: http://arxiv.org/abs/2403.00504v1
- Date: Fri, 1 Mar 2024 13:05:38 GMT
- Title: Learning and Leveraging World Models in Visual Representation Learning
- Authors: Quentin Garrido, Mahmoud Assran, Nicolas Ballas, Adrien Bardes,
Laurent Najman, Yann LeCun
- Abstract summary: Joint-Embedding Predictive Architecture (JEPA) has emerged as a promising self-supervised approach that learns by leveraging a world model.
We introduce Image World Models, an approach that goes beyond masked image modeling and learns to predict the effect of global photometric transformations in latent space.
- Score: 34.81177885432796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Joint-Embedding Predictive Architecture (JEPA) has emerged as a promising
self-supervised approach that learns by leveraging a world model. While
previously limited to predicting missing parts of an input, we explore how to
generalize the JEPA prediction task to a broader set of corruptions. We
introduce Image World Models, an approach that goes beyond masked image
modeling and learns to predict the effect of global photometric transformations
in latent space. We study the recipe of learning performant IWMs and show that
it relies on three key aspects: conditioning, prediction difficulty, and
capacity. Additionally, we show that the predictive world model learned by IWM
can be adapted through finetuning to solve diverse tasks; a fine-tuned IWM
world model matches or surpasses the performance of previous self-supervised
methods. Finally, we show that learning with an IWM allows one to control the
abstraction level of the learned representations, learning invariant
representations such as contrastive methods, or equivariant representations
such as masked image modelling.
Related papers
- A Probabilistic Model behind Self-Supervised Learning [53.64989127914936]
In self-supervised learning (SSL), representations are learned via an auxiliary task without annotated labels.
We present a generative latent variable model for self-supervised learning.
We show that several families of discriminative SSL induce a comparable distribution over representations.
arXiv Detail & Related papers (2024-02-02T13:31:17Z) - Masked Modeling for Self-supervised Representation Learning on Vision
and Beyond [69.64364187449773]
Masked modeling has emerged as a distinctive approach that involves predicting parts of the original data that are proportionally masked during training.
We elaborate on the details of techniques within masked modeling, including diverse masking strategies, recovering targets, network architectures, and more.
We conclude by discussing the limitations of current techniques and point out several potential avenues for advancing masked modeling research.
arXiv Detail & Related papers (2023-12-31T12:03:21Z) - ReCoRe: Regularized Contrastive Representation Learning of World Model [21.29132219042405]
We present a world model that learns invariant features using contrastive unsupervised learning and an intervention-invariant regularizer.
Our method outperforms current state-of-the-art model-based and model-free RL methods and significantly improves on out-of-distribution point navigation tasks evaluated on the iGibson benchmark.
arXiv Detail & Related papers (2023-12-14T15:53:07Z) - Sequential Modeling Enables Scalable Learning for Large Vision Models [120.91839619284431]
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data.
We define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources.
arXiv Detail & Related papers (2023-12-01T18:59:57Z) - Generalizable Imitation Learning Through Pre-Trained Representations [19.98418419179064]
We introduce BC-ViT, an imitation learning algorithm that leverages rich DINO pre-trained Visual Transformer (ViT) patch-level embeddings to obtain better generalization when learning through demonstrations.
Our learner sees the world by clustering appearance features into semantic concepts, forming stable keypoints that generalize across a wide range of appearance variations and object types.
arXiv Detail & Related papers (2023-11-15T20:15:51Z) - Zero-Shot Image Harmonization with Generative Model Prior [22.984119094424056]
We propose a zero-shot approach to image harmonization, aiming to overcome the reliance on large amounts of synthetic composite images.
We introduce a fully modularized framework inspired by human behavior.
We present compelling visual results across diverse scenes and objects, along with a user study validating our approach.
arXiv Detail & Related papers (2023-07-17T00:56:21Z) - Unifying (Machine) Vision via Counterfactual World Modeling [5.001446411351483]
We introduce Counterfactual World Modeling (CWM), a framework for constructing a visual foundation model.
CWM has two key components, which resolve the core issues that have hindered application of the foundation model concept to vision.
We show that CWM generates high-quality readouts on real-world images and videos for a diversity of tasks.
arXiv Detail & Related papers (2023-06-02T17:45:44Z) - Pre-training Contextualized World Models with In-the-wild Videos for
Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z) - Predictive Experience Replay for Continual Visual Control and
Forecasting [62.06183102362871]
We present a new continual learning approach for visual dynamics modeling and explore its efficacy in visual control and forecasting.
We first propose the mixture world model that learns task-specific dynamics priors with a mixture of Gaussians, and then introduce a new training strategy to overcome catastrophic forgetting.
Our model remarkably outperforms the naive combinations of existing continual learning and visual RL algorithms on DeepMind Control and Meta-World benchmarks with continual visual control tasks.
arXiv Detail & Related papers (2023-03-12T05:08:03Z) - Dream to Explore: Adaptive Simulations for Autonomous Systems [3.0664963196464448]
We tackle the problem of learning to control dynamical systems by applying Bayesian nonparametric methods.
By employing Gaussian processes to discover latent world dynamics, we mitigate common data efficiency issues observed in reinforcement learning.
Our algorithm jointly learns a world model and policy by optimizing a variational lower bound of a log-likelihood.
arXiv Detail & Related papers (2021-10-27T04:27:28Z) - Learning by Distillation: A Self-Supervised Learning Framework for
Optical Flow Estimation [71.76008290101214]
DistillFlow is a knowledge distillation approach to learning optical flow.
It achieves state-of-the-art unsupervised learning performance on both KITTI and Sintel datasets.
Our models ranked 1st among all monocular methods on the KITTI 2015 benchmark, and outperform all published methods on the Sintel Final benchmark.
arXiv Detail & Related papers (2021-06-08T09:13:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.