Masked Modeling for Self-supervised Representation Learning on Vision
and Beyond
- URL: http://arxiv.org/abs/2401.00897v2
- Date: Tue, 9 Jan 2024 16:09:47 GMT
- Title: Masked Modeling for Self-supervised Representation Learning on Vision
and Beyond
- Authors: Siyuan Li, Luyuan Zhang, Zedong Wang, Di Wu, Lirong Wu, Zicheng Liu,
Jun Xia, Cheng Tan, Yang Liu, Baigui Sun, Stan Z. Li
- Abstract summary: Masked modeling has emerged as a distinctive approach that involves predicting parts of the original data that are proportionally masked during training.
We elaborate on the details of techniques within masked modeling, including diverse masking strategies, recovering targets, network architectures, and more.
We conclude by discussing the limitations of current techniques and point out several potential avenues for advancing masked modeling research.
- Score: 69.64364187449773
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the deep learning revolution marches on, self-supervised learning has
garnered increasing attention in recent years thanks to its remarkable
representation learning ability and the low dependence on labeled data. Among
these varied self-supervised techniques, masked modeling has emerged as a
distinctive approach that involves predicting parts of the original data that
are proportionally masked during training. This paradigm enables deep models to
learn robust representations and has demonstrated exceptional performance in
the context of computer vision, natural language processing, and other
modalities. In this survey, we present a comprehensive review of the masked
modeling framework and its methodology. We elaborate on the details of
techniques within masked modeling, including diverse masking strategies,
recovering targets, network architectures, and more. Then, we systematically
investigate its wide-ranging applications across domains. Furthermore, we also
explore the commonalities and differences between masked modeling methods in
different fields. Toward the end of this paper, we conclude by discussing the
limitations of current techniques and point out several potential avenues for
advancing masked modeling research. A paper list project with this survey is
available at \url{https://github.com/Lupin1998/Awesome-MIM}.
Related papers
- Advances in Diffusion Models for Image Data Augmentation: A Review of Methods, Models, Evaluation Metrics and Future Research Directions [6.2719115566879236]
Diffusion Models (DMs) have emerged as a powerful tool for image data augmentation.
DMs generate realistic and diverse images by learning the underlying data distribution.
Current challenges and future research directions in the field are discussed.
arXiv Detail & Related papers (2024-07-04T18:06:48Z) - Revisiting Self-supervised Learning of Speech Representation from a
Mutual Information Perspective [68.20531518525273]
We take a closer look into existing self-supervised methods of speech from an information-theoretic perspective.
We use linear probes to estimate the mutual information between the target information and learned representations.
We explore the potential of evaluating representations in a self-supervised fashion, where we estimate the mutual information between different parts of the data without using any labels.
arXiv Detail & Related papers (2024-01-16T21:13:22Z) - Overview of Class Activation Maps for Visualization Explainability [0.0]
Class Activation Maps (CAMs) enhance interpretability and insights into the decision-making process of deep learning models.
This work presents a comprehensive overview of the evolution of Class Activation Maps over time.
It also explores the metrics used for evaluating CAMs and introduces auxiliary techniques to improve the saliency of these methods.
arXiv Detail & Related papers (2023-09-25T17:20:51Z) - Self-supervised Multi-view Clustering in Computer Vision: A Survey [14.432997752719473]
Multi-view clustering (MVC) has had significant implications in cross-modal representation learning and data-driven decision-making.
This paper explores the reasons and advantages of the emergence of self-supervised MVC.
arXiv Detail & Related papers (2023-09-18T04:11:18Z) - Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene or manipulating the robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z) - MinT: Boosting Generalization in Mathematical Reasoning via Multi-View
Fine-Tuning [53.90744622542961]
Reasoning in mathematical domains remains a significant challenge for small language models (LMs)
We introduce a new method that exploits existing mathematical problem datasets with diverse annotation styles.
Experimental results show that our strategy enables a LLaMA-7B model to outperform prior approaches.
arXiv Detail & Related papers (2023-07-16T05:41:53Z) - Pre-training Contextualized World Models with In-the-wild Videos for
Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z) - Internal Representations of Vision Models Through the Lens of Frames on
Data Manifolds [8.67467876089153]
We present a new approach to studying such representations inspired by the idea of a frame on the tangent bundle of a manifold.
Our construction, which we call a neural frame, is formed by assembling a set of vectors representing specific types of perturbations of a data point.
Using neural frames, we make observations about the way that models process, layer-by-layer, specific modes of variation within a small neighborhood of a datapoint.
arXiv Detail & Related papers (2022-11-19T01:48:19Z) - Towards Interpretable Deep Reinforcement Learning Models via Inverse
Reinforcement Learning [27.841725567976315]
We propose a novel framework utilizing Adversarial Inverse Reinforcement Learning.
This framework provides global explanations for decisions made by a Reinforcement Learning model.
We capture intuitive tendencies that the model follows by summarizing the model's decision-making process.
arXiv Detail & Related papers (2022-03-30T17:01:59Z) - Explainable Matrix -- Visualization for Global and Local
Interpretability of Random Forest Classification Ensembles [78.6363825307044]
We propose Explainable Matrix (ExMatrix), a novel visualization method for Random Forest (RF) interpretability.
It employs a simple yet powerful matrix-like visual metaphor, where rows are rules, columns are features, and cells are rules predicates.
ExMatrix applicability is confirmed via different examples, showing how it can be used in practice to promote RF models interpretability.
arXiv Detail & Related papers (2020-05-08T21:03:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.