MaskViT: Masked Visual Pre-Training for Video Prediction
- URL: http://arxiv.org/abs/2206.11894v1
- Date: Thu, 23 Jun 2022 17:59:33 GMT
- Title: MaskViT: Masked Visual Pre-Training for Video Prediction
- Authors: Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, Li Fei-Fei
- Abstract summary: We create good video prediction models by pre-training transformers via masked visual modeling.
MaskViT outperforms prior works in video prediction, is parameter efficient and can generate high-resolution videos.
Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling.
- Score: 29.25521342538311
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to predict future visual observations conditioned on past
observations and motor commands can enable embodied agents to plan solutions to
a variety of tasks in complex environments. This work shows that we can create
good video prediction models by pre-training transformers via masked visual
modeling. Our approach, named MaskViT, is based on two simple design decisions.
First, for memory and training efficiency, we use two types of window
attention: spatial and spatiotemporal. Second, during training, we mask a
variable percentage of tokens instead of a fixed mask ratio. For inference,
MaskViT generates all tokens via iterative refinement where we incrementally
decrease the masking ratio following a mask scheduling function. On several
datasets we demonstrate that MaskViT outperforms prior works in video
prediction, is parameter efficient, and can generate high-resolution videos
(256x256). Further, we demonstrate the benefits of inference speedup (up to
512x) due to iterative decoding by using MaskViT for planning on a real robot.
Our work suggests that we can endow embodied agents with powerful predictive
models by leveraging the general framework of masked visual modeling with
minimal domain knowledge.
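The abstract above notes that MaskViT masks a variable percentage of tokens during training rather than using a fixed ratio. The snippet below is a minimal sketch of that idea in PyTorch, assuming future frames have already been quantized into a (B, N) grid of codebook indices; the helper name apply_variable_mask, the ratio bounds, and the use of a special mask_id token are illustrative assumptions, not code from the paper.

```python
import torch

def apply_variable_mask(tokens: torch.Tensor, mask_id: int,
                        low: float = 0.5, high: float = 1.0):
    """Mask a randomly drawn fraction of future-frame tokens for one batch.

    tokens:   (B, N) LongTensor of visual-codebook indices for the frames
              to be predicted.
    mask_id:  index of the special [MASK] token.
    low/high: bounds for the per-batch mask ratio (illustrative values only).
    Returns the corrupted token grid and the boolean mask marking which
    positions the reconstruction loss should be computed on.
    """
    ratio = torch.empty(1).uniform_(low, high).item()            # redrawn every batch
    mask = torch.rand(tokens.shape, device=tokens.device) < ratio
    corrupted = tokens.clone()
    corrupted[mask] = mask_id
    return corrupted, mask
```

Training would then minimize a cross-entropy loss over the masked positions only; because the ratio changes every batch, the model is exposed to the full range of corruption levels it later encounters during iterative decoding.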
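For inference, the abstract describes generating all tokens by iterative refinement, lowering the masking ratio step by step according to a mask scheduling function. Below is a minimal sketch of such a decoding loop, assuming a MaskGIT-style cosine schedule and a generic predict_logits callable standing in for MaskViT's bidirectional transformer; the function names, step count, and confidence-based token selection are assumptions for illustration, not the paper's implementation.

```python
import math
import torch

def cosine_mask_schedule(step: int, total_steps: int) -> float:
    """Fraction of the originally masked tokens still masked after `step`."""
    return math.cos(math.pi / 2 * (step + 1) / total_steps)

@torch.no_grad()
def iterative_decode(predict_logits, tokens: torch.Tensor, mask_id: int,
                     total_steps: int = 12) -> torch.Tensor:
    """Fill in all masked token positions by iterative refinement.

    predict_logits: callable mapping a (B, N) token grid to (B, N, V) logits
                    (a stand-in for the video transformer).
    tokens:         (B, N) LongTensor; positions to be generated equal mask_id.
    """
    B, N = tokens.shape
    masked = tokens == mask_id
    num_to_fill = masked.sum(dim=1)                      # initially masked count per sample

    for step in range(total_steps):
        logits = predict_logits(tokens)                  # (B, N, V)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)  # per-position confidence and argmax
        conf = conf.masked_fill(~masked, float("inf"))   # known tokens are always kept

        # How many tokens may remain masked after this refinement step.
        keep = (cosine_mask_schedule(step, total_steps) * num_to_fill).floor().long()

        for b in range(B):
            order = conf[b].argsort(descending=True)     # most confident positions first
            accept = torch.zeros(N, dtype=torch.bool, device=tokens.device)
            accept[order[: N - int(keep[b])]] = True
            newly_filled = accept & masked[b]            # confident, previously masked positions
            tokens[b, newly_filled] = pred[b, newly_filled]
            masked[b, newly_filled] = False

    return tokens  # the schedule reaches 0 on the final step, so nothing remains masked
```

Because many tokens are committed per step, the number of forward passes is a small constant rather than one pass per token, which is where the inference speedup mentioned in the abstract comes from.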
Related papers
- ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework.
We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise.
We demonstrate our strategy's superiority in downstream tasks compared to random masking.
arXiv Detail & Related papers (2024-07-17T22:04:00Z) - Sample-specific Masks for Visual Reprogramming-based Prompting [20.27639343292564]
Visual reprogramming (VR) is a prompting technique that aims to re-purpose a pre-trained model for target tasks.
In this paper, we show that the shared mask potentially limits VR's generalization and increases its approximation error.
Motivated by this finding, we design a new framework for VR called sample-specific multi-channel masks (SMM)
arXiv Detail & Related papers (2024-06-05T11:15:43Z) - Mask Propagation for Efficient Video Semantic Segmentation [63.09523058489429]
Video Semantic Segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence.
We propose an efficient mask propagation framework for VSS, called MPVSS.
Our framework reduces up to 4x FLOPs compared to the per-frame Mask2Former baseline with only up to 2% mIoU degradation on the Cityscapes validation set.
arXiv Detail & Related papers (2023-10-29T09:55:28Z) - Toward a Deeper Understanding: RetNet Viewed through Convolution [25.8904146140577]
Vision Transformer (ViT) can learn global dependencies better than CNN, yet CNN's inherent locality can substitute for expensive training resources.
This paper investigates the effectiveness of RetNet from a CNN perspective and presents a variant of RetNet tailored to the visual domain.
We propose a novel Gaussian mixture mask (GMM) in which one mask only has two learnable parameters and it can be conveniently used in any ViT variants whose attention mechanism allows the use of masks.
arXiv Detail & Related papers (2023-09-11T10:54:22Z) - MP-Former: Mask-Piloted Transformer for Image Segmentation [16.620469868310288]
Mask2Former suffers from inconsistent mask predictions between decoder layers.
We propose a mask-piloted training approach, which feeds noised ground-truth masks in masked-attention and trains the model to reconstruct the original ones.
arXiv Detail & Related papers (2023-03-13T17:57:59Z) - Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z) - Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z) - ConvMAE: Masked Convolution Meets Masked Autoencoders [65.15953258300958]
Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potential of ViT.
Our ConvMAE framework demonstrates that multi-scale hybrid convolution-transformer can learn more discriminative representations via the mask auto-encoding scheme.
Based on our pretrained ConvMAE models, ConvMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base.
arXiv Detail & Related papers (2022-05-08T15:12:19Z) - Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.