OmniMAE: Single Model Masked Pretraining on Images and Videos
- URL: http://arxiv.org/abs/2206.08356v2
- Date: Wed, 31 May 2023 04:53:11 GMT
- Title: OmniMAE: Single Model Masked Pretraining on Images and Videos
- Authors: Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
- Abstract summary: Masked autoencoding can be used to train a simple Vision Transformer on images and videos.
We show that our single ViT-Huge model can be finetuned to achieve 86.6% on ImageNet and 75.5% on the challenging Something Something-v2 video benchmark.
- Score: 40.985481596672265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based architectures have become competitive across a variety of
visual domains, most notably images and videos. While prior work studies these
modalities in isolation, having a common architecture suggests that one can
train a single unified model for multiple visual modalities. Prior attempts at
unified modeling typically use architectures tailored for vision tasks, or
obtain worse performance compared to single modality models. In this work, we
show that masked autoencoding can be used to train a simple Vision Transformer
on images and videos, without requiring any labeled data. This single model
learns visual representations that are comparable to or better than
single-modality representations on both image and video benchmarks, while using
a much simpler architecture. Furthermore, this model can be learned by dropping
90% of the image and 95% of the video patches, enabling extremely fast training
of huge model architectures. In particular, we show that our single ViT-Huge
model can be finetuned to achieve 86.6% on ImageNet and 75.5% on the
challenging Something Something-v2 video benchmark, setting a new
state-of-the-art.
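The abstract describes training with a very high masking ratio: 90% of image patches and 95% of video patches are dropped, and only the visible tokens are passed to the encoder. Below is a minimal sketch of such random patch masking, not the authors' released code; the helper name random_masking, the tensor shapes, and the treatment of images as single-frame videos are illustrative assumptions.

```python
# Minimal sketch of high-ratio random patch masking for a unified image/video
# masked autoencoder, assuming ViT-style patch embeddings. Illustrative only.
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float):
    """Keep a random subset of patch tokens; return visible tokens and their indices.

    tokens: (batch, num_patches, dim) patch embeddings. Images can be treated as
    single-frame videos, so the same routine covers both modalities.
    """
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)   # one random score per patch
    ids_shuffle = noise.argsort(dim=1)               # random permutation of patches
    ids_keep = ids_shuffle[:, :num_keep]             # indices of visible patches
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_keep

# Usage: drop 90% of image patches or 95% of video patches, as in the abstract.
img_tokens = torch.randn(2, 196, 768)    # e.g. 14x14 patches from a 224px image
vid_tokens = torch.randn(2, 1568, 768)   # e.g. 8x14x14 spatio-temporal patches
img_visible, _ = random_masking(img_tokens, mask_ratio=0.90)
vid_visible, _ = random_masking(vid_tokens, mask_ratio=0.95)
print(img_visible.shape, vid_visible.shape)  # only visible tokens go to the encoder
```

Because the encoder only ever sees the small visible subset, the bulk of the compute scales with roughly 5-10% of the patches, which is what makes pretraining huge architectures fast.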
Related papers
- Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions [94.03133100056372]
Moonshot is a new video generation model that conditions simultaneously on multimodal inputs of image and text.
The model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation and video editing.
arXiv Detail & Related papers (2024-01-03T16:43:47Z)
- Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack [75.00066365801993]
Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text.
These pre-trained models often face challenges when it comes to generating highly aesthetic images.
We propose quality-tuning to guide a pre-trained model to exclusively generate highly visually appealing images.
arXiv Detail & Related papers (2023-09-27T17:30:19Z)
- UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight generalization.
arXiv Detail & Related papers (2023-07-30T09:48:36Z)
- AIM: Adapting Image Models for Efficient Video Action Recognition [22.805026175928997]
We propose a method to Adapt pre-trained Image Models (AIM) for efficient video understanding.
By freezing the pre-trained image model and adding a few lightweight Adapters, we introduce spatial adaptation, temporal adaptation and joint adaptation.
We show that our proposed AIM can achieve competitive or even better performance than prior art with substantially fewer tunable parameters (a minimal adapter sketch is given after this list).
arXiv Detail & Related papers (2023-02-06T18:59:17Z)
- Omnivore: A Single Model for Many Visual Modalities [47.94002558594031]
Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data.
We propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters.
arXiv Detail & Related papers (2022-01-20T18:58:03Z)
- A strong baseline for image and video quality assessment [4.73466728067544]
We present a simple yet effective unified model for perceptual quality assessment of images and videos.
Our model achieves comparable performance by applying only a single global feature derived from a backbone network.
Based on the proposed architecture, we release models trained for three common real-world scenarios.
arXiv Detail & Related papers (2021-11-13T12:24:08Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
- Unified Image and Video Saliency Modeling [21.701431656717112]
We ask: Can image and video saliency modeling be approached via a unified model?
We propose four novel domain adaptation techniques and an improved formulation of learned Gaussian priors.
We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and video saliency data.
We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and MIT300.
arXiv Detail & Related papers (2020-03-11T18:28:29Z)
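The AIM entry above describes freezing a pre-trained image backbone and training only lightweight adapters. The sketch below illustrates that general idea, not AIM's actual implementation: the Adapter module, its bottleneck width, and the single TransformerEncoderLayer used as a stand-in backbone are illustrative assumptions.

```python
# Minimal sketch of a frozen backbone plus a trainable bottleneck adapter.
# Sizes and module placement are illustrative only.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight bottleneck adapter with a residual connection."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

# Freeze a pre-trained backbone (here a stand-in layer) and train only the adapter.
backbone = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False
adapter = Adapter(dim=768)

tokens = torch.randn(2, 196, 768)   # e.g. patch tokens from one frame
out = adapter(backbone(tokens))     # only the adapter parameters receive gradients
print(sum(p.numel() for p in adapter.parameters()))  # ~0.1M trainable parameters
```

The design point in the AIM blurb is exactly this trade-off: the frozen backbone keeps the pre-trained representation intact, while the small residual modules add the task-specific (e.g. temporal) capacity at a fraction of the tunable parameter count.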
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.