Multimodal Masked Autoencoders Learn Transferable Representations
- URL: http://arxiv.org/abs/2205.14204v1
- Date: Fri, 27 May 2022 19:09:42 GMT
- Title: Multimodal Masked Autoencoders Learn Transferable Representations
- Authors: Xinyang Geng, Hao Liu, Lisa Lee, Dale Schuurmans, Sergey Levine, Pieter Abbeel
- Abstract summary: We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE).
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
- Score: 127.35955819874063
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Building scalable models to learn from diverse, multimodal data remains an
open challenge. For vision-language data, the dominant approaches are based on
contrastive learning objectives that train a separate encoder for each
modality. While effective, contrastive learning approaches introduce sampling
bias depending on the data augmentations used, which can degrade performance on
downstream tasks. Moreover, these methods are limited to paired image-text
data, and cannot leverage widely-available unpaired data. In this paper, we
investigate whether a large multimodal model trained purely via masked token
prediction, without using modality-specific encoders or contrastive learning,
can learn transferable representations for downstream tasks. We propose a
simple and scalable network architecture, the Multimodal Masked Autoencoder
(M3AE), which learns a unified encoder for both vision and language data via
masked token prediction. We provide an empirical study of M3AE trained on a
large-scale image-text dataset, and find that M3AE is able to learn
generalizable representations that transfer well to downstream tasks.
Surprisingly, we find that M3AE benefits from a higher text mask ratio
(50-90%), in contrast to BERT whose standard masking ratio is 15%, due to the
joint training of two data modalities. We also provide qualitative analysis
showing that the learned representation incorporates meaningful information
from both image and language. Lastly, we demonstrate the scalability of M3AE
with larger model size and training time, and its flexibility to train on both
paired image-text data as well as unpaired data.
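To make the recipe above concrete, the following is a minimal, hypothetical sketch of the masking-and-encoding step in PyTorch: image patches and text tokens are embedded, a large random fraction of each is dropped, and a single shared transformer encodes the remaining mixed sequence. Module sizes, the specific mask ratios, and the omission of positional embeddings and the reconstruction decoder are simplifications assumed for illustration, not the authors' released implementation.

```python
# Minimal sketch of M3AE-style joint masking + unified encoding (assumed details).
import torch
import torch.nn as nn


def random_keep(x, keep_ratio):
    """Keep a random subset of tokens along dim 1; returns (kept_tokens, keep_indices)."""
    B, N, D = x.shape
    n_keep = max(1, int(N * keep_ratio))
    noise = torch.rand(B, N, device=x.device)
    keep_idx = noise.argsort(dim=1)[:, :n_keep]               # random n_keep positions
    kept = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, keep_idx


class UnifiedMaskedEncoder(nn.Module):
    """One shared transformer over the visible image-patch and text tokens."""

    def __init__(self, patch_dim=768, vocab_size=30522, d_model=512, depth=6):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d_model)       # flattened 16x16x3 patches
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.mod_embed = nn.Embedding(2, d_model)               # 0 = image, 1 = text
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches, text_ids, image_mask_ratio=0.75, text_mask_ratio=0.75):
        # Embed each modality, then drop (mask) most tokens before encoding.
        # Positional embeddings are omitted here for brevity.
        img = self.patch_embed(patches)                        # (B, N_img, d_model)
        txt = self.text_embed(text_ids)                        # (B, N_txt, d_model)
        img_vis, _ = random_keep(img, 1.0 - image_mask_ratio)
        txt_vis, _ = random_keep(txt, 1.0 - text_mask_ratio)   # 50-90% text masking per the abstract
        img_vis = img_vis + self.mod_embed.weight[0]
        txt_vis = txt_vis + self.mod_embed.weight[1]
        tokens = torch.cat([img_vis, txt_vis], dim=1)          # single mixed sequence
        return self.encoder(tokens)


# Toy usage: 196 patches (14x14 grid of 16x16x3 patches) and 32 text tokens.
model = UnifiedMaskedEncoder()
patches = torch.randn(2, 196, 768)
text_ids = torch.randint(0, 30522, (2, 32))
features = model(patches, text_ids)                            # (2, n_visible_tokens, 512)
```

In the full method described by the abstract, a decoder reconstructs the masked image patches and text tokens, and that masked token prediction is the sole training signal.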
Related papers
- Matryoshka Multimodal Models [92.41824727506751]
We propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens.
We find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens (a toy pooling sketch of this nesting follows the Related papers list).
arXiv Detail & Related papers (2024-05-27T17:59:56Z)
- Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition [6.995226697189459]
We employ a multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data.
Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks.
We release our pre-trained models as well as source code publicly.
arXiv Detail & Related papers (2024-04-16T20:51:36Z)
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [115.50132185963139]
CM3Leon is a decoder-only multi-modal language model capable of generating and infilling both text and images.
It is the first multi-modal model trained with a recipe adapted from text-only language models.
CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
arXiv Detail & Related papers (2023-09-05T21:27:27Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the heavy data requirements of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- MoMo: A shared encoder Model for text, image and multi-Modal representations [4.812718493682455]
We propose a self-supervised shared encoder model that achieves strong results on several visual, language and multimodal benchmarks.
We use a single transformer with all the encoder layers processing both the text and the image modalities.
arXiv Detail & Related papers (2023-04-11T22:26:10Z)
- Vision Learners Meet Web Image-Text Pairs [32.36188289972377]
In this work, we consider self-supervised pre-training on noisy, web-sourced image-text paired data.
We compare a range of methods, including single-modal ones that use masked training objectives and multi-modal ones that use image-text contrastive training.
We present a new visual representation pre-training method, MUlti-modal Generator (MUG), that learns from scalable web-sourced image-text data.
arXiv Detail & Related papers (2023-01-17T18:53:24Z)
- Multi-scale Transformer Network with Edge-aware Pre-training for Cross-Modality MR Image Synthesis [52.41439725865149]
Cross-modality magnetic resonance (MR) image synthesis can be used to generate missing modalities from given ones.
Existing (supervised learning) methods often require a large number of paired multi-modal data to train an effective synthesis model.
We propose a Multi-scale Transformer Network (MT-Net) with edge-aware pre-training for cross-modality MR image synthesis.
arXiv Detail & Related papers (2022-12-02T11:40:40Z)
- ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training [29.240131406803794]
We show that a common space can be created without any training at all, using single-domain encoders and a much smaller amount of image-text pairs.
Our model has unique properties, most notably that deploying a new version with updated training samples can be done in a matter of seconds (an illustrative sketch of such a training-free common space follows the Related papers list).
arXiv Detail & Related papers (2022-10-04T16:56:22Z)
- X-Learner: Learning Cross Sources and Tasks for Universal Visual Representation [71.51719469058666]
We propose a representation learning framework called X-Learner.
X-Learner learns the universal feature of multiple vision tasks supervised by various sources.
X-Learner achieves strong performance on different tasks without extra annotations, modalities and computational costs.
arXiv Detail & Related papers (2022-03-16T17:23:26Z)
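The Matryoshka entry above describes representing visual content as nested sets of visual tokens, with around 9 tokens recovering much of the accuracy of all 576. A toy way to picture such nesting, assuming the 576 tokens form a 24x24 grid and that coarser sets are obtained by average pooling (an assumption for illustration, not necessarily the paper's exact procedure), is sketched below.

```python
# Hypothetical illustration of nested ("Matryoshka") visual token sets via 2D average pooling.
import torch
import torch.nn.functional as F


def nested_token_sets(tokens, grid=24, scales=(24, 12, 6, 3, 1)):
    """tokens: (B, grid*grid, D). Returns a dict mapping token count -> pooled tokens."""
    B, N, D = tokens.shape
    assert N == grid * grid
    grid_tokens = tokens.transpose(1, 2).reshape(B, D, grid, grid)    # (B, D, 24, 24)
    out = {}
    for s in scales:
        pooled = F.adaptive_avg_pool2d(grid_tokens, s)                # (B, D, s, s)
        out[s * s] = pooled.flatten(2).transpose(1, 2)                # (B, s*s, D)
    return out


sets = nested_token_sets(torch.randn(1, 576, 1024))
print(sorted(sets.keys()))  # [1, 9, 36, 144, 576] -- e.g. the 9-token set mentioned in the summary
```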
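The ASIF entry above claims a common space built with no training at all from single-domain encoders and a modest set of image-text pairs. One illustrative way such a space can arise, offered here as a hedged sketch rather than the cited paper's exact recipe, is to describe every image and every caption by its similarities to a shared collection of paired anchor examples; because the anchors are paired, the two similarity vectors become directly comparable. The anchor count, encoders (random projections stand in for frozen unimodal encoders here), and normalization are assumptions.

```python
# Hedged sketch of a training-free common space via similarities to paired anchors.
import torch
import torch.nn.functional as F


def relative_embed(features, anchor_features):
    """Map features (B, D) to normalized similarity vectors (B, K) against K same-modality anchors."""
    f = F.normalize(features, dim=-1)
    a = F.normalize(anchor_features, dim=-1)
    return F.normalize(f @ a.T, dim=-1)  # each input becomes a point in the shared K-dim anchor space


# Toy setup: frozen unimodal encoders are replaced by random features for illustration.
K, D_img, D_txt = 1024, 768, 512
img_anchor_feats = torch.randn(K, D_img)   # image encoder applied to K anchor images
txt_anchor_feats = torch.randn(K, D_txt)   # text encoder applied to the K paired captions

query_img = relative_embed(torch.randn(4, D_img), img_anchor_feats)        # (4, K)
candidate_txt = relative_embed(torch.randn(10, D_txt), txt_anchor_feats)   # (10, K)
scores = query_img @ candidate_txt.T   # cross-modal retrieval scores, with no training involved
```

Under this reading, deploying an updated version amounts to swapping or extending the anchor set, which is consistent with the summary's claim that a new version can be produced in seconds.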