MoMo: A shared encoder Model for text, image and multi-Modal representations
- URL: http://arxiv.org/abs/2304.05523v1
- Date: Tue, 11 Apr 2023 22:26:10 GMT
- Title: MoMo: A shared encoder Model for text, image and multi-Modal representations
- Authors: Rakesh Chada, Zhaoheng Zheng, Pradeep Natarajan
- Abstract summary: We propose a self-supervised shared encoder model that achieves strong results on several visual, language and multimodal benchmarks.
We use a single transformer with all the encoder layers processing both the text and the image modalities.
- Score: 4.812718493682455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a self-supervised shared encoder model that achieves strong
results on several visual, language and multimodal benchmarks while being data,
memory and run-time efficient. We make three key contributions. First, in
contrast to most existing works, we use a single transformer with all the
encoder layers processing both the text and the image modalities. Second, we
propose a stage-wise training strategy where the model is first trained on
images, then jointly with unimodal text and image datasets and finally jointly
with text and text-image datasets. Third, to preserve information across both
the modalities, we propose a training pipeline that learns simultaneously from
gradient updates of different modalities at each training update step. The
results on downstream text-only, image-only and multimodal tasks show that our
model is competitive with several strong models while using fewer parameters
and less pre-training data. For example, MoMo performs competitively with
FLAVA on multimodal (+3.1), image-only (+1.1) and text-only (-0.1) tasks
despite having 2/5th the number of parameters and using 1/3rd the image-text
training pairs. Finally, we ablate various design choices and further show that
increasing model size produces significant performance gains, indicating
potential for substantial improvements with larger models using our approach.
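To make the third contribution more concrete, the sketch below shows one such update in PyTorch-style pseudocode: a single shared transformer encodes text tokens, image patches, or both, and the losses from a text batch, an image batch, and an image-text batch are summed so that a single optimizer step receives gradients from every modality. This is a minimal illustration, not the authors' implementation; the names (SharedEncoder, joint_update, loss_heads), the vocabulary and patch sizes, and the choice of loss heads are all assumptions made for the example.

```python
# Minimal sketch (not the authors' released code) of one training update in
# which a single shared transformer receives gradients from text-only,
# image-only and image-text batches at the same step.
import torch
import torch.nn as nn


class SharedEncoder(nn.Module):
    """One transformer stack whose layers process text and image tokens alike."""

    def __init__(self, dim=768, depth=12, num_heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.text_embed = nn.Embedding(30522, dim)      # assumed vocabulary size
        self.patch_embed = nn.Linear(3 * 16 * 16, dim)  # assumed 16x16 patch projection

    def forward(self, text_ids=None, image_patches=None):
        parts = []
        if text_ids is not None:
            parts.append(self.text_embed(text_ids))
        if image_patches is not None:
            parts.append(self.patch_embed(image_patches))
        # The same encoder layers see text tokens, image patches, or both.
        return self.blocks(torch.cat(parts, dim=1))


def joint_update(encoder, loss_heads, optimizer, text_batch, image_batch, pair_batch):
    """Sum per-modality losses so one optimizer step mixes gradients from all modalities."""
    optimizer.zero_grad()
    loss = (
        loss_heads["text"](encoder(text_ids=text_batch["ids"]), text_batch)
        + loss_heads["image"](encoder(image_patches=image_batch["patches"]), image_batch)
        + loss_heads["pair"](
            encoder(text_ids=pair_batch["ids"], image_patches=pair_batch["patches"]),
            pair_batch,
        )
    )
    loss.backward()   # gradients from every modality flow into the shared weights
    optimizer.step()  # a single parameter update reflects all modalities at once
    return loss.item()
```

Under this view, the stage-wise schedule from the second contribution amounts to activating different subsets of these loss terms over the course of training: image batches first, then unimodal text and image batches together, and finally text and paired image-text batches.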
Related papers
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training [103.72844619581811]
We build performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices.
We demonstrate that large-scale multimodal pre-training benefits from a careful mix of image-caption, interleaved image-text, and text-only data.
arXiv Detail & Related papers (2024-03-14T17:51:32Z)
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [46.76487873983082]
Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating image, text, audio, and action.
We train our model from scratch on a large multimodal pre-training corpus from diverse sources.
With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark.
arXiv Detail & Related papers (2023-12-28T17:57:06Z)
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [115.50132185963139]
CM3Leon is a decoder-only multi-modal language model capable of generating and infilling both text and images.
It is the first multi-modal model trained with a recipe adapted from text-only language models.
CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
arXiv Detail & Related papers (2023-09-05T21:27:27Z)
- UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight interpolation.
arXiv Detail & Related papers (2023-07-30T09:48:36Z)
- Emu: Generative Pretraining in Multimodality [43.759593451544546]
Emu is a Transformer-based multimodal foundation model that can seamlessly generate images and text in a multimodal context.
Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks.
Emu demonstrates superb performance compared to state-of-the-art large multimodal models.
arXiv Detail & Related papers (2023-07-11T12:45:39Z)
- ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training [29.240131406803794]
We show that a common space can be created without any training at all, using single-domain encoders and a much smaller amount of image-text pairs.
Our model has unique properties; most notably, a new version with updated training samples can be deployed in a matter of seconds.
arXiv Detail & Related papers (2022-10-04T16:56:22Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining [76.32065400614162]
We propose a novel model, namely InterBERT (BERT for Interaction), which is the first model in our M6 series of multimodal pretraining methods.
The model has a strong capability for modeling the interaction between the information flows of different modalities.
We propose a large-scale dataset for multi-modal pretraining in Chinese, and we develop the Chinese InterBERT, which is the first Chinese multi-modal pretrained model.
arXiv Detail & Related papers (2020-03-30T03:13:22Z)
- ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data [9.3935916515127]
We introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding.
It is a Transformer-based model that takes different modalities as input and models the relationships between them.
arXiv Detail & Related papers (2020-01-22T11:35:58Z)