CM3: A Causal Masked Multimodal Model of the Internet
- URL: http://arxiv.org/abs/2201.07520v1
- Date: Wed, 19 Jan 2022 10:45:38 GMT
- Title: CM3: A Causal Masked Multimodal Model of the Internet
- Authors: Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin,
Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis,
Luke Zettlemoyer
- Abstract summary: We introduce CM3, a family of causally masked generative models trained over a large corpus of structured multi-modal documents.
We train causally masked language-image models on large-scale web and Wikipedia articles.
CM3 models can generate rich structured, multi-modal outputs while conditioning on arbitrary masked document contexts.
- Score: 86.32652030161374
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce CM3, a family of causally masked generative models trained over
a large corpus of structured multi-modal documents that can contain both text
and image tokens. Our new causally masked approach generates tokens left to
right while also masking out a small number of long token spans that are
generated at the end of the string, instead of their original positions. The
causal masking objective provides a hybrid of the more common causal and
masked language models, by enabling full generative modeling while also
providing bidirectional context when generating the masked spans. We train
causally masked language-image models on large-scale web and Wikipedia
articles, where each document contains all of the text, hypertext markup,
hyperlinks, and image tokens (from a VQVAE-GAN), provided in the order they
appear in the original HTML source (before masking). The resulting CM3 models
can generate rich structured, multi-modal outputs while conditioning on
arbitrary masked document contexts, and thereby implicitly learn a wide range
of text, image, and cross-modal tasks. They can be prompted to recover, in a
zero-shot fashion, the functionality of models such as DALL-E, GENRE, and HTLM.
We set the new state-of-the-art in zero-shot summarization, entity linking, and
entity disambiguation while maintaining competitive performance in the
fine-tuning setting. With a single model, we can generate images
unconditionally, generate images conditioned on text (like DALL-E), and
perform image captioning, all in a zero-shot setting.
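A minimal sketch of the causally masked objective, assuming a plain Python
token-list representation: a small number of long spans are cut out of the
sequence, a sentinel token is left at each original position, and the spans
are appended at the end of the string so a left-to-right model regenerates
them last, with bidirectional context from the rest of the document. The
sentinel names (<mask:i>), the span-sampling heuristic, and the toy
HTML-style document below are illustrative assumptions, not the paper's
exact recipe.

import random

def causal_mask_transform(tokens, n_spans=1, max_span_len=64, seed=0):
    # Cut out n_spans long spans, leave a sentinel at each original position,
    # and append the spans (each preceded by its sentinel) at the end so they
    # are generated last, with bidirectional context from the rest of the
    # document. Overlap handling for n_spans > 1 is omitted in this sketch.
    rng = random.Random(seed)
    tokens = list(tokens)
    spans = []
    for i in range(n_spans):
        span_len = rng.randint(1, max(1, min(max_span_len, len(tokens) // 4)))
        start = rng.randint(0, len(tokens) - span_len)
        spans.append(tokens[start:start + span_len])
        tokens[start:start + span_len] = [f"<mask:{i}>"]  # sentinel in place
    tail = []
    for i, span in enumerate(spans):
        tail += [f"<mask:{i}>"] + span
    return tokens + tail + ["<eos>"]

# Toy HTML-style document mixing text, markup, and image tokens (stand-ins
# for VQVAE-GAN codes); purely illustrative.
doc = "<img src= IMG_0 IMG_1 > <p> a photo of a cat </p>".split()
print(causal_mask_transform(doc))

At inference time the same format supports infilling: conditioning on a
document that contains a sentinel at the position of interest and sampling
after the trailing sentinel recovers the masked content.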
Related papers
- Multi-Modal Generative Embedding Model [34.34876575183736]
We propose a Multi-Modal Generative Embedding Model (MM-GEM), whereby the generative and embedding objectives are encapsulated in one Large Language Model.
For example, MM-GEM instantiated from ViT-Large and TinyLlama shows competitive performance on benchmarks for multimodal embedding models.
The advanced text model in MM-GEM brings over 5% improvement in Recall@1 for long text and image retrieval.
arXiv Detail & Related papers (2024-05-29T17:59:10Z)
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer [106.79844459065828]
This paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data.
It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context.
Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions.
arXiv Detail & Related papers (2024-01-18T18:50:16Z)
- Emu: Generative Pretraining in Multimodality [43.759593451544546]
Emu is a Transformer-based multimodal foundation model that can seamlessly generate images and text in a multimodal context.
Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks.
Emu demonstrates superb performance compared to state-of-the-art large multimodal models.
arXiv Detail & Related papers (2023-07-11T12:45:39Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking [83.09001231165985]
We propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking.
The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks.
arXiv Detail & Related papers (2022-04-18T16:19:52Z)