Retrieval-Augmented Multimodal Language Modeling
- URL: http://arxiv.org/abs/2211.12561v2
- Date: Tue, 6 Jun 2023 00:28:34 GMT
- Title: Retrieval-Augmented Multimodal Language Modeling
- Authors: Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure
Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, Wen-tau Yih
- Abstract summary: Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation.
We propose a retrieval-augmented multimodal model, which enables a base multimodal model to refer to relevant text and images fetched by a retriever from external memory.
Our resulting model, named Retrieval-Augmented CM3, is the first multimodal model that can retrieve and generate both text and images.
- Score: 176.9150885247416
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent multimodal models such as DALL-E and CM3 have achieved remarkable
progress in text-to-image and image-to-text generation. However, these models
store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the
model parameters, requiring increasingly larger models and training data to
capture more knowledge. To integrate knowledge in a more scalable and modular
way, we propose a retrieval-augmented multimodal model, which enables a base
multimodal model (generator) to refer to relevant text and images fetched by a
retriever from external memory (e.g., documents on the web). Specifically, for
the retriever, we use a pretrained CLIP, and for the generator, we train a CM3
Transformer on the LAION dataset. Our resulting model, named
Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can
retrieve and generate both text and images. We show that RA-CM3 significantly
outperforms baseline multimodal models such as DALL-E and CM3 on both image and
caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while
requiring much less compute for training (<30% of DALL-E). Moreover, we show
that RA-CM3 exhibits novel capabilities, such as faithful image generation and
multimodal in-context learning (e.g., image generation from demonstrations).
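
The retrieve-then-generate pipeline described in the abstract is easy to prototype. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: it uses Hugging Face's CLIPModel/CLIPProcessor as the frozen retriever, a small in-memory list of (caption, image) documents as the external memory, and a hypothetical `generator` callable standing in for the CM3 Transformer; RA-CM3 itself retrieves from a web-scale (LAION) memory and prepends the retrieved documents to the generator's input sequence.

```python
# Minimal retrieve-then-generate sketch; assumptions noted in the text above.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"
clip = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def embed_text(texts):
    # CLIP text embeddings, L2-normalized so dot products are cosine similarities.
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    return F.normalize(clip.get_text_features(**inputs), dim=-1)

@torch.no_grad()
def embed_images(images):
    # CLIP image embeddings for a list of PIL images, L2-normalized.
    inputs = processor(images=images, return_tensors="pt")
    return F.normalize(clip.get_image_features(**inputs), dim=-1)

def retrieve(query_text, memory, k=2):
    """Return the top-k (caption, image) documents from `memory`, scoring each
    document by cosine similarity between the query's text embedding and a
    mixed-modal document embedding (mean of caption and image embeddings)."""
    q = embed_text([query_text])                          # (1, d)
    cap = embed_text([c for c, _ in memory])              # (N, d)
    img = embed_images([im for _, im in memory])          # (N, d)
    doc = F.normalize((cap + img) / 2, dim=-1)            # (N, d)
    scores = (q @ doc.T).squeeze(0)                       # (N,)
    top = scores.topk(min(k, len(memory))).indices.tolist()
    return [memory[i] for i in top]

def generate_with_retrieval(prompt, memory, generator=None, k=2):
    """Prepend retrieved documents to the prompt and hand the combined context
    to a CM3-style generator (`generator` is a hypothetical callable here)."""
    retrieved = retrieve(prompt, memory, k=k)
    context = retrieved + [prompt]
    if generator is not None:
        return generator(context)  # e.g., decode an image or a caption
    return context                 # without a generator, just inspect the context

# Example usage with a toy two-document memory (hypothetical file names):
# from PIL import Image
# memory = [("a photo of the Eiffel Tower", Image.open("eiffel.jpg")),
#           ("a red double-decker bus in London", Image.open("bus.jpg"))]
# ctx = generate_with_retrieval("paint the Eiffel Tower at night", memory, k=1)
```

Averaging the caption and image embeddings is one simple mixed-modal scoring choice assumed here; the property that matters is that the retriever stays frozen while the generator sees the retrieved documents only as extra in-context tokens.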
Related papers
- Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models [79.59567114769513]
We introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images.
Our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 24.94% and even surpassing much larger 70B models.
arXiv Detail & Related papers (2025-01-10T07:56:23Z)
- Multi-View Large Reconstruction Model via Geometry-Aware Positional Encoding and Attention [54.66152436050373]
We propose a Multi-view Large Reconstruction Model (M-LRM) to reconstruct high-quality 3D shapes from multi-views in a 3D-aware manner.
Specifically, we introduce a multi-view consistent cross-attention scheme to enable M-LRM to accurately query information from the input images.
Compared to previous methods, the proposed M-LRM can generate 3D shapes of high fidelity.
arXiv Detail & Related papers (2024-06-11T18:29:13Z)
- Multi-Modal Generative Embedding Model [34.34876575183736]
We propose a Multi-Modal Generative Embedding Model (MM-GEM), whereby the generative and embedding objectives are encapsulated in one Large Language Model.
For example, MM-GEM instantiated from ViT-Large and TinyLlama shows competitive performance on benchmarks for multimodal embedding models.
The advanced text model in MM-GEM brings over 5% improvement in Recall@1 for long text and image retrieval.
arXiv Detail & Related papers (2024-05-29T17:59:10Z)
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [115.50132185963139]
CM3Leon is a decoder-only multi-modal language model capable of generating and infilling both text and images.
It is the first multi-modal model trained with a recipe adapted from text-only language models.
CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
arXiv Detail & Related papers (2023-09-05T21:27:27Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- MoMo: A shared encoder Model for text, image and multi-Modal representations [4.812718493682455]
We propose a self-supervised shared encoder model that achieves strong results on several visual, language and multimodal benchmarks.
We use a single transformer with all the encoder layers processing both the text and the image modalities.
arXiv Detail & Related papers (2023-04-11T22:26:10Z)
- Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks [87.6494641931349]
We introduce a general-purpose multimodal foundation model BEiT-3.
It achieves state-of-the-art transfer performance on both vision and vision-language tasks.
arXiv Detail & Related papers (2022-08-22T16:55:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.