Retrieval-Augmented Multimodal Language Modeling
- URL: http://arxiv.org/abs/2211.12561v2
- Date: Tue, 6 Jun 2023 00:28:34 GMT
- Title: Retrieval-Augmented Multimodal Language Modeling
- Authors: Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, Wen-tau Yih
- Abstract summary: Multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation.
We propose a retrieval-augmented multimodal model, which enables a base multimodal model to refer to relevant text and images fetched by a retriever from external memory.
Our resulting model, named Retrieval-Augmented CM3, is the first multimodal model that can retrieve and generate both text and images.
- Score: 176.9150885247416
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent multimodal models such as DALL-E and CM3 have achieved remarkable
progress in text-to-image and image-to-text generation. However, these models
store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the
model parameters, requiring increasingly larger models and training data to
capture more knowledge. To integrate knowledge in a more scalable and modular
way, we propose a retrieval-augmented multimodal model, which enables a base
multimodal model (generator) to refer to relevant text and images fetched by a
retriever from external memory (e.g., documents on the web). Specifically, for
the retriever, we use a pretrained CLIP, and for the generator, we train a CM3
Transformer on the LAION dataset. Our resulting model, named
Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can
retrieve and generate both text and images. We show that RA-CM3 significantly
outperforms baseline multimodal models such as DALL-E and CM3 on both image and
caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while
requiring much less compute for training (<30% of DALL-E). Moreover, we show
that RA-CM3 exhibits novel capabilities, such as faithful image generation and
multimodal in-context learning (e.g., image generation from demonstrations).
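The retrieve-then-generate flow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the hand-made 3-dimensional vectors stand in for CLIP embeddings, the `MEMORY` entries are invented toy data, and prepending retrieved items to the prompt with a `[SEP]` separator and `<img:...>` placeholders is an assumption about how a generator might consume them. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, memory, k=2):
    """Return the k memory entries whose embeddings are most similar to the query."""
    ranked = sorted(memory, key=lambda doc: cosine(query_vec, doc["embedding"]), reverse=True)
    return ranked[:k]

def build_generator_input(prompt, retrieved):
    """Prepend retrieved (image, caption) documents to the prompt.

    Assumption: the generator reads retrieved documents as a prefix; the
    "<img:...>" placeholder and " [SEP] " separator are hypothetical.
    """
    context = [f"<img:{d['image_id']}> {d['caption']}" for d in retrieved]
    return " [SEP] ".join(context + [prompt])

# Toy external memory; embeddings are invented stand-ins for CLIP vectors.
MEMORY = [
    {"image_id": "eiffel_01", "caption": "The Eiffel Tower at dusk", "embedding": [0.9, 0.1, 0.0]},
    {"image_id": "cat_07", "caption": "A cat on a sofa", "embedding": [0.0, 0.2, 0.9]},
    {"image_id": "paris_12", "caption": "Paris skyline", "embedding": [0.7, 0.6, 0.1]},
]

query = [1.0, 0.2, 0.0]  # hypothetical embedding of the prompt below
top = retrieve(query, MEMORY, k=2)
print(build_generator_input("An image of the Eiffel Tower in winter", top))
```

In this toy setup the two Paris-related entries rank above the unrelated one, so the generator's input would carry relevant visual and textual context that it does not need to store in its own parameters.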
Related papers
- Multi-Modal Generative Embedding Model [34.34876575183736]
We propose a Multi-Modal Generative Embedding Model (MM-GEM), whereby the generative and embedding objectives are encapsulated in one Large Language Model.
For example, MM-GEM instantiated from ViT-Large and TinyLlama shows competitive performance on benchmarks for multimodal embedding models.
The advanced text model in MM-GEM brings over 5% improvement in Recall@1 for long text and image retrieval.
arXiv Detail & Related papers (2024-05-29T17:59:10Z)
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models using labeled triplets of reference image, text, and target image.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z)
- LRM: Large Reconstruction Model for Single Image to 3D [61.47357798633123]
We propose the first Large Reconstruction Model (LRM) that predicts the 3D model of an object from a single input image within just 5 seconds.
LRM adopts a highly scalable transformer-based architecture with 500 million learnable parameters to directly predict a neural radiance field (NeRF) from the input image.
We train our model in an end-to-end manner on massive multi-view data containing around 1 million objects.
arXiv Detail & Related papers (2023-11-08T00:03:52Z)
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [115.50132185963139]
CM3Leon is a decoder-only multi-modal language model capable of generating and infilling both text and images.
It is the first multi-modal model trained with a recipe adapted from text-only language models.
CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
arXiv Detail & Related papers (2023-09-05T21:27:27Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- MoMo: A shared encoder Model for text, image and multi-Modal representations [4.812718493682455]
We propose a self-supervised shared encoder model that achieves strong results on several visual, language and multimodal benchmarks.
We use a single transformer with all the encoder layers processing both the text and the image modalities.
arXiv Detail & Related papers (2023-04-11T22:26:10Z)
- Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks [87.6494641931349]
We introduce a general-purpose multimodal foundation model BEiT-3.
It achieves state-of-the-art transfer performance on both vision and vision-language tasks.
arXiv Detail & Related papers (2022-08-22T16:55:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.