Grounding Language Models to Images for Multimodal Inputs and Outputs
- URL: http://arxiv.org/abs/2301.13823v4
- Date: Tue, 13 Jun 2023 21:54:58 GMT
- Title: Grounding Language Models to Images for Multimodal Inputs and Outputs
- Authors: Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried
- Abstract summary: We propose an efficient method to ground pretrained text-only language models to the visual domain.
The resulting model processes arbitrarily interleaved image-and-text data and generates text interleaved with retrieved images.
- Score: 89.30027812161686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an efficient method to ground pretrained text-only language models
to the visual domain, enabling them to process arbitrarily interleaved
image-and-text data, and generate text interleaved with retrieved images. Our
method leverages the abilities of language models learnt from large scale
text-only pretraining, such as in-context learning and free-form text
generation. We keep the language model frozen, and finetune input and output
linear layers to enable cross-modality interactions. This allows our model to
process arbitrarily interleaved image-and-text inputs, and generate free-form
text interleaved with retrieved images. We achieve strong zero-shot performance
on grounded tasks such as contextual image retrieval and multimodal dialogue,
and showcase compelling interactive abilities. Our approach works with any
off-the-shelf language model and paves the way towards an effective, general
solution for leveraging pretrained language models in visually grounded
settings.
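The recipe stated in the abstract (keep the language model frozen, finetune only input and output linear layers) can be pictured with a short PyTorch sketch. This is a minimal illustration and not the authors' released code: the class name, the CLIP-style vision encoder, the embedding dimensions, and the interleaving order are assumptions made for the example.

```python
# Sketch only: frozen text-only LM + frozen image encoder, with two small
# trainable linear layers providing the cross-modality interactions.
import torch
import torch.nn as nn

class GroundedLM(nn.Module):
    """Minimal sketch of grounding a frozen LM via input/output linear layers."""

    def __init__(self, lm, vision_encoder, lm_dim=4096, vis_dim=768, ret_dim=512):
        super().__init__()
        self.lm = lm                          # frozen pretrained text-only LM
        self.vision_encoder = vision_encoder  # frozen image encoder (e.g. a CLIP ViT)
        for module in (self.lm, self.vision_encoder):
            for p in module.parameters():
                p.requires_grad = False
        # Trainable input projection: image features -> LM token-embedding space.
        self.visual_in = nn.Linear(vis_dim, lm_dim)
        # Trainable output projection: LM hidden state -> shared retrieval space.
        self.text_out = nn.Linear(lm_dim, ret_dim)

    def interleave(self, token_embs, image_feats):
        # Project image features to "soft tokens" and splice them into the text
        # sequence, so the frozen LM consumes one interleaved input sequence.
        visual_embs = self.visual_in(image_feats)           # (n_img, lm_dim)
        return torch.cat([visual_embs, token_embs], dim=0)  # toy ordering: images first

    def retrieval_query(self, hidden_state):
        # Map a designated hidden state into the space where candidate image
        # embeddings live, for nearest-neighbor image retrieval.
        q = self.text_out(hidden_state)
        return q / q.norm(dim=-1, keepdim=True)

# Toy usage with placeholder frozen modules and random features.
toy = GroundedLM(lm=nn.Identity(), vision_encoder=nn.Identity())
seq = toy.interleave(torch.randn(10, 4096), torch.randn(2, 768))
print(seq.shape)  # torch.Size([12, 4096])
```

Because only the two linear layers carry gradients, training touches a small fraction of the parameters, which is what makes the approach efficient and applicable to any off-the-shelf language model.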
Related papers
- An End-to-End Model for Photo-Sharing Multi-modal Dialogue Generation [43.139415423751615]
Photo-sharing multi-modal dialogue generation requires a dialogue agent not only to generate text responses but also to share photos at the proper moment.
A pipeline model integrates an image caption model, a text generation model, and an image generation model to handle this complex multi-modal task.
We propose the first end-to-end model for photo-sharing multi-modal dialogue generation, which integrates an image perceptron and an image generator with a large language model.
arXiv Detail & Related papers (2024-08-16T10:33:19Z)
- Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation [82.5217996570387]
We adapt a pre-trained language model for auto-regressive text-to-image generation.
We find that pre-trained language models offer limited help.
arXiv Detail & Related papers (2023-11-27T07:19:26Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We propose the first work to train text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z)
- Visual Grounding Strategies for Text-Only Natural Language Processing [1.2183405753834562]
Multimodal extensions of BERT allow joint modeling of texts and images and lead to state-of-the-art results on multimodal tasks such as Visual Question Answering.
Here, we leverage multimodal modeling for purely textual tasks, with the expectation that multimodal pretraining provides a grounding that can improve text processing accuracy.
A first type of strategy, referred to as transferred grounding, consists in applying multimodal models to text-only tasks using a placeholder to replace the image input.
The second, which we call associative grounding, harnesses image retrieval to match texts with related images during both ... (see the retrieval sketch after this list).
arXiv Detail & Related papers (2021-03-25T16:03:00Z)
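The associative-grounding idea in the last entry and the contextual image retrieval task highlighted in the main abstract both reduce to matching text against a pool of candidate images in a shared embedding space. Below is a minimal sketch of that retrieval step; the Hugging Face CLIP checkpoint and the two solid-color stand-in images are illustrative assumptions, not the setup of either paper.

```python
# Sketch only: embed candidate images and a query text with CLIP,
# then return the candidate image closest to the text.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Toy "image collection": two solid-color images instead of a real corpus.
images = [Image.new("RGB", (224, 224), color) for color in ("red", "blue")]
query = ["a photo of a clear blue sky"]

with torch.no_grad():
    image_embs = model.get_image_features(**processor(images=images, return_tensors="pt"))
    text_embs = model.get_text_features(**processor(text=query, return_tensors="pt", padding=True))

# Cosine similarity between the query text and each candidate image.
image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
scores = text_embs @ image_embs.T  # shape: (1, num_images)
print("retrieved image index:", scores.argmax(dim=-1).item())
```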