X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal
Transformers
- URL: http://arxiv.org/abs/2009.11278v1
- Date: Wed, 23 Sep 2020 17:45:17 GMT
- Title: X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal
Transformers
- Authors: Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha
Kembhavi
- Abstract summary: Vision-and-language models like ViLBERT, LXMERT and UNITER, mirroring the success of masked language models, have achieved state-of-the-art performance on a variety of multimodal discriminative tasks.
Recent work has also successfully adapted such models towards the generative task of image captioning.
This begs the question: Can these models go the other way and generate images from pieces of text?
- Score: 49.851202669815954
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mirroring the success of masked language models, vision-and-language
counterparts like ViLBERT, LXMERT and UNITER have achieved state of the art
performance on a variety of multimodal discriminative tasks like visual
question answering and visual grounding. Recent work has also successfully
adapted such models towards the generative task of image captioning. This begs
the question: Can these models go the other way and generate images from pieces
of text? Our analysis of a popular representative from this model family -
LXMERT - finds that it is unable to generate rich and semantically meaningful
imagery with its current training setup. We introduce X-LXMERT, an extension to
LXMERT with training refinements including: discretizing visual
representations, using uniform masking with a large range of masking ratios, and
aligning the right pre-training datasets to the right objectives, which enables
it to paint. X-LXMERT's image generation capabilities rival state-of-the-art
generative models, while its question answering and captioning abilities remain
comparable to LXMERT. Finally, we demonstrate the generality of these training
refinements by adding image generation capabilities into UNITER to produce
X-UNITER.
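As a rough illustration of two of these refinements (not the authors' implementation), the sketch below quantizes visual features against a codebook of centroids and samples masking ratios uniformly from a wide range instead of a fixed ratio; the codebook size, feature dimension, and masking range used here are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of two refinements named in the abstract:
# (1) discretizing visual representations by snapping each grid/region feature to its
#     nearest entry in a learned codebook, and
# (2) masking with a ratio drawn uniformly from a wide range rather than a fixed 15%.
# Codebook size, feature dimension, and the masking range are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def discretize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each feature vector to the index of its nearest codebook centroid."""
    # features: (num_regions, dim), codebook: (num_clusters, dim)
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)  # integer "visual tokens", shape (num_regions,)

def uniform_mask(num_tokens: int, low: float = 0.1, high: float = 1.0) -> np.ndarray:
    """Sample a masking ratio uniformly from [low, high] and mask that many positions."""
    ratio = rng.uniform(low, high)
    num_masked = max(1, int(round(ratio * num_tokens)))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    return mask

# Toy example: an 8x8 grid of 256-d visual features and a 1024-entry codebook.
features = rng.normal(size=(64, 256)).astype(np.float32)
codebook = rng.normal(size=(1024, 256)).astype(np.float32)
visual_tokens = discretize(features, codebook)
mask = uniform_mask(len(visual_tokens))
print(visual_tokens[:5], int(mask.sum()), "of", mask.size, "positions masked")
```

With discrete targets of this kind, masked visual positions can, roughly speaking, be filled in at generation time by sampling from a predicted distribution over codebook indices rather than regressing continuous features.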
Related papers
- Conditional Text-to-Image Generation with Reference Guidance [81.99538302576302]
This paper explores conditioning diffusion models on additional reference images that provide visual guidance for the particular subjects to be generated.
We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references.
Our expert plugins achieve superior results to existing methods on all tasks, with each plugin containing only 28.55M trainable parameters.
arXiv Detail & Related papers (2024-11-22T21:38:51Z)
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and support dynamic sequence lengths that vary with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Fusion Models for Improved Visual Captioning [18.016295296424413]
This paper proposes a generic multimodal model fusion framework for caption generation and emendation.
We employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM) with a visual captioning model, viz. Show, Attend, and Tell.
Our caption emendation experiments on three benchmark image captioning datasets, viz. Flickr8k, Flickr30k, and MSCOCO, show improvements over the baseline.
arXiv Detail & Related papers (2020-10-28T21:55:25Z)
- XGPT: Cross-modal Generative Pre-Training for Image Captioning [80.26456233277435]
XGPT is a new method of Cross-modal Generative Pre-Training for Image Captioning.
It is designed to pre-train text-to-image caption generators through three novel generation tasks.
XGPT can be fine-tuned without any task-specific architecture modifications.
arXiv Detail & Related papers (2020-03-03T12:13:06Z)