MAGVLT: Masked Generative Vision-and-Language Transformer
- URL: http://arxiv.org/abs/2303.12208v1
- Date: Tue, 21 Mar 2023 21:49:39 GMT
- Title: MAGVLT: Masked Generative Vision-and-Language Transformer
- Authors: Sungwoong Kim, Daejin Jo, Donghoon Lee, Jongmin Kim
- Abstract summary: We explore a unified generative vision-and-language model that can produce both images and text sequences.
We propose a generative VL transformer based on non-autoregressive mask prediction, named MAGVLT, and compare it with an autoregressive generative VL transformer (ARGVLT).
For rigorous training of our MAGVLT with image-text pairs from scratch, we combine the image-to-text, text-to-image, and joint image-and-text mask prediction tasks.
- Score: 15.796199345773879
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: While generative modeling on multimodal image-text data has been actively
developed with large-scale paired datasets, there have been limited attempts to
generate both image and text data by a single model rather than a generation of
one fixed modality conditioned on the other modality. In this paper, we explore
a unified generative vision-and-language (VL) model that can produce both
images and text sequences. In particular, we propose a generative VL
transformer based on non-autoregressive mask prediction, named MAGVLT, and
compare it with an autoregressive generative VL transformer (ARGVLT). Compared
to ARGVLT, the proposed MAGVLT enables bidirectional context encoding, fast
decoding via parallel token prediction with iterative refinement, and extended
editing capabilities such as image and text infilling. For rigorous training of
our MAGVLT with image-text pairs from scratch, we combine the image-to-text,
text-to-image, and joint image-and-text mask prediction tasks. Moreover, we
devise two additional tasks based on the step-unrolled mask prediction and the
selective prediction on the mixture of two image-text pairs. Experimental
results on various downstream generation tasks of VL benchmarks show that our
MAGVLT outperforms ARGVLT by a large margin while also providing a significant
inference speedup. In particular, MAGVLT achieves competitive results on both
zero-shot image-to-text and text-to-image generation tasks on MS-COCO with a
single moderate-sized model (fewer than 500M parameters), even without using
monomodal data or networks.
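The fast decoding claimed above is in the spirit of MaskGIT-style mask-and-predict decoding: every target position starts masked, each step predicts all masked tokens in parallel, and the least confident predictions are re-masked for the next step. Below is a minimal PyTorch sketch of that inference loop, assuming a hypothetical `model(context, target)` forward that returns per-position logits over a shared token vocabulary; the cosine masking schedule, step count, and temperature are illustrative defaults, not the paper's exact configuration.

```python
import math
import torch


def iterative_masked_decode(model, context, target_len, mask_id, steps=8, temperature=1.0):
    """Fill a fully masked target sequence by iterative parallel prediction.

    Each step predicts every currently masked position at once, keeps the most
    confident predictions, and re-masks the rest according to a cosine schedule,
    so decoding needs only `steps` forward passes instead of one per token.
    """
    device = context.device
    tokens = torch.full((1, target_len), mask_id, dtype=torch.long, device=device)

    for step in range(steps):
        logits = model(context, tokens)                  # assumed shape: (1, target_len, vocab)
        probs = torch.softmax(logits / temperature, dim=-1)
        confidence, prediction = probs.max(dim=-1)       # best guess per position

        was_masked = tokens.eq(mask_id)
        tokens = torch.where(was_masked, prediction, tokens)

        # Cosine schedule: how many tokens stay masked after this step.
        mask_ratio = math.cos(math.pi / 2.0 * (step + 1) / steps)
        num_to_remask = int(mask_ratio * target_len)
        if num_to_remask == 0:
            break  # final step: every position has been filled in

        # Re-mask the least confident of the tokens filled in this round.
        confidence = confidence.masked_fill(~was_masked, float("inf"))
        remask_idx = confidence.topk(num_to_remask, largest=False).indices
        tokens = tokens.scatter(1, remask_idx, mask_id)

    return tokens
```

Under this framing, the combined training tasks from the abstract (image-to-text, text-to-image, and joint image-and-text mask prediction) roughly correspond to choosing which side of an image-text pair supplies `context` and which supplies the masked target.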
Related papers
- Large Language Models for Multimodal Deformable Image Registration [50.91473745610945]
We propose a novel coarse-to-fine MDIR framework, LLM-Morph, for aligning the deep features from medical images of different modalities.
Specifically, we first utilize a CNN encoder to extract deep visual features from cross-modal image pairs, then we use the first adapter to adjust these tokens, and use LoRA in pre-trained LLMs to fine-tune their weights.
Third, for the alignment of tokens, we utilize four other adapters to transform the LLM-encoded tokens into multi-scale visual features, generating multi-scale deformation fields and facilitating the coarse-to-fine MDIR task.
arXiv Detail & Related papers (2024-08-20T09:58:30Z)
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation [79.02357561313785]
We introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data.
VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective.
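In contrast to MAGVLT's mask-predict decoding, the VL-GPT summary above describes a single auto-regressive objective shared by both modalities. A minimal sketch of such a unified next-token loss, assuming for illustration that image and text have already been discretized into ids from a shared vocabulary and that `transformer(sequence)` returns per-position logits:

```python
import torch
import torch.nn.functional as F


def unified_autoregressive_loss(transformer, image_tokens, text_tokens):
    """One next-token cross-entropy over a concatenated image+text sequence.

    Because both modalities live in the same token space, a single objective
    covers image generation and text generation alike.
    """
    sequence = torch.cat([image_tokens, text_tokens], dim=1)   # (B, L_img + L_txt)
    logits = transformer(sequence[:, :-1])                     # assumed: (B, L - 1, vocab)
    targets = sequence[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```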
arXiv Detail & Related papers (2023-12-14T18:59:43Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- VL-BEiT: Generative Vision-Language Pretraining [107.25298505511184]
We introduce a vision-language foundation model called VL-BEiT, which is a bidirectional multimodal Transformer learned by generative pretraining.
Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images.
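The three VL-BEiT objectives listed above (masked vision-language modeling on pairs, masked language modeling on text, masked image modeling on images) can be pictured as three cross-entropy terms over masked positions of different token streams. A rough sketch, assuming a shared `model` that maps token ids to per-position logits, images represented as discrete visual tokens, and hypothetical batch keys and masking ratio:

```python
import torch
import torch.nn.functional as F


def random_mask(tokens, mask_id, ratio=0.15, ignore_index=-100):
    """Mask a random subset of positions; unmasked positions are ignored by the loss."""
    mask = torch.rand(tokens.shape, device=tokens.device) < ratio
    inputs = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    targets = torch.where(mask, tokens, torch.full_like(tokens, ignore_index))
    return inputs, targets


def masked_ce(logits, targets):
    # logits: (B, L, V); targets: (B, L) with -100 at unmasked positions.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=-100)


def vl_beit_style_losses(model, batch, mask_id):
    """Sum of the three masked-prediction objectives described above."""
    pair_in, pair_tgt = random_mask(batch["image_text_tokens"], mask_id)   # image-text pairs
    text_in, text_tgt = random_mask(batch["text_tokens"], mask_id)         # text-only data
    img_in, img_tgt = random_mask(batch["image_tokens"], mask_id)          # image-only data
    return (masked_ce(model(pair_in), pair_tgt)
            + masked_ce(model(text_in), text_tgt)
            + masked_ce(model(img_in), img_tgt))
```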
arXiv Detail & Related papers (2022-06-02T16:14:19Z)
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
- VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts [46.55920956687346]
We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Mixture-of-Modality-Experts (MoME) Transformer network.
Because of the modeling flexibility of MoME, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks.
We propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data in addition to image-text pairs.
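In a Mixture-of-Modality-Experts block as summarized above, self-attention is shared across modalities while the feed-forward expert is selected per input type. A schematic sketch of one such block, where the layer sizes, expert names, and string-valued routing key are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
from torch import nn


class MoMEBlock(nn.Module):
    """Transformer block with a shared attention layer and per-modality FFN experts."""

    def __init__(self, dim=768, heads=12, ffn_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.experts = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
            for name in ("vision", "language", "vision_language")
        })

    def forward(self, x, modality):
        # Shared multi-head self-attention over tokens of shape (B, L, dim).
        h = self.norm1(x)
        h, _ = self.attn(h, h, h)
        x = x + h
        # Modality-specific expert FFN chosen by the routing key.
        x = x + self.experts[modality](self.norm2(x))
        return x
```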
arXiv Detail & Related papers (2021-11-03T17:20:36Z)
- VLDeformer: Learning Visual-Semantic Embeddings by Vision-Language Transformer Decomposing [7.890230091463883]
Vision-language transformers (VL transformers) have shown impressive accuracy in cross-modal retrieval.
We propose a novel Vision-language Transformer Decomposing (VLDeformer) to modify the VL transformer as an individual encoder for a single image or text.
arXiv Detail & Related papers (2021-10-20T09:00:51Z)
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
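The Cycle-Consistent Inverse GAN summary describes a two-stage recipe: invert an image into the GAN latent space, then optimize the inverted code toward the text-specified attributes. A hedged sketch of that optimization loop, where `generator`, `inverter`, and the text-image `similarity` score are assumed callables and the identity-regularization weight is illustrative:

```python
import torch


def text_guided_edit(generator, inverter, similarity, image, text_embedding,
                     steps=200, lr=0.05, identity_weight=0.1):
    """Invert the image, then optimize the latent code to match the target text."""
    with torch.no_grad():
        z0 = inverter(image)                 # inverted latent code of the input image
    z = z0.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        generated = generator(z)
        # Maximize text-image agreement while keeping the code near the inversion.
        loss = -similarity(generated, text_embedding) + identity_weight * (z - z0).pow(2).mean()
        loss.backward()
        optimizer.step()

    return generator(z).detach()
```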
arXiv Detail & Related papers (2021-08-03T08:38:16Z)