Unifying Multimodal Transformer for Bi-directional Image and Text
Generation
- URL: http://arxiv.org/abs/2110.09753v1
- Date: Tue, 19 Oct 2021 06:01:24 GMT
- Title: Unifying Multimodal Transformer for Bi-directional Image and Text
Generation
- Authors: Yupan Huang, Hongwei Xue, Bei Liu, Yutong Lu
- Abstract summary: We study the joint learning of image-to-text and text-to-image generations, which are naturally bi-directional tasks.
We propose a unified image-and-text generative framework based on a single multimodal model to jointly study the bi-directional tasks.
- Score: 8.547205551848462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the joint learning of image-to-text and text-to-image generations,
which are naturally bi-directional tasks. Typical existing works design two
separate task-specific models for each task, which impose expensive design
efforts. In this work, we propose a unified image-and-text generative framework
based on a single multimodal model to jointly study the bi-directional tasks.
We adopt Transformer as our unified architecture for its strong performance and
task-agnostic design. Specifically, we formulate both tasks as sequence
generation tasks, where we represent images and text as unified sequences of
tokens, and the Transformer learns multimodal interactions to generate
sequences. We further propose two-level granularity feature representations and
sequence-level training to improve the Transformer-based unified framework.
Experiments show that our approach significantly improves previous
Transformer-based model X-LXMERT's FID from 37.0 to 29.9 (lower is better) for
text-to-image generation, and improves CIDEr-D score from 100.9% to 122.6% for
fine-tuned image-to-text generation on the MS-COCO dataset. Our code is
available online.
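The formulation above (both directions cast as next-token prediction over one shared token sequence) can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration rather than the authors' released code: the vocabulary sizes, the special direction tokens, the token-offset layout, and the decoder-only architecture are all assumptions, and the paper's two-level granularity features and sequence-level training are omitted.

```python
# Minimal sketch (an assumption, not the authors' released code) of the unified
# sequence formulation: text and image tokens share one vocabulary, a direction
# token selects the task, and a single causal Transformer predicts the next token.
# Positional embeddings are omitted for brevity.
import torch
import torch.nn as nn

TEXT_VOCAB = 30000   # assumed text (BPE) vocabulary size
IMG_VOCAB = 8192     # assumed discrete image-token codebook size
I2T, T2I = 0, 1      # assumed special tokens selecting the generation direction
TEXT_OFF = 2                 # text ids mapped to [2, 2 + TEXT_VOCAB)
IMG_OFF = 2 + TEXT_VOCAB     # image ids mapped past the text range
VOCAB = 2 + TEXT_VOCAB + IMG_VOCAB


def build_sequence(text_ids, image_ids, direction):
    """Concatenate both modalities into one token sequence.

    "i2t": condition on image tokens, generate the caption.
    "t2i": condition on text tokens, generate the image tokens.
    """
    text = [TEXT_OFF + t for t in text_ids]
    image = [IMG_OFF + i for i in image_ids]
    if direction == "i2t":
        return torch.tensor([I2T] + image + text)
    return torch.tensor([T2I] + text + image)


class UnifiedGenerator(nn.Module):
    """One decoder-only Transformer serving both generation directions."""

    def __init__(self, dim=512, heads=8, layers=6):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        n = tokens.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        hidden = self.blocks(self.embed(tokens), mask=causal)
        return self.head(hidden)                    # next-token logits
```

For example, `model(build_sequence(caption_ids, image_code_ids, "t2i").unsqueeze(0))` would return logits from which the image-token portion is sampled autoregressively; the same model handles "i2t" by swapping the ordering of the two modalities.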
Related papers
- Instruct-Imagen: Image Generation with Multi-modal Instruction [90.04481955523514]
instruct-imagen is a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks.
We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision.
Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain.
arXiv Detail & Related papers (2024-01-03T19:31:58Z)
- Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts.
Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities.
We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
arXiv Detail & Related papers (2023-08-08T15:43:59Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
- GIT: A Generative Image-to-text Transformer for Vision and Language [138.91581326369837]
We train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.
Our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr).
arXiv Detail & Related papers (2022-05-27T17:03:38Z)
- CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers [17.757983821569994]
A new text-to-image system, CogView2, shows very competitive generation compared to concurrent state-of-the-art DALL-E-2.
arXiv Detail & Related papers (2022-04-28T15:51:11Z)
- ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [22.47279425592133]
We propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation.
For the text-to-image generation process, we propose an end-to-end training method to jointly learn the visual sequence generator and the image reconstructor.
We train a 10-billion parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs.
arXiv Detail & Related papers (2021-12-31T03:53:33Z)
- Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model [58.17021225930069]
We explain the rationality of Vision Transformer by analogy with the proven, practical Evolutionary Algorithm (EA).
We propose a more efficient EAT model, and design task-related heads to deal with different tasks more flexibly.
Our approach achieves state-of-the-art results on the ImageNet classification task compared with recent vision transformer works.
arXiv Detail & Related papers (2021-05-31T16:20:03Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.