ERNIE-ViLG: Unified Generative Pre-training for Bidirectional
Vision-Language Generation
- URL: http://arxiv.org/abs/2112.15283v1
- Date: Fri, 31 Dec 2021 03:53:33 GMT
- Title: ERNIE-ViLG: Unified Generative Pre-training for Bidirectional
Vision-Language Generation
- Authors: Han Zhang, Weichong Yin, Yewei Fang, Lanxin Li, Boqiang Duan, Zhihua
Wu, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang
- Abstract summary: We propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation.
For the text-to-image generation process, we propose an end-to-end training method to jointly learn the visual sequence generator and the image reconstructor.
We train a 10-billion parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs.
- Score: 22.47279425592133
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventional methods for image-text generation tasks mainly tackle
the naturally bidirectional generation tasks separately, focusing on designing
task-specific frameworks to improve the quality and fidelity of the generated
samples. Recently, vision-language pre-training models have greatly improved
the performance of image-to-text generation tasks, but large-scale pre-training
models for the text-to-image synthesis task are still under-developed. In this
paper, we propose ERNIE-ViLG, a unified generative pre-training framework for
bidirectional image-text generation with a transformer model. Based on image
quantization models, we formulate both image generation and text generation as
autoregressive generative tasks conditioned on the text/image input. The
bidirectional image-text generative modeling eases the semantic alignment
across vision and language. For the text-to-image generation process, we
further propose an end-to-end training method to jointly learn the visual
sequence generator and the image reconstructor. To explore the landscape of
large-scale pre-training for bidirectional text-image generation, we train a
10-billion-parameter ERNIE-ViLG model on a large-scale dataset of 145 million
(Chinese) image-text pairs; the model achieves state-of-the-art performance on
both text-to-image and image-to-text tasks, obtaining an FID of 7.9 on MS-COCO
for text-to-image synthesis and the best results on COCO-CN and AIC-ICC for
image captioning.
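To make the formulation concrete, the sketch below illustrates the kind of objective the abstract describes: an image tokenizer (e.g. a VQ-VAE-style model, trained separately) turns each image into a sequence of discrete codes, and a single autoregressive transformer is trained on both the text-to-image and the image-to-text direction over a shared vocabulary. This is a minimal sketch under assumed toy vocabulary sizes, dimensions, and module names; it is not ERNIE-ViLG's actual configuration.

```python
# Illustrative sketch only: one decoder-style transformer trained on both
# generation directions over a shared vocabulary of text tokens and quantized
# image codes. All sizes and the random "tokenized" inputs are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_VOCAB = 1000, 512           # toy vocabulary sizes
VOCAB = TEXT_VOCAB + IMAGE_VOCAB              # shared vocabulary: text ids, then image-code ids
D_MODEL, N_LAYERS, MAX_LEN = 256, 4, 96

class BidirectionalGenerator(nn.Module):
    """A single transformer used for both text-to-image and image-to-text generation."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, N_LAYERS)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(seq_len, device=tokens.device))
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len).to(tokens.device)
        return self.head(self.blocks(x, mask=causal))

def autoregressive_loss(model, condition, target):
    """Next-token prediction loss on the target segment, conditioned on the prefix."""
    seq = torch.cat([condition, target], dim=1)
    logits = model(seq[:, :-1])
    # only positions that predict target tokens contribute to the loss
    tgt_logits = logits[:, condition.size(1) - 1:, :]
    return F.cross_entropy(tgt_logits.reshape(-1, VOCAB), target.reshape(-1))

# Toy batch: text token ids and quantized image codes (offset into the shared vocabulary).
text = torch.randint(0, TEXT_VOCAB, (2, 16))
image_codes = torch.randint(TEXT_VOCAB, VOCAB, (2, 32))   # stands in for VQ codes of an image

model = BidirectionalGenerator()
loss = autoregressive_loss(model, text, image_codes)          # text -> image direction
loss = loss + autoregressive_loss(model, image_codes, text)   # image -> text direction
loss.backward()
```

In the full text-to-image pipeline, the generated image codes would be passed to an image reconstructor (the tokenizer's decoder) to recover pixels; the end-to-end training mentioned in the abstract additionally lets the reconstruction signal reach the sequence generator, which this toy sketch does not attempt.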
Related papers
- Paragraph-to-Image Generation with Information-Enriched Diffusion Model [67.9265336953134]
ParaDiffusion is an information-enriched diffusion model for the paragraph-to-image generation task.
It transfers the extensive semantic comprehension capabilities of large language models to image generation.
The code and dataset will be released to foster community research on long-text alignment.
arXiv Detail & Related papers (2023-11-24T05:17:01Z)
- TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training [5.239585892767183]
We propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training.
Our proposed method outperforms existing state-of-the-art methods, both on text-guided generation tasks and manipulation tasks.
arXiv Detail & Related papers (2023-09-21T09:34:20Z)
- Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z)
- GIT: A Generative Image-to-text Transformer for Vision and Language [138.91581326369837]
We train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.
Our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr).
arXiv Detail & Related papers (2022-05-27T17:03:38Z)
- DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training [37.15272352614968]
We propose DU-VLG, a framework which unifies vision-and-language generation as sequence generation problems.
DU-VLG is trained with novel dual pre-training tasks: multi-modal denoising autoencoder tasks and modality translation tasks.
Results show that DU-VLG yields better performance than variants trained with uni-directional generation objectives or the variant without the commitment loss.
arXiv Detail & Related papers (2022-03-17T03:18:22Z)
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We present the first work that trains text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model (a toy sketch of this idea follows the list below).
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z)
- Unifying Multimodal Transformer for Bi-directional Image and Text Generation [8.547205551848462]
We study the joint learning of image-to-text and text-to-image generation, which are naturally bi-directional tasks.
We propose a unified image-and-text generative framework based on a single multimodal model to jointly study the bi-directional tasks.
arXiv Detail & Related papers (2021-10-19T06:01:24Z)
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes (a toy sketch of this optimization loop follows the list below).
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text inputs, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
- XGPT: Cross-modal Generative Pre-Training for Image Captioning [80.26456233277435]
XGPT is a new method of Cross-modal Generative Pre-Training for Image Captioning.
It is designed to pre-train text-to-image caption generators through three novel generation tasks.
XGPT can be fine-tuned without any task-specific architecture modifications.
arXiv Detail & Related papers (2020-03-03T12:13:06Z)
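As a companion to the LAFITE entry above, the toy sketch below illustrates the general idea of language-free training in a shared image-text embedding space: perturbed image embeddings stand in for the missing text embeddings during training, and genuine text embeddings from the same space are used at inference. The encoders here are untrained stand-ins for a real CLIP model and the generator is a toy MLP; this is an assumed illustration, not the paper's implementation.

```python
# Illustrative sketch of language-free conditioning via a shared image-text
# embedding space. The "clip_*" encoders are random stand-ins, not real CLIP.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB, NOISE, IMG = 128, 64, 3 * 32 * 32

clip_image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(IMG, EMB))   # stand-in image encoder
clip_text_encoder = nn.Embedding(1000, EMB)                             # stand-in text encoder

class ConditionalGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(EMB + NOISE, 256), nn.ReLU(), nn.Linear(256, IMG))
    def forward(self, cond, z):
        return self.net(torch.cat([cond, z], dim=-1))

def pseudo_text_feature(images, noise_level=0.1):
    """Training-time condition: a perturbed, normalized image embedding stands in
    for the unavailable text embedding, exploiting the aligned joint space."""
    e = F.normalize(clip_image_encoder(images), dim=-1)
    return F.normalize(e + noise_level * torch.randn_like(e), dim=-1)

gen = ConditionalGenerator()

# Training step without any captions: condition on pseudo text features.
images = torch.rand(4, 3, 32, 32)
fake = gen(pseudo_text_feature(images), torch.randn(4, NOISE))
loss = F.mse_loss(fake, images.flatten(1))   # placeholder for the real adversarial objective
loss.backward()

# Inference: swap in a genuine text embedding from the same joint space.
caption_ids = torch.randint(0, 1000, (1, 8))
text_emb = F.normalize(clip_text_encoder(caption_ids).mean(dim=1), dim=-1)
sample = gen(text_emb, torch.randn(1, NOISE))
```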
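For the Cycle-Consistent Inverse GAN entry, the sketch below shows the generic pattern of text-guided latent-code optimization after GAN inversion: invert the image into a latent code, then optimize that code against a text-image similarity objective while keeping it close to its starting point. The generator, inversion encoder, and similarity scorer are untrained stand-ins, and the proximity penalty is only a crude proxy for the paper's cycle-consistency training; none of this is the authors' model.

```python
# Illustrative sketch of text-guided latent-code optimization after GAN inversion.
# All modules are untrained stand-ins; only the optimization pattern is the point.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT, IMG = 128, 3 * 32 * 32

generator = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(), nn.Linear(256, IMG))  # stand-in GAN generator
inverter = nn.Sequential(nn.Flatten(), nn.Linear(IMG, LATENT))                     # stand-in inversion encoder
image_embed = nn.Linear(IMG, 64)                                                   # stand-in joint-space image encoder
text_embed = nn.Embedding(1000, 64)                                                # stand-in joint-space text encoder

def text_image_similarity(image_flat, caption_ids):
    """Cosine similarity between stand-in image and text embeddings."""
    img = F.normalize(image_embed(image_flat), dim=-1)
    txt = F.normalize(text_embed(caption_ids).mean(dim=1), dim=-1)
    return (img * txt).sum(dim=-1).mean()

# Step 1: invert the real image into a latent code.
real = torch.rand(1, 3, 32, 32)
with torch.no_grad():
    latent = inverter(real)

# Step 2: optimize the inverted code so the edited image matches the target text,
# while staying close to the original code.
latent = latent.clone().requires_grad_(True)
anchor = latent.detach().clone()
caption_ids = torch.randint(0, 1000, (1, 8))
opt = torch.optim.Adam([latent], lr=0.05)
for _ in range(50):
    opt.zero_grad()
    edited = generator(latent)
    loss = -text_image_similarity(edited, caption_ids) + 0.1 * F.mse_loss(latent, anchor)
    loss.backward()
    opt.step()

edited_image = generator(latent).view(1, 3, 32, 32)
```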