XGPT: Cross-modal Generative Pre-Training for Image Captioning
- URL: http://arxiv.org/abs/2003.01473v2
- Date: Wed, 4 Mar 2020 07:56:09 GMT
- Title: XGPT: Cross-modal Generative Pre-Training for Image Captioning
- Authors: Qiaolin Xia, Haoyang Huang, Nan Duan, Dongdong Zhang, Lei Ji, Zhifang
Sui, Edward Cui, Taroon Bharti, Xin Liu, Ming Zhou
- Abstract summary: XGPT is a new method of Cross-modal Generative Pre-Training for Image Captioning.
It is designed to pre-train image-to-text caption generators through three novel generation tasks.
XGPT can be fine-tuned without any task-specific architecture modifications.
- Score: 80.26456233277435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While many BERT-based cross-modal pre-trained models produce excellent
results on downstream understanding tasks like image-text retrieval and VQA,
they cannot be applied to generation tasks directly. In this paper, we propose
XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning
that is designed to pre-train image-to-text caption generators through three
novel generation tasks, including Image-conditioned Masked Language Modeling
(IMLM), Image-conditioned Denoising Autoencoding (IDA), and Text-conditioned
Image Feature Generation (TIFG). As a result, the pre-trained XGPT can be
fine-tuned without any task-specific architecture modifications to create
state-of-the-art models for image captioning. Experiments show that XGPT
obtains new state-of-the-art results on the benchmark datasets, including COCO
Captions and Flickr30k Captions. We also use XGPT to generate new image
captions as data augmentation for the image retrieval task and achieve
significant improvement on all recall metrics.
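To make the pre-training objectives concrete, below is a minimal sketch of the IMLM task in PyTorch. The `model(image_feats, token_ids)` interface, the `MASK_ID` constant, and the 15% masking rate are illustrative assumptions for this sketch, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

MASK_ID = 103   # hypothetical [MASK] token id; depends on the tokenizer
IGNORE = -100   # label value for unmasked positions (ignored by the loss)

def imlm_loss(model, image_feats, caption_ids, mask_prob=0.15):
    """Image-conditioned Masked Language Modeling (IMLM), sketched:
    corrupt a random subset of caption tokens and train the model to
    recover them while conditioning on the image features."""
    corrupted = caption_ids.clone()
    labels = torch.full_like(caption_ids, IGNORE)
    mask = torch.rand(caption_ids.shape, device=caption_ids.device) < mask_prob
    labels[mask] = caption_ids[mask]        # supervise only the masked slots
    corrupted[mask] = MASK_ID               # replace them with [MASK]
    logits = model(image_feats, corrupted)  # (B, T, V) vocabulary scores
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1), ignore_index=IGNORE)
```

IDA and TIFG would follow the same pattern with different corruption and prediction targets (denoising a shuffled caption, and regressing image features from text, respectively).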
Related papers
- Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models [16.4010094165575]
We propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations.
Inspired by recent masked signal reconstruction, we propose to use pixel-level masking as a novel form of data augmentation (a minimal sketch of such patch masking appears after this list).
Our method outperforms state-of-the-art baselines across various metrics.
arXiv Detail & Related papers (2023-11-05T01:14:02Z)
- TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training [5.239585892767183]
We propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training.
Our proposed method outperforms existing state-of-the-art methods on both text-guided generation and manipulation tasks.
arXiv Detail & Related papers (2023-09-21T09:34:20Z)
- ASPIRE: Language-Guided Data Augmentation for Improving Robustness Against Spurious Correlations [43.323791505213634]
ASPIRE (Language-guided Data Augmentation for SPurIous correlation REmoval) supplements the training dataset with images that lack spurious features.
It can generate non-spurious images without requiring any group labeling or existing non-spurious images in the training set.
It improves the worst-group classification accuracy of prior methods by 1%-38%.
arXiv Detail & Related papers (2023-08-19T20:18:15Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features (a minimal quantization sketch appears after this list).
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
- ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [22.47279425592133]
We propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation.
For the text-to-image generation process, we propose an end-to-end training method to jointly learn the visual sequence generator and the image reconstructor.
We train a 10-billion parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs.
arXiv Detail & Related papers (2021-12-31T03:53:33Z)
- X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers [49.851202669815954]
Masked language models like ViLBERT, LXMERT and UNITER have achieved state-of-the-art performance on a variety of multimodal discriminative tasks.
Recent work has also successfully adapted such models to the generative task of image captioning.
This raises the question: can these models go the other way and generate images from pieces of text?
arXiv Detail & Related papers (2020-09-23T17:45:17Z)
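As noted in the "Augment the Pairs" entry above, pixel-level masking as a data augmentation can be pictured as zeroing out random image patches. The patch size of 16 and masking ratio of 0.25 below are illustrative assumptions, not values reported by that paper.

```python
import torch

def mask_random_patches(image, patch=16, mask_ratio=0.25):
    """Zero out a random subset of patches as a data augmentation
    (an illustrative guess at pixel-level masking; the paper's exact
    scheme may differ). Assumes H and W are divisible by `patch`."""
    c, h, w = image.shape
    keep = torch.rand(h // patch, w // patch) >= mask_ratio  # True = keep
    mask = keep.repeat_interleave(patch, 0).repeat_interleave(patch, 1)
    return image * mask.to(image.dtype)  # (H, W) mask broadcasts over C
```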
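For the Language Quantized AutoEncoders entry, the core alignment step can be pictured as nearest-neighbour quantization of image features into a frozen language model's token-embedding table. The shapes and the Euclidean distance used below are assumptions for illustration; LQAE's full model also includes an encoder, decoder, and reconstruction loss that this sketch omits.

```python
import torch

def quantize_to_tokens(patch_feats, vocab_emb):
    """Map each image patch feature to the id of its nearest vocabulary
    embedding (nearest-neighbour quantization, as in VQ-style models).
    patch_feats: (N, D) image features; vocab_emb: (V, D) frozen LM
    embedding table. Returns (N,) token ids consumable by the LM."""
    dists = torch.cdist(patch_feats, vocab_emb)  # (N, V) distances
    return dists.argmin(dim=-1)
```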
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.