DU-VLG: Unifying Vision-and-Language Generation via Dual
Sequence-to-Sequence Pre-training
- URL: http://arxiv.org/abs/2203.09052v1
- Date: Thu, 17 Mar 2022 03:18:22 GMT
- Title: DU-VLG: Unifying Vision-and-Language Generation via Dual
Sequence-to-Sequence Pre-training
- Authors: Luyang Huang, Guocheng Niu, Jiachen Liu, Xinyan Xiao, Hua Wu
- Abstract summary: We propose DU-VLG, a framework which unifies vision-and-language generation as sequence generation problems.
DU-VLG is trained with novel dual pre-training tasks: multi-modal denoising autoencoder tasks and modality translation tasks.
Results show that DU-VLG yields better performance than variants trained with uni-directional generation objectives or the variant without the commitment loss.
- Score: 37.15272352614968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the limitations of the model structure and pre-training objectives,
existing vision-and-language generation models cannot utilize paired images
and text through bi-directional generation. In this paper, we propose DU-VLG, a
framework which unifies vision-and-language generation as sequence generation
problems. DU-VLG is trained with novel dual pre-training tasks: multi-modal
denoising autoencoder tasks and modality translation tasks. To bridge the gap
between image understanding and generation, we further design a novel
commitment loss. We compare pre-training objectives on image captioning and
text-to-image generation datasets. Results show that DU-VLG yields better
performance than variants trained with uni-directional generation objectives or
the variant without the commitment loss. We also obtain higher scores compared
to previous state-of-the-art systems on three vision-and-language generation
tasks. In addition, human judges further confirm that our model generates real
and relevant images as well as faithful and informative captions.
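The abstract describes the training setup only at a high level, so the following is a minimal sketch of how the dual objective and the commitment loss could be combined, not the paper's actual implementation. It assumes images are represented as discrete tokens from a dVAE-style tokenizer and that the commitment loss resembles the VQ-VAE commitment term; the multi-modal denoising autoencoder objective is omitted for brevity, and the function and argument names (dual_vlg_loss, codebook, beta, and so on) are hypothetical.

```python
# Illustrative sketch only: the formulas below are assumptions, not DU-VLG's
# published implementation.
import torch
import torch.nn.functional as F


def dual_vlg_loss(
    text_logits,      # (B, T_txt, V_txt) logits from image -> text generation
    text_targets,     # (B, T_txt)        gold caption token ids
    image_logits,     # (B, T_img, V_img) logits from text -> image generation
    image_targets,    # (B, T_img)        gold discrete image token ids
    image_features,   # (B, T_img, D)     continuous image patch features
    codebook,         # (V_img, D)        embedding table of the image tokenizer
    beta=0.25,        # commitment weight (assumed, following VQ-VAE convention)
):
    # Modality translation, image -> text: ordinary caption cross-entropy.
    loss_i2t = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
    )

    # Modality translation, text -> image: cross-entropy over discrete image
    # tokens, treating image synthesis as sequence generation.
    loss_t2i = F.cross_entropy(
        image_logits.reshape(-1, image_logits.size(-1)),
        image_targets.reshape(-1),
    )

    # Commitment term (assumption): pull the continuous image features toward
    # the codebook entries of their target tokens, so the representation used
    # for understanding agrees with the tokens used for generation.
    target_codes = codebook[image_targets]            # (B, T_img, D)
    loss_commit = F.mse_loss(image_features, target_codes.detach())

    return loss_i2t + loss_t2i + beta * loss_commit
```

Under this reading, the two cross-entropy terms realize the dual image-to-text and text-to-image generation tasks, while the commitment term bridges image understanding and generation as the abstract describes; DU-VLG's actual token scheme and loss weighting may differ.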
Related papers
- Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model, VP-Score, on VisionPrefer to guide the training of text-to-image generative models; its preference prediction accuracy is comparable to that of human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z)
- Instruct-Imagen: Image Generation with Multi-modal Instruction [90.04481955523514]
instruct-imagen is a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks.
We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision.
Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain.
arXiv Detail & Related papers (2024-01-03T19:31:58Z)
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation [79.02357561313785]
We introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data.
VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective.
arXiv Detail & Related papers (2023-12-14T18:59:43Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which uses separate attention spaces for vision and language.
We show that DiMBERT achieves new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, namely CLIP image representations and the scaling of language models, do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
- ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [22.47279425592133]
We propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation.
For the text-to-image generation process, we propose an end-to-end training method to jointly learn the visual sequence generator and the image reconstructor.
We train a 10-billion parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs.
arXiv Detail & Related papers (2021-12-31T03:53:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences arising from its use.