Learning to Imagine: Visually-Augmented Natural Language Generation
- URL: http://arxiv.org/abs/2305.16944v3
- Date: Thu, 15 Jun 2023 06:25:34 GMT
- Title: Learning to Imagine: Visually-Augmented Natural Language Generation
- Authors: Tianyi Tang, Yushuo Chen, Yifan Du, Junyi Li, Wayne Xin Zhao, and
Ji-Rong Wen
- Abstract summary: We propose a method to make pre-trained language models (PLMs) Learn to Imagine for Visually-augmented natural language gEneration.
We use a diffusion model to synthesize high-quality images conditioned on the input texts.
We conduct synthesis for each sentence rather than generate only one image for an entire paragraph.
- Score: 73.65760028876943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: People often imagine relevant scenes to aid in the writing process. In this
work, we aim to utilize visual information for composition in the same manner
as humans. We propose a method, LIVE, that makes pre-trained language models
(PLMs) Learn to Imagine for Visually-augmented natural language gEneration.
First, we imagine the scene based on the text: we use a diffusion model to
synthesize high-quality images conditioned on the input texts. Second, we use
CLIP to determine whether the text can evoke the imagination in a posterior
way. Finally, our imagination is dynamic, and we conduct synthesis for each
sentence rather than generate only one image for an entire paragraph.
Technically, we propose a novel plug-and-play fusion layer to obtain
visually-augmented representations for each text. Our vision-text fusion layer
is compatible with Transformer-based architectures. We have conducted extensive
experiments on four generation tasks using BART and T5, and the automatic
results and human evaluation demonstrate the effectiveness of our proposed
method. We will release the code, model, and data at the link:
https://github.com/RUCAIBox/LIVE.
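The abstract outlines three steps: synthesize an image per sentence with a diffusion model, use CLIP to judge whether the sentence evokes a usable imagination, and inject the visual features into a Transformer PLM through a plug-and-play fusion layer. The sketch below is a minimal illustration of that pipeline under assumed components (the Stable Diffusion and CLIP checkpoints, the similarity threshold, and the gated cross-attention fusion design are placeholders); the authors' actual implementation is the one released at https://github.com/RUCAIBox/LIVE.

```python
# Minimal sketch of the imagine -> filter -> fuse idea; NOT the released LIVE code.
# Checkpoints, the acceptance threshold, and the fusion design are assumptions.
import torch
import torch.nn as nn
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Step 1: imagine -- synthesize one image per sentence with a diffusion model.
sd = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

# Step 2: filter -- keep an image only if CLIP judges the sentence "imageable".
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def imagine(sentences, threshold=0.25):
    """Return (sentence, image-or-None) pairs; the threshold is an assumed cut-off."""
    results = []
    for sent in sentences:
        image = sd(sent, num_inference_steps=25).images[0]
        inputs = clip_proc(text=[sent], images=image,
                           return_tensors="pt", padding=True).to(device)
        with torch.no_grad():
            out = clip(**inputs)
            score = torch.cosine_similarity(out.text_embeds, out.image_embeds).item()
        results.append((sent, image if score >= threshold else None))
    return results

# Step 3: fuse -- a plug-and-play layer that cross-attends from text hidden states
# to visual features before handing them back to a BART/T5-style encoder.
class VisionTextFusion(nn.Module):
    def __init__(self, d_text=768, d_vision=512):
        super().__init__()
        self.proj = nn.Linear(d_vision, d_text)
        self.attn = nn.MultiheadAttention(d_text, num_heads=8, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0: start as identity

    def forward(self, text_hidden, image_feats):
        # text_hidden: (B, T, d_text); image_feats: (B, N_img, d_vision)
        vis = self.proj(image_feats)
        fused, _ = self.attn(text_hidden, vis, vis)
        return text_hidden + torch.tanh(self.gate) * fused
```

Initializing the gate at zero lets the layer start as an identity mapping, so in this sketch it can be attached to a pre-trained encoder without perturbing its behavior before fine-tuning.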
Related papers
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate descriptions in natural language of the subject's apparent emotion.
In the second stage, the descriptions are used as contextual information and, along with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation [29.274362919954218]
We propose a new paradigm to automatically generate training data with accurate labels at scale.
The proposed approach decouples training data generation into foreground object generation and contextually coherent background generation.
We demonstrate the advantages of our approach on five object detection and segmentation datasets.
arXiv Detail & Related papers (2023-09-12T04:41:45Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
- Visualize Before You Write: Imagination-Guided Open-Ended Text Generation [68.96699389728964]
We propose iNLG, which uses machine-generated images to guide language models in open-ended text generation.
Experiments and analyses demonstrate the effectiveness of iNLG on open-ended text generation tasks.
arXiv Detail & Related papers (2022-10-07T18:01:09Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
- OptGAN: Optimizing and Interpreting the Latent Space of the Conditional Text-to-Image GANs [8.26410341981427]
We study how to ensure that generated samples are believable, realistic or natural.
We present a novel algorithm which identifies semantically-understandable directions in the latent space of a conditional text-to-image GAN architecture.
arXiv Detail & Related papers (2022-02-25T20:00:33Z) - ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural
Language Generation [53.56628907030751]
We propose ImaginE, an imagination-based automatic evaluation metric for natural language generation.
With the help of CLIP and DALL-E, two cross-modal models pre-trained on large-scale image-text pairs, we automatically generate an image as the embodied imagination for the text snippet.
Experiments spanning several text generation tasks demonstrate that adding imagination with our ImaginE displays great potential in introducing multi-modal information into NLG evaluation.
arXiv Detail & Related papers (2021-06-10T17:59:52Z)
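In the spirit of the ImaginE entry above, the rough sketch below scores a candidate sentence by how well it aligns with an image imagined from a reference text; the use of Stable Diffusion in place of DALL-E and the plain cosine-similarity score are illustrative assumptions, not the paper's actual metric.

```python
# Rough illustration of imagination-based scoring, not ImaginE's real metric.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

sd = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def imagine_score(candidate: str, reference: str) -> float:
    """CLIP similarity between a candidate text and the image imagined from the reference."""
    image = sd(reference, num_inference_steps=25).images[0]
    inputs = proc(text=[candidate], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    return torch.cosine_similarity(out.text_embeds, out.image_embeds).item()
```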