Improving Generation and Evaluation of Visual Stories via Semantic Consistency
- URL: http://arxiv.org/abs/2105.10026v1
- Date: Thu, 20 May 2021 20:42:42 GMT
- Title: Improving Generation and Evaluation of Visual Stories via Semantic Consistency
- Authors: Adyasha Maharana, Darryl Hannan, Mohit Bansal
- Abstract summary: Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
- Score: 72.00815192668193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Story visualization is an under-explored task that falls at the intersection
of many important research directions in both computer vision and natural
language processing. In this task, given a series of natural language captions
which compose a story, an agent must generate a sequence of images that
correspond to the captions. Prior work has introduced recurrent generative
models which outperform text-to-image synthesis models on this task. However,
there is room for improvement of generated images in terms of visual quality,
coherence and relevance. We present a number of improvements to prior modeling
approaches, including (1) the addition of a dual learning framework that
utilizes video captioning to reinforce the semantic alignment between the story
and generated images, (2) a copy-transform mechanism for
sequentially-consistent story visualization, and (3) MART-based transformers to
model complex interactions between frames. We present ablation studies to
demonstrate the effect of each of these techniques on the generative power of
the model, both for individual images and for the entire narrative.
Furthermore, due to the complexity and generative nature of the task, standard
evaluation metrics do not accurately reflect performance. Therefore, we also
provide an exploration of evaluation metrics for the model, focused on aspects
of the generated frames such as the presence/quality of generated characters,
the relevance to captions, and the diversity of the generated images. We also
present correlation experiments of our proposed automated metrics with human
evaluations. Code and data available at:
https://github.com/adymaharana/StoryViz
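
To make improvement (1) concrete, here is a minimal sketch of how a dual learning objective of this kind can be wired up: a video-captioning model reads the generated frames back into text, and its caption-reconstruction loss is added to the adversarial loss so the generator is rewarded for semantic alignment with the input story. The module interfaces, tensor shapes, and the weighting term `lambda_dual` are illustrative assumptions, not the paper's released code.

```python
# Sketch of a dual learning objective for story visualization (assumed
# interfaces, not the authors' implementation): a video captioner must be
# able to recover the input captions from the generated frames, and that
# reconstruction loss is added to the usual adversarial generator loss.

import torch
import torch.nn as nn

class DualLearningStoryGenerator(nn.Module):
    def __init__(self, generator, discriminator, video_captioner, lambda_dual=1.0):
        super().__init__()
        self.generator = generator              # caption sequence -> image sequence
        self.discriminator = discriminator      # (frames, captions) -> realism score
        self.video_captioner = video_captioner  # image sequence -> caption logits
        self.lambda_dual = lambda_dual          # weight of the dual (captioning) term

    def generator_loss(self, captions, caption_token_ids):
        frames = self.generator(captions)                   # (B, T, C, H, W)
        adv = -self.discriminator(frames, captions).mean()  # adversarial term

        # Dual term: teacher-forced cross-entropy of the captioner trying to
        # reproduce the story text from the generated frames.
        logits = self.video_captioner(frames)               # (B, T, L, vocab)
        dual = nn.functional.cross_entropy(
            logits.flatten(0, 2),            # (B*T*L, vocab)
            caption_token_ids.flatten(),     # (B*T*L,)
        )
        return adv + self.lambda_dual * dual
```
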
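For the metric exploration, a minimal sketch of one plausible direction, character-based scoring checked against human judgments, is shown below. A pretrained character classifier would supply the multi-hot predictions; the toy labels, ratings, and helper names are assumptions for illustration only.

```python
# Sketch: score character presence in generated frames with micro-F1,
# then check how well the automated metric tracks human ratings via
# rank correlation. All data below is toy data for illustration.

import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import f1_score

def character_f1(pred, gold):
    """Micro-F1 between multi-hot character labels detected in the
    generated frames (pred) and those mentioned in the captions (gold)."""
    return f1_score(gold, pred, average="micro")

def correlate_with_humans(metric_scores, human_ratings):
    """Rank correlation between an automated metric and human judgments."""
    return spearmanr(metric_scores, human_ratings)

# Toy usage: 3 stories, 4 possible characters, one multi-hot row per story.
gold = [np.array([[1, 0, 1, 0]]), np.array([[0, 1, 1, 0]]), np.array([[1, 1, 0, 0]])]
pred = [np.array([[1, 0, 0, 0]]), np.array([[0, 1, 1, 0]]), np.array([[1, 0, 0, 1]])]
scores = [character_f1(p, g) for p, g in zip(pred, gold)]
rho, pval = correlate_with_humans(scores, human_ratings=[3.5, 4.8, 2.1])
print(f"Spearman rho={rho:.3f} (p={pval:.3g})")
```

The same correlation check can be applied to caption-relevance and diversity metrics to judge which automated scores are trustworthy proxies for human evaluation.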
Related papers
- TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models [39.06617653124486]
We introduce a new evaluation framework called TypeScore to assess a model's ability to generate images with high-fidelity embedded text.
Our proposed metric demonstrates greater resolution than CLIPScore in differentiating popular image generation models.
arXiv Detail & Related papers (2024-11-02T07:56:54Z)
- TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling [14.15543866199545]
As a cross-modal task, visual storytelling aims to generate a story for an ordered image sequence automatically.
We propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST)
In particular, we pre-extract the topic information of stories from both visual and linguistic perspectives.
arXiv Detail & Related papers (2024-03-18T08:01:23Z)
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architectures and the diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
- StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation [76.44802273236081]
We develop a model StoryDALL-E for story continuation, where the generated visual story is conditioned on a source image.
We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image.
Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.