Word-Level Fine-Grained Story Visualization
- URL: http://arxiv.org/abs/2208.02341v1
- Date: Wed, 3 Aug 2022 21:01:47 GMT
- Title: Word-Level Fine-Grained Story Visualization
- Authors: Bowen Li, Thomas Lukasiewicz
- Abstract summary: Story visualization aims to generate a sequence of images that narrates each sentence in a multi-sentence story with global consistency across dynamic scenes and characters.
Current works still struggle with the quality and consistency of the output images, and rely on additional semantic information or auxiliary captioning networks.
We first introduce a new sentence representation, which incorporates word information from all story sentences to mitigate the inconsistency problem.
Then, we propose a new discriminator with fusion features to improve image quality and story consistency.
- Score: 58.16484259508973
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Story visualization aims to generate a sequence of images to narrate each
sentence in a multi-sentence story with global consistency across dynamic
scenes and characters. Current works still struggle with the quality and
consistency of the output images, and rely on additional semantic information
or auxiliary captioning networks. To address these challenges, we first
introduce a new sentence representation, which incorporates word information
from all story sentences to mitigate the inconsistency problem. Then, we
propose a new discriminator with fusion features and further extend the
spatial attention to improve image quality and story consistency. Extensive
experiments on different datasets and human evaluation demonstrate the
superior performance of our approach compared to state-of-the-art methods,
while using neither segmentation masks nor auxiliary captioning networks.
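The abstract gives no code, but the core idea of the sentence representation, letting each sentence's conditioning vector attend over the words of every sentence in the story, can be shown in a short sketch. The class, shapes, and names below are hypothetical illustrations, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StoryAwareSentenceEncoder(nn.Module):
    """Hypothetical sketch: each sentence embedding attends over the word
    features of ALL story sentences, so shared entities (characters,
    scenes) inform the conditioning vector of every generated frame."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)  # projects per-sentence summaries
        self.key = nn.Linear(dim, dim)    # projects story-wide word features
        self.value = nn.Linear(dim, dim)

    def forward(self, sent_emb, word_emb):
        # sent_emb: (L, dim), one summary vector per story sentence
        # word_emb: (L*T, dim), word features from all L sentences, flattened
        q = self.query(sent_emb)                                 # (L, dim)
        k = self.key(word_emb)                                   # (L*T, dim)
        v = self.value(word_emb)                                 # (L*T, dim)
        attn = F.softmax(q @ k.t() / k.size(-1) ** 0.5, dim=-1)  # (L, L*T)
        # each sentence representation mixes in words from the whole story
        return sent_emb + attn @ v

# toy usage: a 5-sentence story, 12 words per sentence, 256-d features
enc = StoryAwareSentenceEncoder(256)
sents, words = torch.randn(5, 256), torch.randn(5 * 12, 256)
print(enc(sents, words).shape)  # torch.Size([5, 256])
```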
Related papers
- Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning [2.401993998791928]
We propose a framework that trains a lightweight vision-language mapping network to connect modalities.
We also introduce a multimodal contrastive objective that improves visual relevance and story informativeness.
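The summary does not spell out the loss; below is the standard symmetric InfoNCE form that multimodal contrastive objectives of this kind typically take. It is a generic sketch, not the paper's code:

```python
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: matched image/text pairs sit on the
    diagonal of the similarity matrix; all other pairs in the batch act
    as negatives. (Illustrative, not the paper's implementation.)"""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature             # (B, B) similarities
    targets = torch.arange(logits.size(0))
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2

# toy usage with a batch of 8 paired embeddings
print(multimodal_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```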
arXiv Detail & Related papers (2024-08-12T16:15:32Z)
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architectures and the diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
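As a loose illustration of "learning to match semantic levels", one simple mechanism is a learned soft assignment that lets each image feature choose which text-feature level (for instance word, sentence, or story) to align with. The module below is an assumption for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticLevelMatcher(nn.Module):
    """Hypothetical sketch: given text features at several semantic levels
    and an image feature, learn soft weights that select the level(s) the
    image should align with, instead of hard-wiring one level per stage."""

    def __init__(self, dim, num_levels):
        super().__init__()
        self.score = nn.Linear(dim, num_levels)  # image -> level logits

    def forward(self, img_feat, text_levels):
        # img_feat: (B, dim); text_levels: (K, B, dim) for K semantic levels
        weights = F.softmax(self.score(img_feat), dim=-1)       # (B, K)
        # weighted mixture of text levels, one mixture per image
        return torch.einsum('bk,kbd->bd', weights, text_levels)

matcher = SemanticLevelMatcher(256, 3)
print(matcher(torch.randn(4, 256), torch.randn(3, 4, 256)).shape)  # (4, 256)
```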
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
- Guiding Attention using Partial-Order Relationships for Image Captioning [2.620091916172863]
A guided attention network mechanism exploits the relationship between the visual scene and text descriptions.
A pairwise ranking objective is used to train this embedding space, placing similar images, topics, and captions close together in the shared semantic space.
Experimental results on the MSCOCO dataset show the competitiveness of our approach.
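A common concrete form of such a pairwise ranking objective is the max-margin loss used for joint image-text embeddings; the function below is that generic form, not the paper's code:

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(img_emb, cap_emb, margin=0.2):
    """Max-margin ranking over a shared embedding space: every matched
    image-caption pair should outscore every mismatched pair in the batch
    by at least `margin`. (Generic sketch, hypothetical names.)"""
    img = F.normalize(img_emb, dim=-1)
    cap = F.normalize(cap_emb, dim=-1)
    scores = img @ cap.t()            # (B, B) cosine similarities
    pos = scores.diag().unsqueeze(1)  # matched-pair scores
    # hinge in both retrieval directions, ignoring the diagonal itself
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_cap = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)
    cost_img = (margin + scores - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return cost_cap.sum() + cost_img.sum()

print(pairwise_ranking_loss(torch.randn(8, 512), torch.randn(8, 512)))
```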
arXiv Detail & Related papers (2022-04-15T14:22:09Z)
- Towards Coherent Visual Storytelling with Ordered Image Attention [73.422281039592]
We develop Ordered Image Attention (OIA) and Image-Sentence Attention (ISA).
OIA models interactions between the sentence-corresponding image and important regions in other images of the sequence.
To generate the story's sentences, we then highlight important image attention vectors with Image-Sentence Attention (ISA).
arXiv Detail & Related papers (2021-08-04T17:12:39Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
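As a rough sketch of integrating learned hierarchical topics and visual features into the language model's conditioning, assuming a simple pool-and-concatenate fusion (module name and pooling are hypothetical, not the paper's design):

```python
import torch
import torch.nn as nn

class TopicGuidedConditioner(nn.Module):
    """Hypothetical sketch: pool topic vectors from each hierarchy level,
    fuse them with the visual feature, and emit a conditioning vector for
    the paragraph-generating language model."""

    def __init__(self, topic_dim, vis_dim, out_dim):
        super().__init__()
        self.fuse = nn.Linear(topic_dim + vis_dim, out_dim)

    def forward(self, topic_levels, vis_feat):
        # topic_levels: list of (B, topic_dim) tensors, one per level
        topics = torch.stack(topic_levels, dim=0).mean(dim=0)  # pool levels
        return torch.tanh(self.fuse(torch.cat([topics, vis_feat], dim=-1)))

cond = TopicGuidedConditioner(64, 2048, 512)
levels = [torch.randn(4, 64) for _ in range(3)]  # 3-level topic hierarchy
print(cond(levels, torch.randn(4, 2048)).shape)  # torch.Size([4, 512])
```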
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better exploit the semantics available in captions and leverage them to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
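A minimal sketch of this multi-task setup: a shared decoder state feeds separate heads for the next caption word and the object/predicate tag, trained with a weighted joint loss. Layer names and the weighting are assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class MultiTaskCaptionHead(nn.Module):
    """Hypothetical sketch: one shared hidden state, two prediction heads
    trained jointly (word sequence + object/predicate tag sequence)."""

    def __init__(self, hidden, vocab_size, tag_size):
        super().__init__()
        self.word_head = nn.Linear(hidden, vocab_size)
        self.tag_head = nn.Linear(hidden, tag_size)

    def forward(self, h):
        return self.word_head(h), self.tag_head(h)

def joint_loss(word_logits, tag_logits, word_tgt, tag_tgt, alpha=0.5):
    ce = nn.CrossEntropyLoss()
    # alpha balances the auxiliary tag task against word prediction
    return ce(word_logits, word_tgt) + alpha * ce(tag_logits, tag_tgt)

head = MultiTaskCaptionHead(512, 10000, 400)
w_logits, t_logits = head(torch.randn(8, 512))
print(joint_loss(w_logits, t_logits,
                 torch.randint(0, 10000, (8,)),
                 torch.randint(0, 400, (8,))))
```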
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.