Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization
- URL: http://arxiv.org/abs/2110.10834v1
- Date: Thu, 21 Oct 2021 00:16:02 GMT
- Title: Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization
- Authors: Adyasha Maharana, Mohit Bansal
- Abstract summary: We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on the generation of the visual story.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
- Score: 81.26077816854449
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While much research has been done in text-to-image synthesis, little
work has explored the use of the linguistic structure of the input text. Such
information is even more important for story visualization since its inputs
have an explicit narrative structure that needs to be translated into an image
sequence (or visual story). Prior work in this domain has shown that there is
ample room for improvement in the generated image sequence in terms of visual
quality, consistency and relevance. In this paper, we first explore the use of
constituency parse trees using a Transformer-based recurrent architecture for
encoding structured input. Second, we augment the structured input with
commonsense information and study the impact of this external knowledge on the
generation of the visual story. Third, we also incorporate visual structure via
bounding boxes and dense captioning to provide feedback about the
characters/objects in generated images within a dual learning setup. We show
that off-the-shelf dense-captioning models trained on Visual Genome can improve
the spatial structure of images from a different target domain without needing
fine-tuning. We train the model end-to-end using intra-story contrastive loss
(between words and image sub-regions) and show significant improvements in
several metrics (and human evaluation) for multiple datasets. Finally, we
provide an analysis of the linguistic and visuo-spatial information. Code and
data: https://github.com/adymaharana/VLCStoryGan.
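A minimal sketch of one piece of this setup, the intra-story contrastive loss between caption words and image sub-regions: the snippet below is an assumed, InfoNCE-style version with mean-pooled token and region features, not the authors' implementation (which lives in the repository linked above); all names and shapes are placeholders.

    import torch
    import torch.nn.functional as F

    def intra_story_contrastive_loss(word_feats, region_feats, temperature=0.1):
        # word_feats:   (frames, words, dim)   caption token features per frame
        # region_feats: (frames, regions, dim) image sub-region features per frame
        # Frames of the same story serve as in-batch negatives for one another.
        w = F.normalize(word_feats.mean(dim=1), dim=-1)    # (frames, dim)
        r = F.normalize(region_feats.mean(dim=1), dim=-1)  # (frames, dim)
        logits = w @ r.t() / temperature                   # (frames, frames)
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE: match each caption to its own frame and vice versa.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Toy example: a 5-frame story, 12 tokens and 36 regions per frame, dim 256.
    loss = intra_story_contrastive_loss(torch.randn(5, 12, 256), torch.randn(5, 36, 256))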
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
Single source domain generalization holds promise for more reliable and consistent image segmentation across real-world clinical settings.
We propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features.
Our approach achieves favorable performance against existing methods in the literature.
arXiv Detail & Related papers (2024-04-01T17:48:15Z)
- Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning [35.47078178526536]
Recent advancements in pre-trained large-scale language-image models have ushered in a new era of visual comprehension.
This paper tackles two well-known issues within the realm of visual analytics: (1) the efficient exploration of large-scale image datasets and identification of potential data biases within them; (2) the evaluation of image captions and steering of their generation process.
arXiv Detail & Related papers (2023-11-02T06:21:35Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results on the COCO dataset demonstrate that an explicit external memory can aid the generation process and increase caption quality (a sketch of this retrieval-and-attend idea follows this list).
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
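As a rough illustration of the kNN-memory idea shared by the two retrieval-augmented captioning entries above, the sketch below retrieves the k most visually similar items from an external memory and lets a decoder state attend over them. It is an assumption for illustration only, not either paper's actual architecture or code, and all names and shapes are placeholders.

    import torch
    import torch.nn.functional as F

    def knn_retrieve(query, memory_keys, memory_values, k=8):
        # query:         (dim,)    pooled visual feature of the input image
        # memory_keys:   (N, dim)  visual features indexing the external corpus
        # memory_values: (N, dim)  text/caption features stored alongside them
        sims = F.normalize(memory_keys, dim=-1) @ F.normalize(query, dim=0)
        return memory_values[sims.topk(k).indices]         # (k, dim)

    def knn_augmented_attention(decoder_state, retrieved):
        # Single-head attention of one decoder state over the retrieved items.
        scale = decoder_state.size(-1) ** 0.5
        attn = torch.softmax(decoder_state @ retrieved.t() / scale, dim=-1)
        return attn @ retrieved                            # context vector, (dim,)

    # Toy usage: an external memory of 1000 items with 256-dim features.
    keys, values = torch.randn(1000, 256), torch.randn(1000, 256)
    context = knn_augmented_attention(torch.randn(256),
                                      knn_retrieve(torch.randn(256), keys, values))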
This list is automatically generated from the titles and abstracts of the papers on this site.