Reading Between the Lines: Exploring Infilling in Visual Narratives
- URL: http://arxiv.org/abs/2010.13944v1
- Date: Mon, 26 Oct 2020 23:09:09 GMT
- Title: Reading Between the Lines: Exploring Infilling in Visual Narratives
- Authors: Khyathi Raghavi Chandu, Ruo-Ping Dong, Alan Black
- Abstract summary: We present a new large-scale visual procedure telling (ViPT) dataset with a total of 46,200 procedures and around 340k paired images.
We conclusively show a METEOR score of 27.51 on procedures, which is higher than the state-of-the-art on visual storytelling.
- Score: 5.28005598366543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating long-form narratives such as stories and procedures from
multiple modalities has been a long-standing dream for artificial intelligence.
In this regard, crucial subtext is often derived from the surrounding contexts.
General seq2seq training methods leave models ill-equipped to bridge the gap
between these neighbouring contexts. In this paper, we tackle this problem with
\textit{infilling} techniques that predict missing steps in a narrative while
generating textual descriptions from a sequence of images. We also present a
new large-scale \textit{visual procedure telling} (ViPT) dataset with a total
of 46,200 procedures and around 340k paired images and textual descriptions
that is rich in such contextual dependencies. Generating steps with the
infilling technique proves effective for visual procedures, yielding more
coherent texts. We conclusively show a METEOR score of 27.51 on procedures,
which is higher than the state-of-the-art on visual storytelling. We also
demonstrate the effects of interposing new text with missing images during
inference. The code and the dataset will be publicly available at
https://visual-narratives.github.io/Visual-Narratives/.
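As a rough illustration of the infilling setup and the METEOR evaluation mentioned in the abstract, the sketch below masks one step of a procedure, treats the hidden text as the prediction target, and scores a hand-written "generated" step with METEOR. This is a minimal sketch under assumed conventions rather than the paper's released code: the MASK token, the example structure, and the sample texts are hypothetical, and the METEOR call assumes NLTK >= 3.7 (pre-tokenized inputs) with the WordNet data available.

```python
# Minimal sketch (not the authors' code) of (1) constructing an infilling
# example from a visual procedure and (2) scoring a generated step with METEOR.
# The MASK token and the sample procedure below are hypothetical.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet for synonym matching

MASK = "<MASK>"  # hypothetical placeholder for the missing step's text

# A procedure is assumed to be an ordered list of (image_id, step_text) pairs.
procedure = [
    ("img_001", "gather flour, eggs, sugar and butter"),
    ("img_002", "beat the eggs with the sugar until pale and fluffy"),
    ("img_003", "fold in the flour and pour the batter into a tin"),
    ("img_004", "bake until golden and let the cake cool"),
]

def make_infilling_example(steps, missing_idx):
    """Hide one step's text: the surrounding steps (and, in the paper, their
    images) form the conditioning context; the hidden text is the target."""
    context = [
        (image_id, MASK if i == missing_idx else text)
        for i, (image_id, text) in enumerate(steps)
    ]
    target = steps[missing_idx][1]
    return context, target

context, target = make_infilling_example(procedure, missing_idx=1)

# A trained model would generate the missing step from `context`; we fake one.
generated = "whisk the eggs and sugar together until the mixture turns pale"

# METEOR over tokenized reference/hypothesis pairs, as reported in the abstract.
score = meteor_score([target.split()], generated.split())
print(f"METEOR for the infilled step: {score:.3f}")
```

In the paper the context also carries the image for every step; this text-only sketch omits visual features and only illustrates how an infilling target and its METEOR score could be set up.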
Related papers
- Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling [81.69474860607542]
We present Openstory++, a large-scale dataset combining additional instance-level annotations with both images and text.
We also present Cohere-Bench, a pioneering benchmark framework for evaluating the image generation tasks when long multimodal context is provided.
arXiv Detail & Related papers (2024-08-07T11:20:37Z)
- Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model [25.47573567479831]
We propose a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques.
Our method is out-of-the-box and does not require fine-tuning or optimization.
arXiv Detail & Related papers (2024-05-16T17:59:21Z)
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate descriptions in natural language of the subject's apparent emotion.
In the second stage, the descriptions are used as contextual information and, along with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling [14.15543866199545]
As a cross-modal task, visual storytelling aims to generate a story for an ordered image sequence automatically.
We propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST)
In particular, we pre-extracted the topic information of stories from both visual and linguistic perspectives.
arXiv Detail & Related papers (2024-03-18T08:01:23Z)
- COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training [119.03392147066093]
Recent autoregressive vision-language models have excelled in few-shot text generation tasks but face challenges in alignment tasks.
We introduce a contrastive loss into text generation models, partitioning the language model into dedicated components for unimodal text processing and multimodal data handling.
To bridge this gap, this work introduces VideoDatasetName, an inaugural interleaved video-text dataset featuring comprehensive captions.
arXiv Detail & Related papers (2024-01-01T18:58:42Z)
- Text-Only Training for Visual Storytelling [107.19873669536523]
We formulate visual storytelling as a visual-conditioned story generation problem.
We propose a text-only training method that separates the learning of cross-modality alignment and story generation.
arXiv Detail & Related papers (2023-08-17T09:32:17Z)
- Image Captioning with Multi-Context Synthetic Data [16.961112970612447]
Large models have excelled in producing high-quality images and text.
We present an innovative pipeline that introduces multi-context data generation.
Our model is exclusively trained on synthetic image-text pairs crafted through this process.
arXiv Detail & Related papers (2023-05-29T13:18:59Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on the generation of visual stories.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.