Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling
- URL: http://arxiv.org/abs/2002.00774v1
- Date: Mon, 3 Feb 2020 14:22:18 GMT
- Title: Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling
- Authors: Yunjae Jung, Dahun Kim, Sanghyun Woo, Kyungsu Kim, Sungjin Kim, In So
Kweon
- Abstract summary: We propose to explicitly learn to imagine a storyline that bridges the visual gap.
We train the network to produce a full, plausible story even with missing photo(s).
In experiments, we show that our hide-and-tell scheme and network design are indeed effective at storytelling.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual storytelling is a task of creating a short story based on photo
streams. Unlike existing visual captioning, storytelling aims to convey not
only factual descriptions, but also human-like narration and semantics.
However, the VIST dataset consists only of a small, fixed number of photos per
story. Therefore, the main challenge of visual storytelling is to fill in the
visual gap between photos with a narrative and imaginative story. In this paper,
we propose to explicitly learn to imagine a storyline that bridges the visual
gap. During training, one or more photos are randomly omitted from the input
stack, and we train the network to produce a full plausible story even with
missing photo(s). Furthermore, we propose a hide-and-tell model for visual
storytelling, designed to learn non-local relations across the photo streams
and to refine and improve conventional RNN-based models. In
experiments, we show that our hide-and-tell scheme and the network design
are indeed effective at storytelling, and that our model outperforms previous
state-of-the-art methods in automatic metrics. Finally, we qualitatively show
the learned ability to interpolate a storyline over visual gaps.
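To make the training scheme described in the abstract concrete, here is a minimal sketch of hide-and-tell-style photo masking, assuming PyTorch tensors of CNN photo features; the function names (`hide_photos`, `training_step`) and the shapes are illustrative assumptions, not the authors' released code.

```python
import torch

def hide_photos(photo_feats: torch.Tensor, max_hide: int = 2) -> torch.Tensor:
    """photo_feats: (num_photos, feat_dim) CNN features for one photo stream."""
    num_photos = photo_feats.size(0)
    num_hide = torch.randint(1, max_hide + 1, (1,)).item()  # hide 1..max_hide photos
    hide_idx = torch.randperm(num_photos)[:num_hide]
    masked = photo_feats.clone()
    masked[hide_idx] = 0.0  # zero out the features of the hidden photos
    return masked

def training_step(model, photo_feats, story_tokens, criterion):
    """Hypothetical step: the model must still emit the full story for every photo slot."""
    masked_feats = hide_photos(photo_feats)
    logits = model(masked_feats)  # assumed shape: (num_photos, seq_len, vocab_size)
    return criterion(logits.view(-1, logits.size(-1)), story_tokens.view(-1))
```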
Related papers
- TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling [14.15543866199545]
As a cross-modal task, visual storytelling aims to generate a story for an ordered image sequence automatically.
We propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST).
In particular, we pre-extracted the topic information of stories from both visual and linguistic perspectives.
arXiv Detail & Related papers (2024-03-18T08:01:23Z)
- SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling [12.560014305032437]
This paper introduces SCO-VIST, a framework representing the image sequence as a graph with objects and relations.
SCO-VIST then takes this graph representing plot points and creates bridges between plot points with semantic and occurrence-based edge weights.
This weighted story graph then produces the storyline as a sequence of events via the Floyd-Warshall algorithm (illustrated in the sketch after this entry).
arXiv Detail & Related papers (2024-02-01T04:09:17Z)
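For orientation only, the snippet below sketches how a storyline could be read off a weighted story graph with the Floyd-Warshall algorithm, as the SCO-VIST summary above describes; the example graph, the start/end plot points, and the function name are illustrative assumptions rather than the SCO-VIST implementation.

```python
import math

def floyd_warshall_storyline(weights, start, end):
    """weights: dict {(u, v): cost} over plot-point indices 0..n-1."""
    n = 1 + max(max(u, v) for u, v in weights)
    dist = [[math.inf] * n for _ in range(n)]
    nxt = [[None] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
    for (u, v), w in weights.items():
        dist[u][v] = w
        nxt[u][v] = v
    # Standard Floyd-Warshall all-pairs shortest paths with path reconstruction.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    if nxt[start][end] is None:
        return []
    path = [start]  # the reconstructed plot-point sequence is read as the storyline
    while path[-1] != end:
        path.append(nxt[path[-1]][end])
    return path

# Example: lower edge weight = stronger semantic/occurrence link between plot points.
edges = {(0, 1): 1.0, (1, 2): 0.5, (0, 2): 2.0, (2, 3): 1.0}
print(floyd_warshall_storyline(edges, 0, 3))  # e.g. [0, 1, 2, 3]
```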
- Visual Storytelling with Question-Answer Plans [70.89011289754863]
We present a novel framework which integrates visual representations with pretrained language models and planning.
Our model translates the image sequence into a visual prefix, a sequence of continuous embeddings which language models can interpret (see the sketch after this entry).
It also leverages a sequence of question-answer pairs as a blueprint plan for selecting salient visual concepts and determining how they should be assembled into a narrative.
arXiv Detail & Related papers (2023-10-08T21:45:34Z)
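As a rough illustration of the visual-prefix idea summarized above, the sketch below projects image features into a short sequence of language-model-sized embeddings and prepends them to token embeddings; the dimensions, the module name `VisualPrefix`, and the prefix length are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    def __init__(self, img_dim=2048, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        # Map each image feature vector to `prefix_len` LM-sized embeddings.
        self.proj = nn.Linear(img_dim, prefix_len * lm_dim)

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (num_images, img_dim) -> (num_images * prefix_len, lm_dim)
        prefix = self.proj(img_feats).view(-1, self.prefix_len, self.lm_dim)
        return prefix.reshape(-1, self.lm_dim)

# Usage: concatenate the visual prefix with token embeddings before the LM decoder.
img_feats = torch.randn(5, 2048)                  # five photos in the stream
token_emb = torch.randn(30, 768)                  # embedded story tokens so far
prefix = VisualPrefix()(img_feats)                # (5 * 10, 768)
lm_input = torch.cat([prefix, token_emb], dim=0)  # fed to the language model
print(lm_input.shape)                             # torch.Size([80, 768])
```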
- Text-Only Training for Visual Storytelling [107.19873669536523]
We formulate visual storytelling as a visual-conditioned story generation problem.
We propose a text-only training method that separates the learning of cross-modality alignment and story generation.
arXiv Detail & Related papers (2023-08-17T09:32:17Z)
- Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models [70.86603627188519]
We focus on a novel, yet challenging task of generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling.
We propose a learning-based auto-regressive image generation model, termed StoryGen, with a novel vision-language context module.
We show that StoryGen can generalize to unseen characters without any optimization, and generate image sequences with coherent content and consistent characters.
arXiv Detail & Related papers (2023-06-01T17:58:50Z)
- Album Storytelling with Iterative Story-aware Captioning and Large Language Models [86.6548090965982]
We study how to transform an album into vivid and coherent stories, a task we refer to as "album storytelling".
With recent advances in Large Language Models (LLMs), it is now possible to generate lengthy, coherent text.
Our method effectively generates more accurate and engaging stories for albums, with enhanced coherence and vividness.
arXiv Detail & Related papers (2023-05-22T11:45:10Z)
- What You Say Is What You Show: Visual Narration Detection in Instructional Videos [108.77600799637172]
We introduce the novel task of visual narration detection, which entails determining whether a narration is visually depicted by the actions in the video.
We propose What You Say is What You Show (WYS2), a method that leverages multi-modal cues and pseudo-labeling to learn to detect visual narrations with only weakly labeled data.
Our model successfully detects visual narrations in in-the-wild videos, outperforming strong baselines, and we demonstrate its impact for state-of-the-art summarization and temporal alignment of instructional videos.
arXiv Detail & Related papers (2023-01-05T21:43:19Z)