Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling
- URL: http://arxiv.org/abs/2002.00774v1
- Date: Mon, 3 Feb 2020 14:22:18 GMT
- Title: Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling
- Authors: Yunjae Jung, Dahun Kim, Sanghyun Woo, Kyungsu Kim, Sungjin Kim, In So
Kweon
- Abstract summary: We propose to explicitly learn to imagine a storyline that bridges the visual gap.
We train the network to produce a full, plausible story even with missing photo(s).
In experiments, we show that our hide-and-tell scheme and network design are indeed effective at storytelling.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual storytelling is a task of creating a short story based on photo
streams. Unlike existing visual captioning, storytelling aims to convey not
only factual descriptions, but also human-like narration and semantics.
However, the VIST dataset consists only of a small, fixed number of photos per
story. Therefore, the main challenge of visual storytelling is to fill in the
visual gap between photos with a narrative and imaginative story. In this paper,
we propose to explicitly learn to imagine a storyline that bridges the visual
gap. During training, one or more photos are randomly omitted from the input
stack, and we train the network to produce a full plausible story even with
missing photo(s). Furthermore, we propose a hide-and-tell model for visual
storytelling, designed to learn non-local relations across the photo streams
and to refine and improve conventional RNN-based models. In
experiments, we show that our hide-and-tell scheme and the network design
are indeed effective at storytelling, and that our model outperforms previous
state-of-the-art methods in automatic metrics. Finally, we qualitatively show
the learned ability to interpolate a storyline over visual gaps.
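To make the training scheme described in the abstract concrete, here is a minimal sketch of hide-and-tell-style photo masking, assuming PyTorch tensors of CNN photo features; the function names (`hide_photos`, `training_step`) and the shapes are illustrative assumptions, not the authors' released code.

```python
import torch

def hide_photos(photo_feats: torch.Tensor, max_hide: int = 2) -> torch.Tensor:
    """photo_feats: (num_photos, feat_dim) CNN features for one photo stream."""
    num_photos = photo_feats.size(0)
    num_hide = torch.randint(1, max_hide + 1, (1,)).item()  # hide 1..max_hide photos
    hide_idx = torch.randperm(num_photos)[:num_hide]
    masked = photo_feats.clone()
    masked[hide_idx] = 0.0  # zero out the features of the hidden photos
    return masked

def training_step(model, photo_feats, story_tokens, criterion):
    """Hypothetical step: the model must still emit the full story for every photo slot."""
    masked_feats = hide_photos(photo_feats)
    logits = model(masked_feats)  # assumed shape: (num_photos, seq_len, vocab_size)
    return criterion(logits.view(-1, logits.size(-1)), story_tokens.view(-1))
```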
Related papers
- TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling [14.15543866199545]
As a cross-modal task, visual storytelling aims to generate a story for an ordered image sequence automatically.
We propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST).
In particular, we pre-extracted the topic information of stories from both visual and linguistic perspectives.
arXiv Detail & Related papers (2024-03-18T08:01:23Z)
- SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling [12.560014305032437]
This paper introduces SCO-VIST, a framework representing the image sequence as a graph with objects and relations.
SCO-VIST then takes this graph representing plot points and creates bridges between plot points with semantic and occurrence-based edge weights.
This weighted story graph then produces the storyline as a sequence of events via the Floyd-Warshall algorithm (illustrated in the sketch after this entry).
arXiv Detail & Related papers (2024-02-01T04:09:17Z)
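For orientation only, the snippet below sketches how a storyline could be read off a weighted story graph with the Floyd-Warshall algorithm, as the SCO-VIST summary above describes; the example graph, the start/end plot points, and the function name are illustrative assumptions rather than the SCO-VIST implementation.

```python
import math

def floyd_warshall_storyline(weights, start, end):
    """weights: dict {(u, v): cost} over plot-point indices 0..n-1."""
    n = 1 + max(max(u, v) for u, v in weights)
    dist = [[math.inf] * n for _ in range(n)]
    nxt = [[None] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
    for (u, v), w in weights.items():
        dist[u][v] = w
        nxt[u][v] = v
    # Standard Floyd-Warshall all-pairs shortest paths with path reconstruction.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    if nxt[start][end] is None:
        return []
    path = [start]  # the reconstructed plot-point sequence is read as the storyline
    while path[-1] != end:
        path.append(nxt[path[-1]][end])
    return path

# Example: lower edge weight = stronger semantic/occurrence link between plot points.
edges = {(0, 1): 1.0, (1, 2): 0.5, (0, 2): 2.0, (2, 3): 1.0}
print(floyd_warshall_storyline(edges, 0, 3))  # e.g. [0, 1, 2, 3]
```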
- Visual Storytelling with Question-Answer Plans [70.89011289754863]
We present a novel framework which integrates visual representations with pretrained language models and planning.
Our model translates the image sequence into a visual prefix, a sequence of continuous embeddings which language models can interpret (see the sketch after this entry).
It also leverages a sequence of question-answer pairs as a blueprint plan for selecting salient visual concepts and determining how they should be assembled into a narrative.
arXiv Detail & Related papers (2023-10-08T21:45:34Z)
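As a rough illustration of the visual-prefix idea summarized above, the sketch below projects image features into a short sequence of language-model-sized embeddings and prepends them to token embeddings; the dimensions, the module name `VisualPrefix`, and the prefix length are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    def __init__(self, img_dim=2048, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        # Map each image feature vector to `prefix_len` LM-sized embeddings.
        self.proj = nn.Linear(img_dim, prefix_len * lm_dim)

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (num_images, img_dim) -> (num_images * prefix_len, lm_dim)
        prefix = self.proj(img_feats).view(-1, self.prefix_len, self.lm_dim)
        return prefix.reshape(-1, self.lm_dim)

# Usage: concatenate the visual prefix with token embeddings before the LM decoder.
img_feats = torch.randn(5, 2048)                  # five photos in the stream
token_emb = torch.randn(30, 768)                  # embedded story tokens so far
prefix = VisualPrefix()(img_feats)                # (5 * 10, 768)
lm_input = torch.cat([prefix, token_emb], dim=0)  # fed to the language model
print(lm_input.shape)                             # torch.Size([80, 768])
```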
- Text-Only Training for Visual Storytelling [107.19873669536523]
We formulate visual storytelling as a visual-conditioned story generation problem.
We propose a text-only training method that separates the learning of cross-modality alignment and story generation.
arXiv Detail & Related papers (2023-08-17T09:32:17Z)
- Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models [70.86603627188519]
We focus on a novel, yet challenging task of generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling.
We propose a learning-based auto-regressive image generation model, termed StoryGen, with a novel vision-language context module.
We show that StoryGen can generalize to unseen characters without any optimization, and generate image sequences with coherent content and consistent characters.
arXiv Detail & Related papers (2023-06-01T17:58:50Z)
- Album Storytelling with Iterative Story-aware Captioning and Large Language Models [86.6548090965982]
We study how to transform an album into vivid and coherent stories, a task we refer to as "album storytelling".
With recent advances in Large Language Models (LLMs), it is now possible to generate lengthy, coherent text.
Our method effectively generates more accurate and engaging stories for albums, with enhanced coherence and vividness.
arXiv Detail & Related papers (2023-05-22T11:45:10Z)
- What You Say Is What You Show: Visual Narration Detection in Instructional Videos [108.77600799637172]
We introduce the novel task of visual narration detection, which entails determining whether a narration is visually depicted by the actions in the video.
We propose What You Say is What You Show (WYS2), a method that leverages multi-modal cues and pseudo-labeling to learn to detect visual narrations with only weakly labeled data.
Our model successfully detects visual narrations in in-the-wild videos, outperforming strong baselines, and we demonstrate its impact for state-of-the-art summarization and temporal alignment of instructional videos.
arXiv Detail & Related papers (2023-01-05T21:43:19Z)