From Image Captioning to Visual Storytelling
- URL: http://arxiv.org/abs/2508.14045v1
- Date: Thu, 31 Jul 2025 16:44:23 GMT
- Title: From Image Captioning to Visual Storytelling
- Authors: Admitos Passadakis, Yingjin Song, Albert Gatt
- Abstract summary: The aim of this work is to balance between these aspects by treating Visual Storytelling as a superset of Image Captioning. This means that we first employ a vision-to-language model to obtain captions of the input images; these captions are then transformed into coherent narratives using language-to-language methods. Our evaluation shows that integrating captioning and storytelling under a unified framework has a positive impact on the quality of the produced stories.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Visual Storytelling is a challenging multimodal task between Vision & Language, where the purpose is to generate a story for a stream of images. Its difficulty lies in the fact that the story should be both grounded in the image sequence and also narrative and coherent. The aim of this work is to balance between these aspects by treating Visual Storytelling as a superset of Image Captioning, an approach quite different from most prior relevant studies. This means that we first employ a vision-to-language model to obtain captions of the input images; these captions are then transformed into coherent narratives using language-to-language methods. Our multifarious evaluation shows that integrating captioning and storytelling under a unified framework has a positive impact on the quality of the produced stories. In addition, compared to numerous previous studies, this approach accelerates training and makes our framework readily reusable and reproducible by anyone interested. Lastly, we propose a new metric/tool, named ideality, which can be used to simulate how far some results are from an oracle model, and we apply it to emulate human-likeness in visual storytelling.
Related papers
- Generating Storytelling Images with Rich Chains-of-Reasoning [38.363486512993816]
We focus on semantically rich images and define them as Storytelling Images. Storytelling Images have diverse applications beyond illustration creation and cognitive screening. We introduce the Storytelling Image Generation task, which explores how generative AI models can be leveraged to create such images.
arXiv Detail & Related papers (2025-12-08T06:18:44Z)
- Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning [2.401993998791928]
We propose a framework that trains a lightweight vision-language mapping network to connect modalities.
We introduce a multimodal contrastive objective that also improves visual relevance and story informativeness.
arXiv Detail & Related papers (2024-08-12T16:15:32Z)
- TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling [14.15543866199545]
As a cross-modal task, visual storytelling aims to generate a story for an ordered image sequence automatically.
We propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST)
In particular, we pre-extract the topic information of stories from both visual and linguistic perspectives.
arXiv Detail & Related papers (2024-03-18T08:01:23Z)
- Visual Storytelling with Question-Answer Plans [70.89011289754863]
We present a novel framework which integrates visual representations with pretrained language models and planning.
Our model translates the image sequence into a visual prefix, a sequence of continuous embeddings which language models can interpret.
It also leverages a sequence of question-answer pairs as a blueprint plan for selecting salient visual concepts and determining how they should be assembled into a narrative.
arXiv Detail & Related papers (2023-10-08T21:45:34Z)
- Text-Only Training for Visual Storytelling [107.19873669536523]
We formulate visual storytelling as a visual-conditioned story generation problem.
We propose a text-only training method that separates the learning of cross-modality alignment and story generation.
arXiv Detail & Related papers (2023-08-17T09:32:17Z)
- Album Storytelling with Iterative Story-aware Captioning and Large Language Models [86.6548090965982]
We study how to transform an album into vivid and coherent stories, a task we refer to as "album storytelling".
With recent advances in Large Language Models (LLMs), it is now possible to generate lengthy, coherent text.
Our method effectively generates more accurate and engaging stories for albums, with enhanced coherence and vividness.
arXiv Detail & Related papers (2023-05-22T11:45:10Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling [86.42719129731907]
We propose to explicitly learn to imagine a storyline that bridges the visual gap.
We train the network to produce a full, plausible story even with missing photo(s).
In experiments, we show that our hide-and-tell scheme and the network design are indeed effective at storytelling.
arXiv Detail & Related papers (2020-02-03T14:22:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.