Generating Storytelling Images with Rich Chains-of-Reasoning
- URL: http://arxiv.org/abs/2512.07198v1
- Date: Mon, 08 Dec 2025 06:18:44 GMT
- Title: Generating Storytelling Images with Rich Chains-of-Reasoning
- Authors: Xiujie Song, Qi Jia, Shota Watanabe, Xiaoyi Pang, Ruijie Chen, Mengyue Wu, Kenny Q. Zhu
- Abstract summary: We focus on semantically rich images and define them as Storytelling Images. Storytelling Images have diverse applications beyond illustration creation and cognitive screening. We introduce the Storytelling Image Generation task, which explores how generative AI models can be leveraged to create such images.
- Score: 38.363486512993816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An image can convey a compelling story by presenting rich, logically connected visual clues. These connections form Chains-of-Reasoning (CoRs) within the image, enabling viewers to infer events, causal relationships, and other information, thereby understanding the underlying story. In this paper, we focus on these semantically rich images and define them as Storytelling Images. Such images have diverse applications beyond illustration creation and cognitive screening, leveraging their ability to convey multi-layered information visually and inspire active interpretation. However, due to their complex semantic nature, Storytelling Images are inherently challenging to create, and thus remain relatively scarce. To address this challenge, we introduce the Storytelling Image Generation task, which explores how generative AI models can be leveraged to create such images. Specifically, we propose a two-stage pipeline, StorytellingPainter, which combines the creative reasoning abilities of Large Language Models (LLMs) with the visual synthesis capabilities of Text-to-Image (T2I) models to generate Storytelling Images. Alongside this pipeline, we develop a dedicated evaluation framework comprising three main evaluators: a Semantic Complexity Evaluator, a KNN-based Diversity Evaluator and a Story-Image Alignment Evaluator. Given the critical role of story generation in the Storytelling Image Generation task and the performance disparity between open-source and proprietary LLMs, we further explore tailored training strategies to reduce this gap, resulting in a series of lightweight yet effective models named Mini-Storytellers. Experimental results demonstrate the feasibility and effectiveness of our approaches. The code is available at https://github.com/xiujiesong/StorytellingImageGeneration.
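The abstract names a KNN-based Diversity Evaluator but does not specify its formulation. A common way to score the diversity of a set of generated images is to embed each one and take the mean distance from every embedding to its k nearest neighbors; a minimal sketch of that idea follows (the function name and the exact distance/aggregation choices are illustrative assumptions, not the paper's evaluator):

```python
import numpy as np

def knn_diversity(embeddings: np.ndarray, k: int = 3) -> float:
    """Mean Euclidean distance from each embedding to its k nearest neighbors.

    Higher values suggest a more diverse set of generated images.
    `embeddings` has shape (n_samples, dim). This is a generic
    formulation for illustration, not the paper's exact metric.
    """
    # Pairwise Euclidean distance matrix via broadcasting.
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Exclude each sample's zero distance to itself.
    np.fill_diagonal(dists, np.inf)
    # For each sample, average its k smallest distances to others.
    knn = np.sort(dists, axis=1)[:, :k]
    return float(knn.mean())
```

On a set of identical embeddings the score is 0; spreading the embeddings apart increases it, which matches the intuition that near-duplicate generations should be penalized.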
Related papers
- LLMs Behind the Scenes: Enabling Narrative Scene Illustration [16.01007495951529]
We focus on the task of narrative scene illustration, which involves automatically generating an image depicting a scene in a story. Motivated by recent progress on text-to-image models, we consider a pipeline that uses LLMs as an interface for prompting text-to-image models. We apply variations of this pipeline to a prominent story corpus in order to synthesize illustrations for scenes in these stories.
arXiv Detail & Related papers (2025-09-26T21:15:18Z) - From Image Captioning to Visual Storytelling [2.0333131475480917]
The aim of this work is to balance between these aspects, by treating Visual Storytelling as a superset of Image Captioning. This means that we firstly employ a vision-to-language model for obtaining captions of the input images, and then, these captions are transformed into coherent narratives using language-to-language methods. Our evaluation shows that integrating captioning and storytelling under a unified framework has a positive impact on the quality of the produced stories.
arXiv Detail & Related papers (2025-07-31T16:44:23Z) - StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion [78.1014542102578]
Story visualization aims to generate realistic and coherent images based on a storyline.
Current models adopt a frame-by-frame architecture by transforming the pre-trained text-to-image model into an auto-regressive manner.
We propose a bidirectional, unified, and efficient framework, namely StoryImager.
arXiv Detail & Related papers (2024-04-09T03:22:36Z) - TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling [14.15543866199545]
As a cross-modal task, visual storytelling aims to generate a story for an ordered image sequence automatically.
We propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST)
In particular, we pre-extracted the topic information of stories from both visual and linguistic perspectives.
arXiv Detail & Related papers (2024-03-18T08:01:23Z) - MagicScroll: Nontypical Aspect-Ratio Image Generation for Visual Storytelling via Multi-Layered Semantic-Aware Denoising [42.20750912837316]
MagicScroll is a progressive diffusion-based image generation framework with a novel semantic-aware denoising process.
It enables fine-grained control over the generated image on object, scene, and background levels with text, image, and layout conditions.
It showcases promising results in aligning with the narrative text, improving visual coherence, and engaging the audience.
arXiv Detail & Related papers (2023-12-18T03:09:05Z) - Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models [70.86603627188519]
We focus on a novel, yet challenging task of generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling.
We propose a learning-based auto-regressive image generation model, termed as StoryGen, with a novel vision-language context module.
We show StoryGen can generalize to unseen characters without any optimization, and generate image sequences with coherent content and consistent character.
arXiv Detail & Related papers (2023-06-01T17:58:50Z) - StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation [76.44802273236081]
We develop a model StoryDALL-E for story continuation, where the generated visual story is conditioned on a source image.
We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image.
Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z) - Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z) - Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling [86.42719129731907]
We propose to explicitly learn to imagine a storyline that bridges the visual gap.
We train the network to produce a full plausible story even with missing photo(s).
In experiments, we show that our scheme of hide-and-tell, and the network design are indeed effective at storytelling.
arXiv Detail & Related papers (2020-02-03T14:22:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.