ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
- URL: http://arxiv.org/abs/2505.24862v2
- Date: Wed, 25 Jun 2025 14:57:33 GMT
- Title: ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
- Authors: Cailin Zhuang, Ailin Huang, Wei Cheng, Jingwei Wu, Yaoqi Hu, Jiaqi Liao, Zhewei Huang, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, Xuanyang Zhang, Xianfang Zeng, Gang Yu, Chi Zhang,
- Abstract summary: ViStoryBench is an evaluation benchmark for story visualization models.<n>It features stories with single and multiple protagonists to test models' ability to maintain character consistency.<n>It includes complex plots and intricate world-building to challenge models in generating accurate visuals.
- Score: 23.274981415638837
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Story visualization, which aims to generate a sequence of visually coherent images aligning with a given narrative and reference images, has seen significant progress with recent advancements in generative models. To further enhance the performance of story visualization frameworks in real-world scenarios, we introduce a comprehensive evaluation benchmark, ViStoryBench. We collect a diverse dataset encompassing various story types and artistic styles, ensuring models are evaluated across multiple dimensions such as different plots (e.g., comedy, horror) and visual aesthetics (e.g., anime, 3D renderings). ViStoryBench is carefully curated to balance narrative structures and visual elements, featuring stories with single and multiple protagonists to test models' ability to maintain character consistency. Additionally, it includes complex plots and intricate world-building to challenge models in generating accurate visuals. To ensure comprehensive comparisons, our benchmark incorporates a wide range of evaluation metrics assessing critical aspects. This structured and multifaceted framework enables researchers to thoroughly identify both the strengths and weaknesses of different models, fostering targeted improvements.
Related papers
- STAGE: A Benchmark for Knowledge Graph Construction, Question Answering, and In-Script Role-Playing over Movie Screenplays [16.069095458601588]
We introduce STAGE (Screenplay Text, Agents, Graphs and Evaluation), a unified benchmark for narrative understanding over full-length movie screenplays.<n> STAGE defines four tasks: knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script character role-playing, all grounded in a shared narrative world representation.<n>The benchmark provides cleaned scripts, curated knowledge graphs, and event- and character-centric annotations for 150 films across English and Chinese, enabling holistic evaluation of models' abilities to build world representations, abstract and verify narrative events, reason over long narratives, and generate
arXiv Detail & Related papers (2026-01-13T12:50:58Z) - StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation [0.2455468619225742]
Visual storytelling systems struggle to maintain character identity across frames and link actions to appropriate subjects.<n>We propose StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie images.<n>We create Qwen Storyteller, which performs end-to-end object detection, re-identification, and landmark detection while maintaining consistent object references throughout the story.
arXiv Detail & Related papers (2025-05-15T13:42:14Z) - Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming [44.32980579195508]
We introduce Generate Any Scene, a framework that enumerates scene graphs representing a vast array of visual scenes.<n> Generate Any Scene translates each scene graph into a caption, enabling scalable evaluation of text-to-vision models.<n>We conduct extensive evaluations across text-to-image, text-to-video, and text-to-3D models, presenting key findings on model performance.
arXiv Detail & Related papers (2024-12-11T09:17:39Z) - StoryWeaver: A Unified World Model for Knowledge-Enhanced Story Character Customization [36.14275850149665]
We propose a novel knowledge graph, namely Character Graph (textbfCG), which comprehensively represents various story-related knowledge.<n>We then introduce StoryWeaver, an image generator that achieve Customization via Character Graph (textbfC-CG), capable of consistent story visualization with rich text semantics.
arXiv Detail & Related papers (2024-12-10T10:16:50Z) - What Makes a Scene ? Scene Graph-based Evaluation and Feedback for Controllable Generation [29.42202665594218]
We introduce Scene-Bench, a comprehensive benchmark designed to evaluate and enhance the factual consistency in generating natural scenes.<n>Scene-Bench comprises MegaSG, a large-scale dataset of one million images annotated with scene graphs, and SGScore, a novel evaluation metric.<n>We develop a scene graph feedback pipeline that iteratively refines generated images by identifying and correcting discrepancies between the scene graph and the image.
arXiv Detail & Related papers (2024-11-23T03:40:25Z) - KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities [93.74881034001312]
We conduct a systematic study on the fidelity of entities in text-to-image generation models.
We focus on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals.
Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details.
arXiv Detail & Related papers (2024-10-15T17:50:37Z) - Generating Visual Stories with Grounded and Coreferent Characters [63.07511918366848]
We present the first model capable of predicting visual stories with consistently grounded and coreferent character mentions.<n>Our model is finetuned on a new dataset which we build on top of the widely used VIST benchmark.<n>We also propose new evaluation metrics to measure the richness of characters and coreference in stories.
arXiv Detail & Related papers (2024-09-20T14:56:33Z) - ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context [50.572907418430155]
ContextualStory is a framework designed to generate coherent story frames and extend frames for visual storytelling.<n>We introduce a Storyline Contextualizer to enrich context in storyline embedding, and a StoryFlow Adapter to measure scene changes between frames.<n>Experiments on PororoSV and FlintstonesSV datasets demonstrate that ContextualStory significantly outperforms existing SOTA methods in both story visualization and continuation.
arXiv Detail & Related papers (2024-07-13T05:02:42Z) - Evolving Storytelling: Benchmarks and Methods for New Character Customization with Diffusion Models [79.21968152209193]
We introduce the NewEpisode benchmark to evaluate generative models' adaptability in generating new stories with fresh characters.
We propose EpicEvo, a method that customizes a diffusion-based visual story generation model with a single story featuring the new characters seamlessly integrating them into established character dynamics.
arXiv Detail & Related papers (2024-05-20T07:54:03Z) - StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion [78.1014542102578]
Story visualization aims to generate realistic and coherent images based on a storyline.
Current models adopt a frame-by-frame architecture by transforming the pre-trained text-to-image model into an auto-regressive manner.
We propose a bidirectional, unified, and efficient framework, namely StoryImager.
arXiv Detail & Related papers (2024-04-09T03:22:36Z) - TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling [14.15543866199545]
As a cross-modal task, visual storytelling aims to generate a story for an ordered image sequence automatically.
We propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST)
In particular, we pre-extracted the topic information of stories from both visual and linguistic perspectives.
arXiv Detail & Related papers (2024-03-18T08:01:23Z) - Panel Transitions for Genre Analysis in Visual Narratives [1.320904960556043]
We present a novel approach to do a multi-modal analysis of genre based on comics and manga-style visual narratives.
We highlight some of the limitations and challenges of our existing computational approaches in modeling subjective labels.
arXiv Detail & Related papers (2023-12-14T08:05:09Z) - Visual Storytelling with Question-Answer Plans [70.89011289754863]
We present a novel framework which integrates visual representations with pretrained language models and planning.
Our model translates the image sequence into a visual prefix, a sequence of continuous embeddings which language models can interpret.
It also leverages a sequence of question-answer pairs as a blueprint plan for selecting salient visual concepts and determining how they should be assembled into a narrative.
arXiv Detail & Related papers (2023-10-08T21:45:34Z) - Make-A-Story: Visual Memory Conditioned Consistent Story Generation [57.691064030235985]
We propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context.
Our method outperforms prior state-of-the-art in generating frames with high visual quality.
Our experiments for story generation on the MUGEN, the PororoSV and the FlintstonesSV dataset show that our method not only outperforms prior state-of-the-art in generating frames with high visual quality, but also models appropriate correspondences between the characters and the background.
arXiv Detail & Related papers (2022-11-23T21:38:51Z) - StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story
Continuation [76.44802273236081]
We develop a model StoryDALL-E for story continuation, where the generated visual story is conditioned on a source image.
We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image.
Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z) - Word-Level Fine-Grained Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story with a global consistency across dynamic scenes and characters.
Current works still struggle with output images' quality and consistency, and rely on additional semantic information or auxiliary captioning networks.
We first introduce a new sentence representation, which incorporates word information from all story sentences to mitigate the inconsistency problem.
Then, we propose a new discriminator with fusion features to improve image quality and story consistency.
arXiv Detail & Related papers (2022-08-03T21:01:47Z) - Improving Generation and Evaluation of Visual Stories via Semantic
Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform synthesis text-to-image models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.