VinaBench: Benchmark for Faithful and Consistent Visual Narratives
- URL: http://arxiv.org/abs/2503.20871v3
- Date: Thu, 03 Apr 2025 09:28:19 GMT
- Title: VinaBench: Benchmark for Faithful and Consistent Visual Narratives
- Authors: Silin Gao, Sheryl Mathew, Li Mi, Sepideh Mamooler, Mengjie Zhao, Hiromi Wakaki, Yuki Mitsufuji, Syrielle Montariol, Antoine Bosselut
- Abstract summary: We propose a new benchmark, VinaBench, to address the challenge of generating faithful visual narratives. Our results demonstrate that learning with VinaBench's knowledge constraints effectively improves the faithfulness and cohesion of generated visual narratives.
- Score: 29.111073358773698
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual narrative generation transforms textual narratives into sequences of images illustrating the content of the text. However, generating visual narratives that are faithful to the input text and self-consistent across generated images remains an open challenge, due to the lack of knowledge constraints used for planning the stories. In this work, we propose a new benchmark, VinaBench, to address this challenge. Our benchmark annotates the underlying commonsense and discourse constraints in visual narrative samples, offering systematic scaffolds for learning the implicit strategies of visual storytelling. Based on the incorporated narrative constraints, we further propose novel metrics to closely evaluate the consistency of generated narrative images and the alignment of generations with the input textual narrative. Our results across three generative vision models demonstrate that learning with VinaBench's knowledge constraints effectively improves the faithfulness and cohesion of generated visual narratives.
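The abstract names two evaluation axes: alignment of the generated images with the input text, and consistency across the image sequence. VinaBench's actual metrics are defined over its annotated commonsense and discourse constraints; the sketch below only illustrates the two axes using generic CLIP-style embeddings, and all function names, variable names, and dimensions are illustrative assumptions.

```python
# Illustrative only: VinaBench's real metrics are built on its annotated
# constraints. This sketch scores the same two axes with generic
# CLIP-style embeddings; names and dimensions are assumptions.
import torch
import torch.nn.functional as F

def narrative_scores(text_emb: torch.Tensor, img_emb: torch.Tensor):
    """text_emb, img_emb: (num_panels, dim) embeddings, one row per panel."""
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(img_emb, dim=-1)
    alignment = (t * v).sum(-1).mean()             # image i vs. its text segment
    consistency = (v[:-1] * v[1:]).sum(-1).mean()  # adjacent-image coherence
    return alignment.item(), consistency.item()

# Toy usage with random stand-in embeddings.
align, consist = narrative_scores(torch.randn(5, 512), torch.randn(5, 512))
```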
Related papers
- VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs? [42.362388367152256]
This paper presents a novel approach that leverages recent advancements in multimodal models for the visual storytelling task.
We utilize RoViST and GROOVIST, novel reference-free metrics designed to assess visual storytelling, focusing on visual grounding, coherence, and non-redundancy.
arXiv Detail & Related papers (2025-04-27T14:55:51Z)
- Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning [2.401993998791928]
We propose a framework that trains a lightweight vision-language mapping network to connect modalities.
We introduce a multimodal contrastive objective that also improves visual relevance and story informativeness.
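A generic form of such a multimodal contrastive objective is the symmetric InfoNCE loss sketched below; the paper's exact formulation may differ, and the temperature value here is an assumption.

```python
# Generic symmetric InfoNCE loss; the paper's exact objective may differ.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(img.size(0))       # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```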
arXiv Detail & Related papers (2024-08-12T16:15:32Z)
- ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context [50.572907418430155]
ContextualStory is a framework designed to generate coherent story frames and extend frames for visual storytelling.
We introduce a Storyline Contextualizer to enrich context in storyline embedding, and a StoryFlow Adapter to measure scene changes between frames.
Experiments on the PororoSV and FlintstonesSV datasets demonstrate that ContextualStory significantly outperforms existing SOTA methods in both story visualization and continuation.
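The StoryFlow Adapter is described as measuring scene changes between frames; a crude stand-in for that signal is the cosine distance between consecutive frame embeddings, as sketched below (the real adapter is learned, unlike this heuristic).

```python
# Heuristic stand-in for a learned scene-change signal.
import torch
import torch.nn.functional as F

def scene_change(frame_feats):                # (T, dim) per-frame features
    a = F.normalize(frame_feats[:-1], dim=-1)
    b = F.normalize(frame_feats[1:], dim=-1)
    return 1.0 - (a * b).sum(-1)              # (T-1,) change scores in [0, 2]

changes = scene_change(torch.randn(6, 128))
```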
arXiv Detail & Related papers (2024-07-13T05:02:42Z)
- MagicScroll: Nontypical Aspect-Ratio Image Generation for Visual Storytelling via Multi-Layered Semantic-Aware Denoising [42.20750912837316]
MagicScroll is a progressive diffusion-based image generation framework with a novel semantic-aware denoising process.
It enables fine-grained control over the generated image on object, scene, and background levels with text, image, and layout conditions.
It showcases promising results in aligning with the narrative text, improving visual coherence, and engaging the audience.
arXiv Detail & Related papers (2023-12-18T03:09:05Z)
- Visual Storytelling with Question-Answer Plans [70.89011289754863]
We present a novel framework which integrates visual representations with pretrained language models and planning.
Our model translates the image sequence into a visual prefix, a sequence of continuous embeddings which language models can interpret.
It also leverages a sequence of question-answer pairs as a blueprint plan for selecting salient visual concepts and determining how they should be assembled into a narrative.
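A minimal sketch of the visual-prefix idea follows, assuming an MLP mapper and arbitrary dimensions (neither is guaranteed to match the paper's design): image features are projected to a short sequence of continuous embeddings that can be prepended to the language model's input.

```python
# Hypothetical visual-prefix mapper; the MLP and sizes are assumptions.
import torch
import torch.nn as nn

class VisualPrefixMapper(nn.Module):
    def __init__(self, img_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        self.proj = nn.Sequential(nn.Linear(img_dim, lm_dim * prefix_len),
                                  nn.Tanh())

    def forward(self, img_feats):             # (B, img_dim) image features
        prefix = self.proj(img_feats)         # (B, lm_dim * prefix_len)
        return prefix.view(-1, self.prefix_len, self.lm_dim)

prefix = VisualPrefixMapper()(torch.randn(4, 512))  # -> (4, 10, 768)
```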
arXiv Detail & Related papers (2023-10-08T21:45:34Z)
- Envisioning Narrative Intelligence: A Creative Visual Storytelling Anthology [7.962160810367763]
We present five themes that characterize the variations found in this creative visual storytelling process.
We envision narrative intelligence criteria for computational visual storytelling as: creative, reliable, expressive, grounded, and responsible.
arXiv Detail & Related papers (2023-10-06T18:47:20Z)
- Text-Only Training for Visual Storytelling [107.19873669536523]
We formulate visual storytelling as a visual-conditioned story generation problem.
We propose a text-only training method that separates the learning of cross-modality alignment and story generation.
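One common way to realize such text-only training, sketched below under the assumption of a CLIP-style shared embedding space: condition the story generator on text embeddings during training, and swap in image embeddings from the same space at inference. The bridge layer and dimensions are illustrative, not the paper's exact design.

```python
# Assumed CLIP-style shared space: train on text embeddings, condition on
# image embeddings at inference. The bridge layer is illustrative.
import torch
import torch.nn as nn

clip_dim, lm_dim = 512, 768
to_lm = nn.Linear(clip_dim, lm_dim)           # shared bridge for both modalities

def condition(clip_embedding):                # (B, clip_dim): text OR image
    return to_lm(clip_embedding)              # (B, lm_dim) conditioning vector

# Training uses caption embeddings; inference swaps in photo embeddings,
# so no paired story-image data is needed.
cond = condition(torch.randn(2, clip_dim))
```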
arXiv Detail & Related papers (2023-08-17T09:32:17Z)
- Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models [70.86603627188519]
We focus on a novel yet challenging task: generating a coherent image sequence based on a given storyline, denoted open-ended visual storytelling.
We propose a learning-based auto-regressive image generation model, termed StoryGen, with a novel vision-language context module.
We show that StoryGen can generalize to unseen characters without any optimization, and generate image sequences with coherent content and consistent characters.
arXiv Detail & Related papers (2023-06-01T17:58:50Z)
- Make-A-Story: Visual Memory Conditioned Consistent Story Generation [57.691064030235985]
We propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context.
Our experiments on story generation with the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms the prior state-of-the-art in generating frames with high visual quality, but also models appropriate correspondences between the characters and the background.
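As an illustration of what a visual memory module might look like, the sketch below uses a single cross-attention layer in which the current frame's latent tokens attend over features of previously generated frames; the paper's actual architecture is not specified here, so treat this as an assumption.

```python
# Single cross-attention layer standing in for the visual memory module.
import torch
import torch.nn as nn

dim = 256
memory_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4,
                                    batch_first=True)

def read_memory(current, past_frames):
    # current: (B, N, dim) latent tokens; past_frames: (B, M, dim) memory bank
    attended, _ = memory_attn(query=current, key=past_frames,
                              value=past_frames)
    return current + attended        # inject retrieved context residually

out = read_memory(torch.randn(1, 16, dim), torch.randn(1, 64, dim))
```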
arXiv Detail & Related papers (2022-11-23T21:38:51Z)
- Word-Level Fine-Grained Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images narrating each sentence of a multi-sentence story, with global consistency across dynamic scenes and characters.
Current works still struggle with output image quality and consistency, and rely on additional semantic information or auxiliary captioning networks.
We first introduce a new sentence representation, which incorporates word information from all story sentences to mitigate the inconsistency problem.
Then, we propose a new discriminator with fusion features to improve image quality and story consistency.
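As a hedged illustration of a sentence representation that incorporates word information from all story sentences, the sketch below attention-pools every story word into each sentence embedding; the paper's actual representation may be constructed differently.

```python
# Illustrative attention pooling over all story words; not the paper's design.
import torch
import torch.nn as nn

dim = 128
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

def story_sentence_reps(sent_emb, word_emb):
    # sent_emb: (1, S, dim) per-sentence vectors; word_emb: (1, W, dim) all words
    pooled, _ = attn(query=sent_emb, key=word_emb, value=word_emb)
    return sent_emb + pooled         # (1, S, dim) story-aware sentence reps

reps = story_sentence_reps(torch.randn(1, 5, dim), torch.randn(1, 60, dim))
```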
arXiv Detail & Related papers (2022-08-03T21:01:47Z)
- Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling [86.42719129731907]
We propose to explicitly learn to imagine a storyline that bridges the visual gap.
We train the network to produce a full, plausible story even with missing photo(s).
In experiments, we show that our hide-and-tell scheme and network design are indeed effective at storytelling.
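A minimal sketch of that training scheme follows, assuming hidden photos' features are replaced by a placeholder embedding; the mask token and hide probability are assumptions, not the paper's settings.

```python
# Assumed masking scheme: hidden photos are replaced by a placeholder vector.
import torch

def hide_photos(photo_feats, hide_prob=0.2, mask_token=None):
    # photo_feats: (num_photos, dim); mask_token: (dim,) placeholder embedding
    if mask_token is None:
        mask_token = torch.zeros(photo_feats.size(-1))
    hidden = torch.rand(photo_feats.size(0)) < hide_prob
    out = photo_feats.clone()
    out[hidden] = mask_token          # model must imagine these steps
    return out, hidden

feats, which = hide_photos(torch.randn(5, 512))
```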
arXiv Detail & Related papers (2020-02-03T14:22:18Z)