Structured Graph Representations for Visual Narrative Reasoning: A Hierarchical Framework for Comics
- URL: http://arxiv.org/abs/2506.10008v1
- Date: Mon, 14 Apr 2025 14:42:19 GMT
- Title: Structured Graph Representations for Visual Narrative Reasoning: A Hierarchical Framework for Comics
- Authors: Yi-Chun Chen
- Abstract summary: This paper presents a hierarchical knowledge graph framework for the structured understanding of visual narratives, focusing on comics. It represents them through integrated knowledge graphs that capture semantic, spatial, and temporal relationships. At the panel level, we construct multimodal graphs that link visual elements such as characters, objects, and actions with corresponding textual components, including dialogue and captions.
- Score: 1.320904960556043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a hierarchical knowledge graph framework for the structured understanding of visual narratives, focusing on multimodal media such as comics. The proposed method decomposes narrative content into multiple levels, from macro-level story arcs to fine-grained event segments. It represents them through integrated knowledge graphs that capture semantic, spatial, and temporal relationships. At the panel level, we construct multimodal graphs that link visual elements such as characters, objects, and actions with corresponding textual components, including dialogue and captions. These graphs are integrated across narrative levels to support reasoning over story structure, character continuity, and event progression. We apply our approach to a manually annotated subset of the Manga109 dataset and demonstrate its ability to support symbolic reasoning across diverse narrative tasks, including action retrieval, dialogue tracing, character appearance mapping, and panel timeline reconstruction. Evaluation results show high precision and recall across tasks, validating the coherence and interpretability of the framework. This work contributes a scalable foundation for narrative-based content analysis, interactive storytelling, and multimodal reasoning in visual media.
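As an illustrative sketch only (the node types and relation names below are hypothetical, not the paper's actual schema), a panel-level multimodal graph linking characters, objects, actions, and dialogue could be represented as a small typed graph:

```python
# Hypothetical sketch of a panel-level multimodal graph: typed nodes for
# visual and textual elements, labeled edges for their relationships.
# Node/edge names are illustrative, not the paper's schema.
from dataclasses import dataclass, field

@dataclass
class PanelGraph:
    """A panel-level graph linking visual and textual elements."""
    nodes: dict = field(default_factory=dict)   # id -> {"type": ..., "label": ...}
    edges: list = field(default_factory=list)   # (src, relation, dst) triples

    def add_node(self, node_id, node_type, label):
        self.nodes[node_id] = {"type": node_type, "label": label}

    def add_edge(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def neighbors(self, node_id, relation=None):
        """Return targets of outgoing edges, optionally filtered by relation."""
        return [d for s, r, d in self.edges
                if s == node_id and (relation is None or r == relation)]

# One panel: a character performing an action while speaking a line.
g = PanelGraph()
g.add_node("c1", "character", "Hero")
g.add_node("a1", "action", "running")
g.add_node("d1", "dialogue", "Wait for me!")
g.add_edge("c1", "performs", "a1")
g.add_edge("c1", "speaks", "d1")

print(g.neighbors("c1", "speaks"))  # ['d1']
```

Queries like "which dialogue does character c1 speak in this panel" then reduce to edge lookups, which is the kind of symbolic reasoning (dialogue tracing, action retrieval) the abstract describes.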
Related papers
- Aether Weaver: Multimodal Affective Narrative Co-Generation with Dynamic Scene Graphs [0.8702432681310401]
Aether Weaver is a novel framework for narrative co-generation that overcomes limitations of multimodal text-to-visual pipelines. Our system concurrently synthesizes textual narratives, dynamic scene graph representations, visual scenes, and affective soundscapes.
arXiv Detail & Related papers (2025-07-29T15:01:31Z) - StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation [0.2455468619225742]
Visual storytelling systems struggle to maintain character identity across frames and link actions to appropriate subjects. We propose StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie images. We create Qwen Storyteller, which performs end-to-end object detection, re-identification, and landmark detection while maintaining consistent object references throughout the story.
arXiv Detail & Related papers (2025-05-15T13:42:14Z) - VisAgent: Narrative-Preserving Story Visualization Framework [5.86192577938549]
VisAgent is a training-free framework designed to comprehend and visualize pivotal scenes within a given story. By considering story distillation, semantic consistency, and contextual coherence, VisAgent employs an agentic workflow. Empirical validation confirms the framework's suitability for practical story visualization applications.
arXiv Detail & Related papers (2025-03-04T08:41:45Z) - ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context [50.572907418430155]
ContextualStory is a framework designed to generate coherent story frames and extend frames for visual storytelling. We introduce a Storyline Contextualizer to enrich context in storyline embeddings and a StoryFlow Adapter to measure scene changes between frames. Experiments on the PororoSV and FlintstonesSV datasets demonstrate that ContextualStory significantly outperforms existing SOTA methods in both story visualization and continuation.
arXiv Detail & Related papers (2024-07-13T05:02:42Z) - SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling [12.560014305032437]
This paper introduces SCO-VIST, a framework representing the image sequence as a graph with objects and relations.
SCO-VIST then takes this graph representing plot points and creates bridges between plot points with semantic and occurrence-based edge weights.
This weighted story graph then yields the storyline as a sequence of events via the Floyd-Warshall algorithm.
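As a minimal sketch of that shortest-path step (the plot points and edge weights below are invented for illustration, not SCO-VIST's actual graph), Floyd-Warshall computes all-pairs shortest paths over the weighted story graph:

```python
# Minimal Floyd-Warshall all-pairs shortest paths over a small weighted
# "story graph". Plot points and weights are hypothetical.
INF = float("inf")

def floyd_warshall(n, edges):
    """edges: dict mapping (u, v) -> weight; returns an n x n distance matrix."""
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for (u, v), w in edges.items():
        dist[u][v] = min(dist[u][v], w)
    # Relax paths through each intermediate node k.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Plot points 0..3; a lower weight means a stronger narrative link.
edges = {(0, 1): 1.0, (1, 2): 2.0, (0, 2): 4.0, (2, 3): 1.0}
d = floyd_warshall(4, edges)
print(d[0][3])  # 4.0, via the path 0 -> 1 -> 2 -> 3
```

The resulting minimum-weight path through the plot points is what gets read off as the event sequence of the storyline.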
arXiv Detail & Related papers (2024-02-01T04:09:17Z) - Panel Transitions for Genre Analysis in Visual Narratives [1.320904960556043]
We present a novel approach to multimodal genre analysis based on comics and manga-style visual narratives.
We highlight some of the limitations and challenges of our existing computational approaches in modeling subjective labels.
arXiv Detail & Related papers (2023-12-14T08:05:09Z) - Conversational Semantic Parsing using Dynamic Context Graphs [68.72121830563906]
We consider the task of conversational semantic parsing over general-purpose knowledge graphs (KGs) with millions of entities and thousands of relation types.
We focus on models which are capable of interactively mapping user utterances into executable logical forms.
arXiv Detail & Related papers (2023-05-04T16:04:41Z) - SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress on the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z) - Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z) - Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)