Re:Verse -- Can Your VLM Read a Manga?
- URL: http://arxiv.org/abs/2508.08508v3
- Date: Mon, 18 Aug 2025 04:59:14 GMT
- Title: Re:Verse -- Can Your VLM Read a Manga?
- Authors: Aaditya Baranwal, Madhav Kataria, Naitik Agrawal, Yogesh S Rawat, Shruti Vyas
- Abstract summary: Current Vision Language Models (VLMs) demonstrate a critical gap between surface-level recognition and deep narrative reasoning. We introduce a novel evaluation framework that combines fine-grained multimodal annotation, cross-modal embedding analysis, and retrieval-augmented assessment. We conduct the first systematic study of long-form narrative understanding in VLMs through three core evaluation axes: generative storytelling, contextual dialogue grounding, and temporal reasoning.
- Score: 14.057881684215047
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current Vision Language Models (VLMs) demonstrate a critical gap between surface-level recognition and deep narrative reasoning when processing sequential visual storytelling. Through a comprehensive investigation of manga narrative understanding, we reveal that while recent large multimodal models excel at individual panel interpretation, they systematically fail at temporal causality and cross-panel cohesion, core requirements for coherent story comprehension. We introduce a novel evaluation framework that combines fine-grained multimodal annotation, cross-modal embedding analysis, and retrieval-augmented assessment to systematically characterize these limitations. Our methodology includes (i) a rigorous annotation protocol linking visual elements to narrative structure through aligned light novel text, (ii) comprehensive evaluation across multiple reasoning paradigms, including direct inference and retrieval-augmented generation, and (iii) cross-modal similarity analysis revealing fundamental misalignments in current VLMs' joint representations. Applying this framework to Re:Zero manga across 11 chapters with 308 annotated panels, we conduct the first systematic study of long-form narrative understanding in VLMs through three core evaluation axes: generative storytelling, contextual dialogue grounding, and temporal reasoning. Our findings demonstrate that current models lack genuine story-level intelligence, struggling particularly with non-linear narratives, character consistency, and causal inference across extended sequences. This work establishes both the foundation and practical methodology for evaluating narrative intelligence, while providing actionable insights into the capability of deep sequential understanding of Discrete Visual Narratives beyond basic recognition in Multimodal Models. Project Page: https://re-verse.vercel.app
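The cross-modal similarity analysis in (iii) can be illustrated with a minimal sketch: given joint embeddings of manga panels and their aligned light-novel passages, misalignment shows up as weak diagonal dominance in the pairwise cosine-similarity matrix, i.e. a panel's nearest text is often not its aligned passage. The embeddings below are random toy data standing in for VLM features; this is a hypothetical illustration of the general technique, not the paper's actual pipeline.

```python
import numpy as np

def cosine_similarity_matrix(image_embs, text_embs):
    """Pairwise cosine similarity between two sets of row-vector embeddings."""
    a = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    b = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return a @ b.T

def retrieval_accuracy(sim):
    """Fraction of panels whose most similar text is the aligned one
    (the diagonal entry of the similarity matrix)."""
    return float(np.mean(np.argmax(sim, axis=1) == np.arange(sim.shape[0])))

# Toy embeddings: 308 "panels" (matching the paper's annotated-panel count)
# paired with noisy "aligned text" features.
rng = np.random.default_rng(0)
panels = rng.normal(size=(308, 64))
texts = panels + rng.normal(scale=0.5, size=(308, 64))

sim = cosine_similarity_matrix(panels, texts)
print(f"panel->text retrieval accuracy: {retrieval_accuracy(sim):.2f}")
```

A well-aligned joint embedding space would score near 1.0 on this panel-to-text retrieval check; the misalignments the paper reports correspond to substantially lower diagonal dominance.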
Related papers
- Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval [67.73095846666583]
Visual Document Retrieval (VDR) has emerged as a critical frontier in bridging the gap between unstructured visually rich data and precise information acquisition. This paper presents the first comprehensive survey of the VDR landscape, specifically through the lens of the Multimodal Large Language Model (MLLM) era.
arXiv Detail & Related papers (2026-02-23T15:27:41Z)
- Explain Before You Answer: A Survey on Compositional Visual Reasoning [74.27548620675748]
Compositional visual reasoning has emerged as a key research frontier in multimodal AI. This survey systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception.
arXiv Detail & Related papers (2025-08-24T11:01:51Z)
- KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge [1.5833270109954136]
We propose KnowDR-REC, built upon real-world knowledge, requiring fine-grained multimodal reasoning across text and image. We evaluate 16 state-of-the-art multimodal models on KnowDR-REC, with experimental results showing that existing MLLMs still struggle with knowledge-driven visual grounding tasks.
arXiv Detail & Related papers (2025-08-12T19:43:44Z)
- Detecting Neurocognitive Disorders through Analyses of Topic Evolution and Cross-modal Consistency in Visual-Stimulated Narratives [84.03001845263]
Early detection of neurocognitive disorders (NCDs) is crucial for timely intervention and disease management. We propose two novel dynamic macrostructural approaches to measure cross-modal consistency between speech and visual stimuli. Experimental results validated the efficiency of the proposed approaches in NCD detection, with TITAN achieving superior performance on both the CU-MARVEL-RABBIT corpus and the ADReSS corpus.
arXiv Detail & Related papers (2025-01-07T12:16:26Z)
- A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes [80.20670062509723]
3D dense captioning is an emerging vision-language bridging task that aims to generate detailed descriptions for 3D scenes.
It presents significant potential and challenges due to its closer representation of the real world compared to 2D visual captioning.
Despite the popularity and success of existing methods, there is a lack of comprehensive surveys summarizing the advancements in this field.
arXiv Detail & Related papers (2024-03-12T10:04:08Z)
- Fine-Grained Modeling of Narrative Context: A Coherence Perspective via Retrospective Questions [48.18584733906447]
This work introduces an original and practical paradigm for narrative comprehension, stemming from the observation that individual passages within a narrative tend to be more cohesively related than isolated ones.
We propose a fine-grained modeling of narrative context, by formulating a graph dubbed NarCo, which explicitly depicts task-agnostic coherence dependencies.
arXiv Detail & Related papers (2024-02-21T06:14:04Z)
- Conflicts, Villains, Resolutions: Towards models of Narrative Media Framing [19.589945994234075]
We revisit a widely used conceptualization of framing from the communication sciences which explicitly captures elements of narratives.
We adapt an effective annotation paradigm that breaks a complex annotation task into a series of simpler binary questions.
We explore automatic multi-label prediction of our frames with supervised and semi-supervised approaches.
arXiv Detail & Related papers (2023-06-03T08:50:13Z)
- M-SENSE: Modeling Narrative Structure in Short Personal Narratives Using Protagonist's Mental Representations [14.64546899992196]
We propose the task of automatically detecting prominent elements of the narrative structure by analyzing the role of characters' inferred mental state.
We introduce a STORIES dataset of short personal narratives containing manual annotations of key elements of narrative structure, specifically climax and resolution.
Our model is able to achieve significant improvements in the task of identifying climax and resolution.
arXiv Detail & Related papers (2023-02-18T20:48:02Z)
- Mixed Multi-Model Semantic Interaction for Graph-based Narrative Visualizations [10.193264105560862]
Narrative maps are a visual representation model that can assist analysts to understand narratives.
We present a semantic interaction framework for narrative maps that can support analysts through their sensemaking process.
We find that our SI system can model the analysts' intent and support incremental formalism for narrative maps.
arXiv Detail & Related papers (2023-02-13T15:32:10Z)
- A Focused Study on Sequence Length for Dialogue Summarization [68.73335643440957]
First, we analyze the length differences between existing models' outputs and the corresponding human references.
Second, we identify salient features for summary length prediction by comparing different model settings.
Third, we experiment with a length-aware summarizer and show notable improvement on existing models if summary length can be well incorporated.
arXiv Detail & Related papers (2022-09-24T02:49:48Z)
- SNaC: Coherence Error Detection for Narrative Summarization [73.48220043216087]
We introduce SNaC, a narrative coherence evaluation framework rooted in fine-grained annotations for long summaries.
We develop a taxonomy of coherence errors in generated narrative summaries and collect span-level annotations for 6.6k sentences across 150 book and movie screenplay summaries.
Our work provides the first characterization of coherence errors generated by state-of-the-art summarization models and a protocol for eliciting coherence judgments from crowd annotators.
arXiv Detail & Related papers (2022-05-19T16:01:47Z)
- Knowledge-enriched Attention Network with Group-wise Semantic for Visual Storytelling [39.59158974352266]
Visual storytelling aims at generating an imaginary and coherent story with narrative multi-sentences from a group of relevant images.
Existing methods often generate direct and rigid descriptions of apparent image-based contents, because they are not capable of exploring implicit information beyond images.
To address these problems, a novel knowledge-enriched attention network with group-wise semantic model is proposed.
arXiv Detail & Related papers (2022-03-10T12:55:47Z) - Multi-View Sequence-to-Sequence Models with Conversational Structure for
Abstractive Dialogue Summarization [72.54873655114844]
Text summarization is one of the most challenging and interesting problems in NLP.
This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations.
Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment.
arXiv Detail & Related papers (2020-10-04T20:12:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.