NarrativeTrack: Evaluating Video Language Models Beyond the Frame
- URL: http://arxiv.org/abs/2601.01095v1
- Date: Sat, 03 Jan 2026 07:12:55 GMT
- Title: NarrativeTrack: Evaluating Video Language Models Beyond the Frame
- Authors: Hyeonjeong Ha, Jinjin Ge, Bo Feng, Kaixin Ma, Gargi Chakraborty,
- Abstract summary: We introduce NarrativeTrack, the first benchmark to evaluate narrative understanding in MLLMs. We decompose videos into constituent entities and examine their continuity via a Compositional Reasoning Progression (CRP) framework. CRP challenges models to advance from temporal persistence to contextual evolution and fine-grained perceptual reasoning.
- Score: 10.244330591706744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large language models (MLLMs) have achieved impressive progress in vision-language reasoning, yet their ability to understand temporally unfolding narratives in videos remains underexplored. True narrative understanding requires grounding who is doing what, when, and where, maintaining coherent entity representations across dynamic visual and temporal contexts. We introduce NarrativeTrack, the first benchmark to evaluate narrative understanding in MLLMs through fine-grained entity-centric reasoning. Unlike existing benchmarks limited to short clips or coarse scene-level semantics, we decompose videos into constituent entities and examine their continuity via a Compositional Reasoning Progression (CRP), a structured evaluation framework that progressively increases narrative complexity across three dimensions: entity existence, entity changes, and entity ambiguity. CRP challenges models to advance from temporal persistence to contextual evolution and fine-grained perceptual reasoning. A fully automated entity-centric pipeline enables scalable extraction of temporally grounded entity representations, providing the foundation for CRP. Evaluations of state-of-the-art MLLMs reveal that models fail to robustly track entities across visual transitions and temporal dynamics, often hallucinating identity under context shifts. Open-source general-purpose MLLMs exhibit strong perceptual grounding but weak temporal coherence, while video-specific MLLMs capture temporal context yet hallucinate entities' contexts. These findings uncover a fundamental trade-off between perceptual grounding and temporal reasoning, indicating that narrative understanding emerges only from their integration. NarrativeTrack provides the first systematic framework to diagnose and advance temporally grounded narrative comprehension in MLLMs.
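To make the abstract's structure concrete, the sketch below shows one plausible way to represent a CRP-style evaluation item with temporally grounded entity tracks. All class, field, and function names here are illustrative assumptions for exposition; they are not taken from the NarrativeTrack release or the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# The three CRP dimensions named in the abstract.
CRP_DIMENSIONS = ("existence", "changes", "ambiguity")


@dataclass
class EntityTrack:
    """A temporally grounded entity representation for one video entity.

    Each appearance is (start_sec, end_sec, free-text description).
    """
    entity_id: str
    appearances: List[Tuple[float, float, str]] = field(default_factory=list)

    def persists_across(self, gap_start: float, gap_end: float) -> bool:
        """True if the entity appears both before and after the given gap,
        i.e. a model must re-identify it across [gap_start, gap_end]."""
        before = any(end <= gap_start for _, end, _ in self.appearances)
        after = any(start >= gap_end for start, _, _ in self.appearances)
        return before and after


@dataclass
class CRPItem:
    """One benchmark question tied to a video and its entity tracks."""
    video_id: str
    dimension: str  # one of CRP_DIMENSIONS
    question: str
    answer: str
    entities: List[EntityTrack] = field(default_factory=list)

    def __post_init__(self):
        if self.dimension not in CRP_DIMENSIONS:
            raise ValueError(f"unknown CRP dimension: {self.dimension}")


# Hypothetical usage: an "entity changes" item where the same person
# reappears after a scene cut with an appearance change.
person = EntityTrack(
    "person_1",
    [(0.0, 4.5, "man in red coat"), (12.0, 18.0, "same man, coat removed")],
)
item = CRPItem(
    video_id="vid_001",
    dimension="changes",
    question="Is the man in the red coat still present after the cut?",
    answer="yes",
    entities=[person],
)
print(person.persists_across(5.0, 11.0))  # True: the entity spans the gap
```

Under this framing, the "existence" dimension amounts to checks like `persists_across`, while "changes" and "ambiguity" would additionally compare the free-text descriptions of appearances on either side of a gap.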
Related papers
- How does longer temporal context enhance multimodal narrative video processing in the brain? [39.57117698934923]
This study investigates how the temporal context length of video clips and narrative-task prompting shape brain-model alignment during naturalistic movie watching. We find that increasing clip duration substantially improves brain alignment for multimodal large language models (MLLMs). Shorter temporal windows align with perceptual and early language regions, while longer windows preferentially align with higher-order integrative regions.
arXiv Detail & Related papers (2026-02-07T14:34:00Z) - From Consistency to Complementarity: Aligned and Disentangled Multi-modal Learning for Time Series Understanding and Reasoning [12.903267405917388]
We propose MADI, a multi-modal large language model (MLLM) enhanced with fine-grained alignment and disentangled interaction. Experiments on synthetic and real-world benchmarks show that MADI consistently outperforms general-purpose LLMs and time-series-specialized MLLMs.
arXiv Detail & Related papers (2026-01-29T09:13:46Z) - Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search [61.88597038104749]
We present HAVEN, a unified framework for long-video understanding that enables coherent and comprehensive reasoning. We preserve semantic consistency by integrating entity-level representations across visual and auditory streams. We employ an agentic search mechanism to enable dynamic retrieval and reasoning across these layers.
arXiv Detail & Related papers (2026-01-20T08:23:29Z) - Priors in Time: Missing Inductive Biases for Language Model Interpretability [58.07412640266836]
We show that Sparse Autoencoders impose priors that assume independence of concepts across time, implying stationarity. We introduce a new interpretability objective -- Temporal Feature Analysis -- which possesses a temporal inductive bias to decompose representations at a given time into two parts. Our results underscore the need for inductive biases that match the data in designing robust interpretability tools.
arXiv Detail & Related papers (2025-11-03T18:43:48Z) - Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. We propose enhancing interpretability by leveraging intra-modal interaction.
arXiv Detail & Related papers (2025-09-26T14:39:13Z) - Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data [100.5266292850922]
Strefer is a synthetic data generation framework designed to equip Video Large Language Models (Video LLMs) with referring and reasoning capabilities. Strefer produces diverse instruction data using a data engine that pseudo-annotates temporally dense, fine-grained video metadata. Our approach enhances the ability of Video LLMs to interpret spatial and temporal references, fostering more versatile, space-time-aware reasoning essential for real-world AI companions.
arXiv Detail & Related papers (2025-09-03T17:33:20Z) - Causality Matters: How Temporal Information Emerges in Video Language Models [17.570777893613137]
We find that removing or modifying positional encodings in video inputs yields minimal degradation in the performance of temporal understanding. To explain this behavior, we conduct substantial analysis experiments to trace how temporal information is integrated within the model. We propose two efficiency-oriented strategies: staged cross-modal attention and a temporal exit mechanism for early token truncation.
arXiv Detail & Related papers (2025-08-15T16:33:14Z) - VLM4D: Towards Spatiotemporal Awareness in Vision Language Models [66.833085504228]
We introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal reasoning of vision language models (VLMs). Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs. We identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models.
arXiv Detail & Related papers (2025-08-04T06:06:06Z) - Do Language Models Understand Time? [2.290956583394892]
Large language models (LLMs) have revolutionized video-based computer vision applications, including action recognition, anomaly detection, and summarization. This work critically examines the role of LLMs in video processing, with a specific focus on their temporal reasoning capabilities. We analyze challenges posed by existing video datasets, including biases, lack of temporal annotations, and domain-specific limitations that constrain the temporal understanding of LLMs.
arXiv Detail & Related papers (2024-12-18T13:38:06Z) - Temporal Contrastive Learning for Video Temporal Reasoning in Large Vision-Language Models [44.99833362998488]
Temporal Semantic Alignment via Dynamic Prompting (TSADP) is a novel framework that enhances temporal reasoning capabilities. We evaluate TSADP on the VidSitu dataset, augmented with enriched temporal annotations. Our analysis highlights the robustness, efficiency, and practical utility of TSADP, making it a step forward in the field of video-language understanding.
arXiv Detail & Related papers (2024-12-16T02:37:58Z) - MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.