How does longer temporal context enhance multimodal narrative video processing in the brain?
- URL: http://arxiv.org/abs/2602.07570v1
- Date: Sat, 07 Feb 2026 14:34:00 GMT
- Title: How does longer temporal context enhance multimodal narrative video processing in the brain?
- Authors: Prachi Jindal, Anant Khandelwal, Manish Gupta, Bapi S. Raju, Subba Reddy Oota, Tanmoy Chakraborty
- Abstract summary: This study investigates how the temporal context length of video clips and narrative-task prompting shape brain-model alignment during naturalistic movie watching. We find that increasing clip duration substantially improves brain alignment for multimodal large language models (MLLMs). Shorter temporal windows align with perceptual and early language regions, while longer windows preferentially align with higher-order integrative regions.
- Score: 39.57117698934923
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding how humans and artificial intelligence systems process complex narrative videos is a fundamental challenge at the intersection of neuroscience and machine learning. This study investigates how the temporal context length of video clips (3--12 s) and narrative-task prompting shape brain-model alignment during naturalistic movie watching. Using fMRI recordings from participants viewing full-length movies, we examine how brain regions sensitive to narrative context dynamically represent information over varying timescales and how these neural patterns align with model-derived features. We find that increasing clip duration substantially improves brain alignment for multimodal large language models (MLLMs), whereas unimodal video models show little to no gain. Further, shorter temporal windows align with perceptual and early language regions, while longer windows preferentially align with higher-order integrative regions, mirrored by a layer-to-cortex hierarchy in MLLMs. Finally, narrative-task prompts (multi-scene summary, narrative summary, character motivation, and event boundary detection) elicit task-specific, region-dependent brain-alignment patterns and context-dependent shifts in clip-level tuning in higher-order regions. Together, our results position long-form narrative movies as a principled testbed for probing biologically relevant temporal integration and interpretable representations in long-context MLLMs.
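The abstract does not specify how brain-model alignment is computed; a standard approach in this literature, and an assumption here, is a voxelwise encoding model: ridge regression maps clip-level model features to fMRI voxel responses, and alignment is scored as the Pearson correlation between predicted and held-out voxel time series. The sketch below illustrates that pipeline on synthetic data; all dimensions, the regularization grid, and the simulated responses are placeholders rather than values from the paper.

```python
# Minimal sketch of a voxelwise encoding model for "brain alignment",
# assuming the standard ridge-regression pipeline (not confirmed to be
# the paper's exact method). All data here are synthetic placeholders.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder dimensions: 500 fMRI time points (TRs), 768-dim model features,
# 1000 voxels. In practice the features would come from an MLLM layer computed
# over the 3-12 s clips, resampled to the fMRI TR grid.
n_trs, n_feat, n_vox = 500, 768, 1000
X = rng.standard_normal((n_trs, n_feat))        # model features per TR
W = 0.1 * rng.standard_normal((n_feat, n_vox))  # hypothetical ground-truth map
Y = X @ W + rng.standard_normal((n_trs, n_vox)) # simulated voxel responses

# Hold out a contiguous block of time points (shuffle=False) so temporally
# autocorrelated fMRI signal does not leak between train and test.
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, shuffle=False)

# One ridge regression per voxel (sklearn handles the multi-output case),
# with the regularization strength chosen by cross-validation.
enc = RidgeCV(alphas=np.logspace(0, 4, 9)).fit(X_tr, Y_tr)
Y_hat = enc.predict(X_te)

# Alignment scored as the voxelwise Pearson correlation between predicted
# and actual held-out responses.
def pearson_per_voxel(a, b):
    a = (a - a.mean(axis=0)) / a.std(axis=0)
    b = (b - b.mean(axis=0)) / b.std(axis=0)
    return (a * b).mean(axis=0)

alignment = pearson_per_voxel(Y_hat, Y_te)
print(f"median voxel alignment r = {np.median(alignment):.3f}")
```

Under this scheme, repeating the fit with features from different MLLM layers or different clip durations yields the layer-to-cortex and context-length comparisons the abstract describes.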
Related papers
- NarrativeTrack: Evaluating Video Language Models Beyond the Frame [10.244330591706744]
We introduce NarrativeTrack, the first benchmark to evaluate narrative understanding in MLLMs. We decompose videos into constituent entities and examine their continuity via a Compositional Reasoning (CRP) framework. CRP challenges models to advance from temporal persistence to contextual evolution and fine-grained perceptual reasoning.
arXiv Detail & Related papers (2026-01-03T07:12:55Z) - Episodic Memory Representation for Long-form Video Understanding [52.33907540905242]
Large Video Language Models excel at general video understanding but struggle with long-form videos due to context window limits. We introduce Video-EM, a training-free framework inspired by the principles of human memory. Video-EM achieves performance gains of 4-9 percent over respective baselines while utilizing fewer frames.
arXiv Detail & Related papers (2025-08-13T04:33:07Z) - DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding [19.50051728766238]
We propose an innovative video representation method called Dynamic-Image (DynImg). Specifically, we introduce a set of non-key frames as temporal prompts to highlight the spatial areas containing fast-moving objects. During visual feature extraction, these prompts guide the model to pay additional attention to the fine-grained spatial features of these regions.
arXiv Detail & Related papers (2025-07-21T12:50:49Z) - MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding [55.32878803528196]
Video temporal understanding is crucial for multimodal large language models (MLLMs) to reason over events in videos. We propose MUSEG, a novel RL-based method that enhances temporal understanding by introducing timestamp-aware multi-segment grounding. To facilitate effective learning, we design a customized RL training recipe with phased rewards that progressively guides the model toward temporally grounded reasoning.
arXiv Detail & Related papers (2025-05-27T04:50:07Z) - MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., a sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of the visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z) - The Temporal Structure of Language Processing in the Human Brain
Corresponds to The Layered Hierarchy of Deep Language Models [37.605014098041906]
We show that the layered hierarchy of Deep Language Models (DLMs) may be used to model the temporal dynamics of language comprehension in the brain.
Our results reveal a connection between human language processing and DLMs, with the DLM's layer-by-layer accumulation of contextual information mirroring the timing of neural activity in high-order language areas.
arXiv Detail & Related papers (2023-10-11T01:03:42Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote the learning of region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - Visual representations in the human brain are aligned with large language models [7.779248296336383]
We show that large language models (LLMs) are beneficial for modelling the complex visual information extracted by the brain from natural scenes.
We then train deep neural network models to transform image inputs into LLM representations.
arXiv Detail & Related papers (2022-09-23T17:34:33Z) - Dilated Context Integrated Network with Cross-Modal Consensus for
Temporal Emotion Localization in Videos [128.70585652795637]
Temporal emotion localization (TEL) presents three unique challenges compared to temporal action localization.
The emotions have extremely varied temporal dynamics.
The fine-grained temporal annotations are complicated and labor-intensive.
arXiv Detail & Related papers (2022-08-03T10:00:49Z)