Causality-Aware Temporal Projection for Video Understanding in Video-LLMs
- URL: http://arxiv.org/abs/2601.01804v2
- Date: Fri, 09 Jan 2026 03:41:24 GMT
- Title: Causality-Aware Temporal Projection for Video Understanding in Video-LLMs
- Authors: Zhengjian Kang, Qi Chen, Rui Liu, Kangtong Mo, Xingyu Zhang, Xiaoyu Deng, Ye Zhang
- Abstract summary: V-CORE is a parameter-efficient framework that introduces explicit temporal ordering constraints for video understanding. With 4-bit QLoRA and a frozen LLM backbone, V-CORE can be trained efficiently on a single consumer GPU. Experiments show that V-CORE achieves strong performance on the challenging NExT-QA benchmark, reaching 61.2% accuracy.
- Score: 14.297733965389959
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent Video Large Language Models (Video-LLMs) have shown strong multimodal reasoning capabilities, yet remain challenged by video understanding tasks that require consistent temporal ordering and causal coherence. Many parameter-efficient Video-LLMs rely on unconstrained bidirectional projectors to model inter-frame interactions, which can blur temporal ordering by allowing later frames to influence earlier representations, without explicit architectural mechanisms to respect the directional nature of video reasoning. To address this limitation, we propose V-CORE, a parameter-efficient framework that introduces explicit temporal ordering constraints for video understanding. V-CORE consists of two key components: (1) Learnable Spatial Aggregation (LSA), which adaptively selects salient spatial tokens to reduce redundancy, and (2) a Causality-Aware Temporal Projector (CATP), which enforces structured unidirectional information flow via block-causal attention and a terminal dynamic summary token acting as a causal sink. This design preserves intra-frame spatial interactions while ensuring that temporal information is aggregated in a strictly ordered manner. With 4-bit QLoRA and a frozen LLM backbone, V-CORE can be trained efficiently on a single consumer GPU. Experiments show that V-CORE achieves strong performance on the challenging NExT-QA benchmark, reaching 61.2% accuracy, and remains competitive across MSVD-QA, MSRVTT-QA, and TGIF-QA, with gains concentrated in temporal and causal reasoning subcategories (+3.5% and +5.2% respectively), directly validating the importance of explicit temporal ordering constraints.
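The two components described in the abstract are easy to picture in code. The sketch below is a minimal, hypothetical illustration in PyTorch, not the authors' implementation: `select_salient_tokens` stands in for LSA's selection of salient spatial tokens (here a simple top-k over learned saliency scores), and `block_causal_mask` builds the attention mask that CATP implies, with bidirectional attention inside each frame, strictly causal attention across frames, and one terminal summary token that can attend to the whole sequence while no frame token attends back to it (a causal sink). All function names, shapes, and the single-token summary layout are assumptions.
```python
# Minimal sketch of LSA-style token selection and a block-causal attention mask.
# Illustrative names and shapes only -- not the V-CORE source code.
import torch


def select_salient_tokens(frame_tokens: torch.Tensor,
                          scorer: torch.nn.Module,
                          k: int) -> torch.Tensor:
    """Keep the k highest-scoring spatial tokens per frame (LSA-like).

    frame_tokens: (T, N, d) -- T frames, N spatial tokens, hidden dim d.
    scorer: any module mapping (T, N, d) -> (T, N, 1) saliency logits,
            e.g. torch.nn.Linear(d, 1).
    """
    scores = scorer(frame_tokens).squeeze(-1)             # (T, N)
    idx = scores.topk(k, dim=-1).indices                  # (T, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, frame_tokens.size(-1))
    return torch.gather(frame_tokens, 1, idx)             # (T, k, d)


def block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean mask (True = may attend) over T*K frame tokens plus one
    terminal summary token appended at the end of the sequence."""
    n = num_frames * tokens_per_frame + 1
    mask = torch.zeros(n, n, dtype=torch.bool)
    for f in range(num_frames):
        end = (f + 1) * tokens_per_frame
        # Frame f attends to frames 0..f: bidirectional within a frame,
        # strictly causal across frames.
        mask[f * tokens_per_frame:end, :end] = True
    # The summary token sees everything; no frame token attends back to it,
    # so it acts as a causal sink that aggregates in temporal order.
    mask[-1, :] = True
    return mask


# Usage with PyTorch's fused attention (q, k, v: (B, H, T*K + 1, d_head)):
# attn_mask = block_causal_mask(num_frames=8, tokens_per_frame=16)
# out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```
Under these assumptions, the lower-block-triangular structure of the mask is what distinguishes this design from the unconstrained bidirectional projectors the abstract criticizes: representations of earlier frames can never be influenced by later ones.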
Related papers
- TV-RAG: A Temporal-aware and Semantic Entropy-Weighted Framework for Long Video Retrieval and Understanding [14.570869250170139]
TV-RAG is a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. By weaving these temporal and semantic signals together, TV-RAG realises a dual-level reasoning routine that can be grafted onto any LVLM without re-training or fine-tuning.
arXiv Detail & Related papers (2025-12-29T14:10:22Z)
- StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA [60.86024022291499]
We introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming video. Our framework generates per-second dense descriptions and constructs temporally-dependent semantic segments through similarity fusion. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference.
arXiv Detail & Related papers (2025-10-29T09:47:38Z)
- ResidualViT for Efficient Temporally Dense Video Encoding [66.57779133786131]
We make three contributions to reduce the cost of computing features for temporally dense tasks. First, we introduce a vision transformer (ViT) architecture, dubbed ResidualViT, that leverages the large temporal redundancy in videos. Second, we propose a lightweight distillation strategy to approximate the frame-level features of the original foundation model.
arXiv Detail & Related papers (2025-09-16T17:12:23Z)
- Video-LLMs with Temporal Visual Screening [59.18455762289321]
Temporal Visual Screening (TVS) is a new task that universally pre-processes video question answering and instruction tuning data. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. Experiments demonstrate that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference).
arXiv Detail & Related papers (2025-08-27T14:33:32Z)
- Causality Matters: How Temporal Information Emerges in Video Language Models [17.570777893613137]
We find that removing or modifying positional encodings in video inputs yields minimal degradation in temporal understanding performance. To explain this behavior, we conduct extensive analysis experiments to trace how temporal information is integrated within the model. We propose two efficiency-oriented strategies: staged cross-modal attention and a temporal exit mechanism for early token truncation.
arXiv Detail & Related papers (2025-08-15T16:33:14Z)
- LeAdQA: LLM-Driven Context-Aware Temporal Grounding for Video Question Answering [10.060267989615813]
We introduce LeAdQA, an innovative approach that bridges these gaps by synergizing causal-aware query refinement with fine-grained visual grounding. Experiments on NExT-QA, IntentQA, and NExT-GQA demonstrate that our method's precise visual grounding substantially enhances the understanding of video-question relationships.
arXiv Detail & Related papers (2025-07-20T01:57:00Z)
- FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers [63.788600404496115]
FullDiT2 is an efficient in-context conditioning framework for general controllability in both video generation and editing tasks. FullDiT2 achieves significant computation reduction and a 2-3x speedup in average time cost per diffusion step.
arXiv Detail & Related papers (2025-06-04T17:57:09Z)
- STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
arXiv Detail & Related papers (2024-11-22T08:33:36Z)