Improving Temporal Understanding Logic Consistency in Video-Language Models via Attention Enhancement
- URL: http://arxiv.org/abs/2510.08138v1
- Date: Thu, 09 Oct 2025 12:22:06 GMT
- Title: Improving Temporal Understanding Logic Consistency in Video-Language Models via Attention Enhancement
- Authors: Chengzhi Li, Heyan Huang, Ping Jian, Zhen Yang, Yaning Tian
- Abstract summary: Large language models (LLMs) often generate self-contradictory outputs. Video-language models (Video-LLMs) fail to provide consistent responses to logically rephrased questions. We propose an attention enhancement method called Temporally Conditioned Attention Sharpening.
- Score: 44.654178762186824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) often generate self-contradictory outputs, which severely impacts their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon has recently drawn the attention of researchers. Specifically, these models fail to provide logically consistent responses to rephrased questions based on their grounding outputs. However, the underlying causes of this phenomenon remain underexplored. In this work, we adopt an interpretability-driven approach to analyze, statistically summarize, and intervene on the potential factors behind the phenomenon. We find that one of the primary reasons for the inconsistency in responses lies in the inability of cross-modal attention heads to effectively distinguish video tokens across different timestamps. To address this, we propose an attention enhancement method called Temporally Conditioned Attention Sharpening (TCAS), which constructs an enhancement objective based on attention distinctions to enhance the model's temporal resolution capability, thereby improving its temporal understanding logic consistency. Experimental results demonstrate that our method significantly enhances the temporal logic consistency of Video-LLMs. Further interpretability analyses reveal that our method indeed improves the temporal discriminability of attention heads, validating our conclusions. Additionally, our method achieves performance improvements in general video temporal grounding tasks, highlighting that temporal logic consistency is a bottleneck in temporal understanding. By enhancing consistency, our method drives significant progress in video temporal understanding.
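The abstract does not spell out the TCAS objective, but the core idea it states, encouraging cross-modal attention heads to distinguish video tokens at different timestamps, can be sketched as an auxiliary loss. The sketch below is a hypothetical illustration, not the paper's actual objective: it pools each head's attention mass by timestamp bucket and penalizes high entropy over timestamps, so heads that spread attention uniformly across time are pushed toward temporally sharper distributions. The function name, the entropy formulation, and the bucketing scheme are all assumptions.

```python
import numpy as np

def attention_sharpening_loss(attn, timestamp_ids):
    """Hypothetical sketch of a TCAS-style sharpening objective.

    attn: (heads, queries, tokens) cross-modal attention weights,
          each row a distribution over video tokens (sums to 1).
    timestamp_ids: (tokens,) integer timestamp bucket per video token.

    Pools attention mass by timestamp, then returns the mean entropy
    over timestamp buckets; minimizing it encourages each head to
    concentrate on few moments, i.e. higher temporal discriminability.
    """
    num_ts = int(timestamp_ids.max()) + 1
    one_hot = np.eye(num_ts)[timestamp_ids]              # (tokens, num_ts)
    mass = attn @ one_hot                                # (heads, queries, num_ts)
    mass = mass / np.clip(mass.sum(-1, keepdims=True), 1e-8, None)
    entropy = -(mass * np.log(np.clip(mass, 1e-8, None))).sum(-1)
    return entropy.mean()
```

Under this sketch, an attention head that puts all mass on tokens from a single timestamp incurs zero loss, while a head that attends uniformly across all timestamps incurs the maximum loss (log of the number of buckets); in training this term would be added, with a weight, to the model's main objective.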
Related papers
- Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol, which structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z) - Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding [56.7383554589569]
Long video understanding is essential for human-like intelligence, enabling coherent perception and reasoning over extended temporal contexts. We propose SpecTemp, a reinforcement learning-based Speculative Temporal reasoning framework. We show that SpecTemp not only maintains competitive accuracy but also significantly accelerates inference compared with existing thinking-with-frames methods.
arXiv Detail & Related papers (2025-11-30T09:27:59Z) - Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models [56.851611990473174]
Reasoning over dynamic visual content remains a central challenge for large language models. We propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. The resulting model, Video-R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks.
arXiv Detail & Related papers (2025-11-28T18:59:58Z) - Causality Matters: How Temporal Information Emerges in Video Language Models [17.570777893613137]
We find that removing or modifying positional encodings in video inputs yields minimal degradation in temporal understanding performance. To explain this behavior, we conduct substantial analysis experiments to trace how temporal information is integrated within the model. We propose two efficiency-oriented strategies: staged cross-modal attention and a temporal exit mechanism for early token truncation.
arXiv Detail & Related papers (2025-08-15T16:33:14Z) - Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning [47.764552063499046]
Large language models (LLMs) have demonstrated significant improvements in contextual understanding. However, their ability to attend to truly critical information during long-context reasoning and generation still lags behind. We introduce a two-stage framework called Learning to Focus (LeaF) to mitigate confounding factors.
arXiv Detail & Related papers (2025-06-09T15:16:39Z) - Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency [59.05753942719665]
We propose a novel temporal robustness benchmark (TemRobBench) to assess the robustness of models. We evaluate 16 mainstream LMMs and find that they exhibit over-reliance on prior knowledge and textual context in adversarial environments. We design panoramic direct preference optimization (PanoDPO) to encourage LMMs to incorporate both visual and linguistic feature preferences simultaneously.
arXiv Detail & Related papers (2025-05-20T14:18:56Z) - Causality Model for Semantic Understanding on Videos [0.0]
This thesis focuses on the domain of semantic video understanding. It explores the potential of causal modeling to advance two fundamental tasks: Video Relation Detection (VidVRD) and Video Question Answering (VideoQA).
arXiv Detail & Related papers (2025-03-16T10:44:11Z) - Interpreting the Repeated Token Phenomenon in Large Language Models [31.1226642501095]
Large Language Models (LLMs) often fail to accurately repeat a single word when prompted to, and instead output unrelated text. We aim to explain the causes of this phenomenon and link it to the concept of "attention sinks". Our investigation identifies the neural circuit responsible for attention sinks and shows how long repetitions disrupt this circuit.
arXiv Detail & Related papers (2025-03-11T21:40:58Z) - On the Identification of Temporally Causal Representation with Instantaneous Dependence [50.14432597910128]
Temporally causal representation learning aims to identify the latent causal process from time series observations.
Most methods require the assumption that the latent causal processes do not have instantaneous relations.
We propose IDOL, an IDentification framework for instantaneOus Latent dynamics.
arXiv Detail & Related papers (2024-05-24T08:08:05Z) - From Heuristic to Analytic: Cognitively Motivated Strategies for Coherent Physical Commonsense Reasoning [66.98861219674039]
Heuristic-Analytic Reasoning (HAR) strategies drastically improve the coherence of rationalizations for model decisions.
Our findings suggest that human-like reasoning strategies can effectively improve the coherence and reliability of PLM reasoning.
arXiv Detail & Related papers (2023-10-24T19:46:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.