Fugu-MT Paper Translation (Abstract): Improving Temporal Understanding Logic Consistency in Video-Language Models via Attention Enhancement

Paper abstract: Improving Temporal Understanding Logic Consistency in Video-Language Models via Attention Enhancement

arxiv url: http://arxiv.org/abs/2510.08138v1
Date: Thu, 09 Oct 2025 12:22:06 GMT
Status: Translation complete
Last updated in system: 2025-10-10 17:54:15.061057
Title: Improving Temporal Understanding Logic Consistency in Video-Language Models via Attention Enhancement
Title (reference translation): Improving the Temporal Logical Consistency of Video-Language Models via Attention Enhancement
Authors: Chengzhi Li, Heyan Huang, Ping Jian, Zhen Yang, Yaning Tian
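Abstract summary: Large language models (LLMs) often generate self-contradictory outputs. Video-language models (Video-LLMs) fail to provide logically consistent responses to rephrased questions. This paper proposes an attention enhancement method called Temporally Conditioned Attention Sharpening (TCAS).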
Reference score (independently computed attention metric): 44.654178762186824
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) often generate self-contradictory outputs, which severely impacts their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon has recently drawn the attention of researchers. Specifically, these models fail to provide logically consistent responses to rephrased questions based on their grounding outputs. However, the underlying causes of this phenomenon remain underexplored. In this work, we adopt an interpretability-driven approach to analyze, statistically summarize, and intervene on the potential factors behind the phenomenon. We find that one of the primary reasons for the inconsistency in responses lies in the inability of cross-modal attention heads to effectively distinguish video tokens across different timestamps. To address this, we propose an attention enhancement method called Temporally Conditioned Attention Sharpening (TCAS), which constructs an enhancement objective based on attention distinctions to enhance the model's temporal resolution capability, thereby improving its temporal understanding logic consistency. Experimental results demonstrate that our method significantly enhances the temporal logic consistency of Video-LLMs. Further interpretability analyses reveal that our method indeed improves the temporal discriminability of attention heads, validating our conclusions. Additionally, our method achieves performance improvements in general video temporal grounding tasks, highlighting that temporal logic consistency is a bottleneck in temporal understanding. By enhancing consistency, our method drives significant progress in video temporal understanding.
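Abstract (reference translation): Large language models (LLMs) often produce self-contradictory outputs, which seriously affects their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon has been attracting researchers' attention. Specifically, these models do not give logically consistent responses to rephrased questions that are based on their own grounding outputs. The root causes of the phenomenon, however, remain unexplored. This work adopts an interpretability-driven approach to analyze, statistically summarize, and intervene on its potential factors. One of the main causes of inconsistent responses is that cross-modal attention heads cannot effectively distinguish video tokens at different timestamps. The paper therefore proposes Temporally Conditioned Attention Sharpening (TCAS), which builds an enhancement objective from attention distinctions to improve the model's temporal resolution capability and, in turn, the logical consistency of its temporal understanding. Experiments show that the method substantially improves the temporal logic consistency of Video-LLMs. Further interpretability analyses show that the method improves the temporal discriminability of attention heads, supporting the paper's conclusions. The method also improves performance on general video temporal grounding tasks, which highlights temporal logic consistency as a bottleneck in temporal understanding. By improving consistency, the method brings significant progress in video temporal understanding.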

Related papers