Fugu-MT Paper Translation (Summary): Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs

Paper summary: Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs

arxiv url: http://arxiv.org/abs/2605.07568v1
Date: Fri, 08 May 2026 10:40:08 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:39.002601
Title: Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs
Title (reference translation): Tracking the Passage of Time: Diagnosing Temporal Information Flow in Video-LLMs
Authors: Peitao Han, Fei Cheng, Lis K. Pereira, Qianying Liu, Shigeru Kitazawa
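Abstract summary: Video-centric encoders with explicit temporal modeling encode strong temporal signals, whereas frame-centric encoders do not. When video-centric representations are passed through a standard Video-LLM architecture, performance often collapses, revealing a bottleneck in temporal information flow. The authors build a Video-LLM with a temporal-aware video-centric encoder, a time-preserved projector, and AoT supervision, surpassing human performance on AoT$_{PPB}$ with 98.1% accuracy.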
Reference score (independently computed attention score): 7.722366540557897
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The Arrow-of-Time (AoT) task, determining whether a video plays forward or backward by recognizing temporal irreversibility, is one humans solve with near-perfect accuracy, yet frontier Video Large Language Models (Video-LLMs) perform only modestly above chance. This gap raises a key question: do visual backbones fail to encode temporal information, or does the information bottleneck lie elsewhere in the Video-LLM architecture? We address this question by isolating the vision encoder from the Video-LLM and tracing temporal information across the encoder, projector, and LLM. We find that video-centric encoders with explicit temporal modeling encode strong temporal signals, whereas frame-centric encoders do not. However, when video-centric representations are passed through a standard Video-LLM architecture, performance often collapses, revealing a bottleneck in temporal information flow. We identify projector design as a key factor: Q-Former disrupts temporal information, while a time-preserved MLP projection substantially improves the LLM's access to such information. Our layer-wise analysis further shows temporal representation dynamics across encoder layers. Guided by these findings, we build a Video-LLM with a temporal-aware video-centric encoder, a time-preserved projector, and AoT supervision, surpassing human performance on AoT$_{PPB}$ with 98.1% accuracy, and improving broader temporal reasoning tasks by up to 6.0 points on VITATECS-Direction and 1.3 points on TVBench. Our results show that temporal reasoning in Video-LLMs requires both effective temporal encoding and reliable transfer of this information to the LLM.
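Abstract (reference translation): The Arrow-of-Time (AoT) task asks whether a video is playing forward or backward, which requires recognizing temporal irreversibility. Humans solve it with near-perfect accuracy, but frontier Video Large Language Models (Video-LLMs) score only modestly above chance. Do visual backbones fail to encode temporal information, or does the bottleneck lie elsewhere in the Video-LLM architecture? This paper addresses the question by isolating the vision encoder from the Video-LLM and tracing temporal information across the encoder, projector, and LLM. Video-centric encoders with explicit temporal modeling encode strong temporal signals, whereas frame-centric encoders do not. However, when video-centric representations are passed through a standard Video-LLM architecture, performance often collapses, revealing a bottleneck in temporal information flow. Q-Former disrupts temporal information, while a time-preserved MLP projection substantially improves the LLM's access to it. Layer-wise analysis further reveals the dynamics of temporal representations across encoder layers. Based on these results, the authors build a Video-LLM with a temporal-aware video-centric encoder, a time-preserved projector, and AoT supervision; it surpasses human performance on AoT$_{PPB}$ with 98.1% accuracy and improves temporal reasoning tasks by up to 6.0 points on VITATECS-Direction and 1.3 points on TVBench. The results indicate that temporal reasoning in Video-LLMs requires both effective temporal encoding and reliable transfer of that information to the LLM.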
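
The abstract's point that a projector can preserve or destroy temporal order can be illustrated with a toy example. The sketch below is not the paper's implementation: the function names, the tiny feature vectors, and the weight matrix are all hypothetical, mean pooling over time merely stands in for any order-discarding aggregation (it is not a Q-Former), and a single linear map stands in for the MLP. It shows only that an order-invariant summary cannot tell a forward clip from its reversed copy, while per-frame projection keeps the order visible downstream.

```python
# Illustrative sketch only; not the paper's code. All names are hypothetical.
from typing import List

Frame = List[float]   # one feature vector per frame
Video = List[Frame]   # frames in temporal order


def reverse_video(video: Video) -> Video:
    """Build the backward-playing counterpart of a clip (an AoT pair)."""
    return video[::-1]


def mean_pool_over_time(video: Video) -> List[float]:
    """Average frame features over time; frame order is discarded."""
    n = len(video)
    dim = len(video[0])
    return [sum(frame[d] for frame in video) / n for d in range(dim)]


def project_per_frame(video: Video, weights: List[List[float]]) -> Video:
    """Apply the same linear map to every frame, keeping one token per frame in order."""
    return [
        [sum(w * x for w, x in zip(row, frame)) for row in weights]
        for frame in video
    ]


# Toy clip: three frames with 2-dimensional features (made-up values).
forward = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
backward = reverse_video(forward)

# Made-up projection from 2 to 3 dimensions.
weights = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

pooled_fwd = mean_pool_over_time(forward)
pooled_bwd = mean_pool_over_time(backward)
tokens_fwd = project_per_frame(forward, weights)
tokens_bwd = project_per_frame(backward, weights)

# Pooled summaries are identical, so playback direction is unrecoverable.
print("pooled identical:", pooled_fwd == pooled_bwd)
# Per-frame tokens differ between the two clips, so direction is still encoded.
print("tokens identical:", tokens_fwd == tokens_bwd)
```

Under these assumptions the pooled vectors match exactly while the per-frame token sequences are mirror images of each other, which is the minimal condition a downstream model needs in order to distinguish playback direction.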


List of related papers