Fugu-MT Paper Translation (Abstract): Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

Paper overview: Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

arxiv url: http://arxiv.org/abs/2510.13251v1
Date: Wed, 15 Oct 2025 07:59:06 GMT
Status: Translation complete
System update date: 2025-10-16 20:13:28.556164
Title: Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs
Title (reference translation): Mapping the Flow: Discovering Hidden Pathways of Information in Video LLMs
Authors: Minji Kim, Taekyung Kim, Bohyung Han
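Abstract summary: The paper investigates the internal information flow of VideoLLMs using mechanistic interpretability. The analysis reveals consistent patterns across VideoQA tasks. These findings provide a blueprint for how VideoLLMs perform temporal reasoning.
Reference score (independently computed attention level): 42.00309718904487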
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video Large Language Models (VideoLLMs) extend the capabilities of vision-language models to spatiotemporal inputs, enabling tasks such as video question answering (VideoQA). Despite recent advances in VideoLLMs, their internal mechanisms on where and how they extract and propagate video and textual information remain less explored. In this study, we investigate the internal information flow of VideoLLMs using mechanistic interpretability techniques. Our analysis reveals consistent patterns across diverse VideoQA tasks: (1) temporal reasoning in VideoLLMs initiates with active cross-frame interactions in early-to-middle layers, (2) followed by progressive video-language integration in middle layers. This is facilitated by alignment between video representations and linguistic embeddings containing temporal concepts. (3) Upon completion of this integration, the model is ready to generate correct answers in middle-to-late layers. (4) Based on our analysis, we show that VideoLLMs can retain their VideoQA performance by selecting these effective information pathways while suppressing a substantial amount of attention edges, e.g., 58% in LLaVA-NeXT-7B-Video-FT. These findings provide a blueprint on how VideoLLMs perform temporal reasoning and offer practical insights for improving model interpretability and downstream generalization. Our project page with the source code is available at https://map-the-flow.github.io
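Abstract (reference translation): Video Large Language Models (VideoLLMs) extend vision-language models to spatiotemporal inputs, enabling tasks such as video question answering (VideoQA). Despite recent progress in VideoLLMs, the internal mechanisms by which they extract and propagate video and text information remain largely unexplored. This study investigates the internal information flow of VideoLLMs using mechanistic interpretability. (1) Temporal reasoning in VideoLLMs begins with active cross-frame interactions in early-to-middle layers, (2) followed by progressive video-language integration in middle layers. This is facilitated by alignment between video representations and linguistic embeddings that contain temporal concepts. (3) Once this integration is complete, the model is ready to generate correct answers in middle-to-late layers. (4) Based on this analysis, VideoLLMs can retain their VideoQA performance in LLaVA-NeXT-7B-Video-FT by selecting effective information pathways while suppressing a substantial number of attention edges. These findings provide a blueprint for how VideoLLMs perform temporal reasoning and offer practical insights for improving model interpretability and downstream generalization. The project page with the source code is available at https://map-the-flow.github.io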
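
Point (4) of the abstract refers to suppressing attention edges while keeping the effective information pathways. The following is a minimal sketch of what suppressing an attention edge can mean in general, not the authors' implementation: the logit of a blocked (query, key) pair is set to negative infinity before the softmax, so that edge carries zero weight and the remaining edges are renormalized. The token layout (`FRAME_1`, `FRAME_2`, `QUESTION`), the uniform scores, and the helper names are hypothetical, and the causal mask of a real decoder is omitted for brevity.

```python
import math


def masked_softmax(scores, blocked):
    """Row-wise softmax over `scores` with the edges in `blocked` suppressed.

    scores:  list of rows; scores[q][k] is the raw attention logit from
             query position q to key position k.
    blocked: set of (q, k) pairs whose attention edge is knocked out.
    """
    weights = []
    for q, row in enumerate(scores):
        # A blocked edge gets a logit of -inf, so exp() maps it to exactly 0.
        logits = [-math.inf if (q, k) in blocked else s for k, s in enumerate(row)]
        top = max(logits)
        if top == -math.inf:
            # Every edge from this query is blocked: no attention mass at all.
            weights.append([0.0] * len(row))
            continue
        exps = [math.exp(value - top) for value in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def block_group_edges(query_positions, key_positions):
    """All (query, key) edges from one token group to another."""
    return {(q, k) for q in query_positions for k in key_positions}


# Hypothetical toy sequence: two tokens per video frame, then one question token.
FRAME_1 = [0, 1]
FRAME_2 = [2, 3]
QUESTION = [4]

# Uniform logits, so any difference in the output comes from the mask alone.
scores = [[0.0] * 5 for _ in range(5)]

# Suppress cross-frame edges: frame-2 queries may not attend to frame-1 keys.
blocked = block_group_edges(FRAME_2, FRAME_1)
weights = masked_softmax(scores, blocked)

# Share of attention edges that were suppressed in this toy example.
suppressed_fraction = len(blocked) / (len(scores) * len(scores[0]))
```

In this toy setup the frame-2 rows place no weight on frame-1 keys and spread their attention over the three remaining positions, while the question row is unchanged. An experiment on a real model would apply such a mask inside selected layers and compare VideoQA accuracy with and without it.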


List of related papers