Fugu-MT Paper Translation (Abstract): Video Understanding: Through A Temporal Lens

Paper overview: Video Understanding: Through A Temporal Lens

arxiv url: http://arxiv.org/abs/2602.00683v1
Date: Sat, 31 Jan 2026 12:01:09 GMT
Status: Translation complete
Last updated in system: 2026-02-03 19:28:33.335543
Title: Video Understanding: Through A Temporal Lens
Title (reference translation): Video Understanding: Seen Through a Temporal Lens
Authors: Thong Thanh Nguyen
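Abstract summary: This thesis addresses the central question of how to leverage temporal relations among video elements to advance video understanding. It presents (1) an automatic annotation framework that uses large vision-language models, (2) a parameter-efficient fine-tuning strategy for capturing temporal dynamics in low-data regimes, (3) the integration of state space layers for efficient long-form video modeling, and (4) a novel contrastive learning framework that explicitly models fine-grained relations between motions and video moments.
Reference score (independently calculated attention level): 5.153774021264937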
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This thesis explores the central question of how to leverage temporal relations among video elements to advance video understanding. Addressing the limitations of existing methods, the work presents a five-fold contribution: (1) an automatic annotation framework that utilizes large vision-language models and a noise-robust contrastive learning objective with a subtractive angular margin; (2) a parameter-efficient fine-tuning strategy using "recurrent adapters" to capture temporal dynamics in low-data regimes; (3) the integration of State Space Layers (SSL) for efficient long-form video modeling, supported by the introduction of two new long-term benchmarks for egocentric and feature-length content; (4) a novel contrastive learning framework designed to explicitly model fine-grained relations between motions and video moments; and (5) a comprehensive empirical study on Large Vision-Language Models (LVLMs) that identifies the visual-language interface as a bottleneck for temporal reasoning, leading to a new "temporal-oriented recipe" for upscaled video understanding. Collectively, these contributions demonstrate that explicit temporal modeling significantly enhances a model's ability to represent and reason about the fluid nature of video content.
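Abstract (reference translation): This thesis addresses the central question of how to leverage temporal relations among video elements to advance video understanding. To address the limitations of existing methods, it proposes (1) an automatic annotation framework that uses large vision-language models and a noise-robust contrastive learning objective, (2) a parameter-efficient fine-tuning strategy that uses "recurrent adapters" to capture temporal dynamics in low-data regimes, (3) the integration of State Space Layers (SSL) for efficient long-form video modeling, (4) a novel contrastive learning framework that explicitly models fine-grained relations between motions and video moments, and (5) a comprehensive empirical study of Large Vision-Language Models (LVLMs). These contributions show that explicit temporal modeling significantly improves a model's ability to represent and reason about the fluid nature of video content.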

List of related papers