Fugu-MT Paper Translation (Abstract): A Survey on Video Temporal Grounding with Multimodal Large Language Model

Paper overview: A Survey on Video Temporal Grounding with Multimodal Large Language Model

arxiv url: http://arxiv.org/abs/2508.10922v1
Date: Thu, 07 Aug 2025 08:52:11 GMT
Status: Translation complete
Last updated in system: 2025-08-24 10:27:26.464106
Title: A Survey on Video Temporal Grounding with Multimodal Large Language Model
Title (reference translation): A Study of Video Temporal Grounding Using Multimodal Large Language Models
Authors: Jianlong Wu, Wei Liu, Ye Liu, Meng Liu, Liqiang Nie, Zhouchen Lin, Chang Wen Chen
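Abstract summary: Recent advances in video temporal grounding (VTG) have significantly improved fine-grained video understanding. With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. Despite extensive surveys on general video-language understanding, comprehensive reviews of VTG-MLLMs remain scarce.
Reference score (independently computed attention score): 107.24431595873808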
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The recent advancement in video temporal grounding (VTG) has significantly enhanced fine-grained video understanding, primarily driven by multimodal large language models (MLLMs). With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. They not only achieve competitive performance but also excel in generalization across zero-shot, multi-task, and multi-domain settings. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce. To fill this gap, this survey systematically examines current research on VTG-MLLMs through a three-dimensional taxonomy: 1) the functional roles of MLLMs, highlighting their architectural significance; 2) training paradigms, analyzing strategies for temporal reasoning and task adaptation; and 3) video feature processing techniques, which determine spatiotemporal representation effectiveness. We further discuss benchmark datasets, evaluation protocols, and summarize empirical findings. Finally, we identify existing limitations and propose promising research directions. For additional resources and details, readers are encouraged to visit our repository at https://github.com/ki-lw/Awesome-MLLMs-for-Video-Temporal-Grounding.
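Abstract (reference translation): Recent advances in video temporal grounding (VTG), driven mainly by multimodal large language models (MLLMs), have greatly enhanced fine-grained video understanding. With their superior multimodal comprehension and reasoning abilities, MLLM-based VTG approaches (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. They not only achieve competitive performance but also generalize well across zero-shot, multi-task, and multi-domain settings. Despite extensive surveys on general video-language understanding, comprehensive reviews of VTG-MLLMs remain scarce. To fill this gap, this survey systematically examines current research on VTG-MLLMs through a three-dimensional taxonomy: 1) the functional roles of MLLMs, highlighting their architectural significance; 2) training paradigms, analyzing strategies for temporal reasoning and task adaptation; and 3) video feature processing techniques, which determine the effectiveness of spatiotemporal representations. It further discusses benchmark datasets and evaluation protocols and summarizes empirical findings. Finally, it identifies existing limitations and proposes promising research directions. For additional resources and details, see the repository at https://github.com/ki-lw/Awesome-MLLMs-for-Video-Temporal-Grounding.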

Related papers