Fugu-MT Paper Translation (Abstract): Video-LLMs with Temporal Visual Screening

Paper overview: Video-LLMs with Temporal Visual Screening

arxiv url: http://arxiv.org/abs/2508.21094v1
Date: Wed, 27 Aug 2025 14:33:32 GMT
Status: Translation complete
Last updated in system: 2025-09-01 19:45:10.818941
Title: Video-LLMs with Temporal Visual Screening
Authors: Zheyu Fan, Jiateng Liu, Yuji Zhang, Zihan Wang, Yi R. Fung, Manling Li, Heng Ji
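Abstract summary: Temporal Visual Screening (TVS) is a new task that pre-processes video question answering and instruction tuning data. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. Experiments show that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference).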
Reference score (independently computed attention score): 53.89952904971981
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Humans naturally perform temporal screening by dragging the progress bar and focusing on salient temporal segments, but current Video Large Language Models (Video-LLMs) struggle to capture fine-grained temporal semantics because of sparse frame sampling and insufficient inter-frame reasoning supervision during training. To address this, and inspired by well-established cognitive science principles, we propose Temporal Visual Screening (TVS), a new task that universally pre-processes video question answering and instruction tuning data by: (1) retaining focus-critical video segments, (2) synchronously reconstructing queries to their most direct form while preserving answer consistency, and (3) keeping the invariance and consistency for any possible answer. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. TVS optimizes the distribution of reasoning burden and cognitive load: during training, it aligns queries with focus-critical visual information; at inference, it enables query-aware segment focus and streamlined query representations. In particular, we curate the first benchmark for TVS and propose ReSimplifyIt, a baseline that outperforms prior approaches on seemingly similar tasks by 0.47 in F1 score on video trimming while achieving competitive query rewriting performance. Experiments show that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference), demonstrating the effectiveness of temporal information screening for improving video-language understanding.
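The abstract reports an F1 score for video trimming and relative gains from adding TVS, but it does not define either metric. The following is a minimal sketch of how such quantities are commonly computed, not the authors' evaluation code. It assumes that trimming F1 is the harmonic mean of temporal precision and recall between one predicted segment and one reference segment, and that relative gain is the improvement divided by the baseline score. The names `Segment`, `trimming_f1`, and `relative_gain` are hypothetical, and the numbers in the usage lines are illustrative only, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A time interval [start, end) in seconds (hypothetical helper type)."""
    start: float
    end: float

    @property
    def length(self) -> float:
        return max(0.0, self.end - self.start)


def overlap(a: Segment, b: Segment) -> float:
    """Length in seconds of the intersection of two segments."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def trimming_f1(pred: Segment, gold: Segment) -> float:
    """Temporal F1 between a predicted and a reference segment.

    Assumption: precision = overlap / predicted length and
    recall = overlap / reference length. The paper may define it differently.
    """
    inter = overlap(pred, gold)
    if inter == 0.0:
        return 0.0
    precision = inter / pred.length
    recall = inter / gold.length
    return 2 * precision * recall / (precision + recall)


def relative_gain(baseline: float, with_tvs: float) -> float:
    """Relative improvement over a non-zero baseline score, as a fraction."""
    return (with_tvs - baseline) / baseline


# Illustrative values only (not taken from the paper).
f1 = trimming_f1(Segment(10.0, 30.0), Segment(20.0, 40.0))  # 0.5
gain = relative_gain(baseline=0.50, with_tvs=0.60)          # about 0.2, i.e. 20%
```

Under these assumptions, a relative gain of 34.6% means the score with TVS is 1.346 times the baseline score, which is different from an absolute gain of 34.6 points.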

Related papers