Fugu-MT Paper Translation (Abstract): Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models

Paper overview: Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models

arxiv url: http://arxiv.org/abs/2605.01662v1
Date: Sun, 03 May 2026 01:30:29 GMT
Status: Translation complete
System update date: 2026-05-05 20:33:49.873562
Title: Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models
Title (reference translation): Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models
Authors: Martin Q. Ma, Willis Guo, Aditya Agrawal, Ankit Gupta, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency
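Abstract summary: Video Active Perception (VAP) is a training-free method for long-form video QA with large vision-language models (VLMs). VAP achieves up to 5.6x higher frame efficiency, measured in frames per question, than standard GPT-4o, Gemini 1.5 Pro, and LLaVA-OV. These findings highlight the potential of active perception to improve the frame effectiveness and efficiency of long-form video QA.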
Reference score (independently computed attention score): 69.48664694117475
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large vision-language models (VLMs) have advanced multimodal tasks such as video question answering (QA). However, VLMs face the challenge of selecting frames effectively and efficiently, as standard uniform sampling is expensive and performance may plateau. Inspired by active perception theory, which posits that models gain information by acquiring data that differs from their expectations, we introduce Video Active Perception (VAP), a training-free method to enhance long-form video QA using VLMs. Our approach treats keyframe selection as data acquisition in active perception and leverages a lightweight text-conditioned video generation model to represent prior world knowledge. Empirically, VAP achieves state-of-the-art zero-shot results on long-form or reasoning video QA datasets such as EgoSchema, NExT-QA, ActivityNet-QA, IntentQA, and CLEVRER, achieving an increase of up to 5.6x frame efficiency by frames per question over standard GPT-4o, Gemini 1.5 Pro, and LLaVA-OV. Moreover, VAP shows stronger reasoning abilities than previous methods and effectively selects keyframes relevant to questions. These findings highlight the potential of leveraging active perception to improve the frame effectiveness and efficiency of long-form video QA.
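Abstract (reference translation): Large vision-language models (VLMs) have advanced multimodal tasks such as video question answering (QA). However, because standard uniform sampling is expensive and performance may plateau, VLMs face the challenge of selecting frames both effectively and efficiently. Inspired by active perception theory, which holds that a model gains information by acquiring data that differs from its expectations, we introduce Video Active Perception (VAP), a training-free method for improving long-form video QA with VLMs. The approach treats keyframe selection as data acquisition in active perception and uses a lightweight text-conditioned video generation model to represent prior world knowledge. Empirically, VAP achieves state-of-the-art zero-shot results on long-form or reasoning video QA datasets such as EgoSchema, NExT-QA, ActivityNet-QA, IntentQA, and CLEVRER, improving frame efficiency, measured in frames per question, by up to 5.6x over standard GPT-4o, Gemini 1.5 Pro, and LLaVA-OV. VAP also shows stronger reasoning abilities than previous methods and effectively selects keyframes relevant to the question. These findings highlight the potential of active perception to improve the frame effectiveness and efficiency of long-form video QA.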
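
The abstract's core idea, that frames differing most from what a prior model expects carry the most information, can be illustrated with a minimal sketch. This is not the authors' implementation: the feature vectors, the Euclidean distance used as a "surprise" score, the top-k rule, and the function names `surprise_scores` and `select_keyframes` are all assumptions made for illustration. In the paper, the expected frames would come from a text-conditioned video generation model, which is not modeled here.

```python
import math


def surprise_scores(observed, expected):
    """Score each frame by the Euclidean distance between its observed
    feature vector and the expected (prior) feature vector.

    Assumption: both inputs are equal-length lists of equal-dimension
    float vectors; how they are produced is outside this sketch.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of frames")
    return [math.dist(o, e) for o, e in zip(observed, expected)]


def select_keyframes(observed, expected, k):
    """Return the indices of the k most surprising frames, in temporal order.

    Hypothetical selection rule: plain top-k on the surprise score.
    """
    scores = surprise_scores(observed, expected)
    # Rank frame indices from most to least surprising.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore temporal order for the selected frames.
    return sorted(ranked[:k])


# Toy example: four frames with 2-D features.
observed = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.2, 0.1]]
expected = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.2, 0.1]]
keyframes = select_keyframes(observed, expected, k=2)
print(keyframes)
```

In this toy input, frames 0 and 3 match the expectation exactly and frames 1 and 2 deviate from it, so the two deviating frames are the ones selected. Passing only those frames to a VLM, rather than a uniform sample of all frames, is the kind of saving that the frames-per-question efficiency measure in the abstract refers to.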

Related papers