Fugu-MT Paper Translation (Abstract): VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models

Paper summary: VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2511.18823v1
Date: Mon, 24 Nov 2025 06:57:26 GMT
Status: Translation complete
System update date: 2025-11-25 18:34:25.061935
Title: VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models
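Title (reference translation): VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models
Authors: Fufangchen Zhao, Liao Zhang, Daiqi Shi, Yuanjun Gao, Chen Ye, Yang Cai, Jian Gao, Danfeng Yan
Abstract summary: VideoPerceiver is a video multimodal large language model (VMLLM) that enhances fine-grained perception in video understanding. During supervised fine-tuning, "key-information-missing" videos are constructed by extracting event-action keywords from captions, identifying the corresponding key frames, and replacing them with adjacent frames. VideoPerceiver substantially outperforms state-of-the-art VMLLMs on fine-grained action understanding and rare event captioning benchmarks.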
Reference score (independently computed attention score): 9.896951371033229
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose VideoPerceiver, a novel video multimodal large language model (VMLLM) that enhances fine-grained perception in video understanding, addressing VMLLMs' limited ability to reason about brief actions in short clips or rare transient events in long videos. VideoPerceiver adopts a two-stage training framework. During supervised fine-tuning (SFT), we construct "key-information-missing" videos by extracting event-action keywords from captions, identifying corresponding key frames, and replacing them with adjacent frames. We jointly encode original and modified video tokens with text tokens, aligning intermediate visual representations with keywords via an auxiliary contrastive loss to enhance sensitivity to fine-grained motion cues. In reinforcement learning (RL), both video variants are fed into the model to generate descriptions, and a novel relative reward ensures responses from complete videos outperform those from degraded inputs, explicitly training the model to recover temporally precise action details. We also curate a dataset of 80,000 videos with fine-grained actions and transient events. Experiments show VideoPerceiver substantially outperforms state-of-the-art VMLLMs on fine-grained action understanding and rare event captioning benchmarks, while maintaining strong performance on standard tasks. By prioritizing task-relevant visual features, our work redefines video-language model training for fine-grained perception.
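Abstract (reference translation): We propose VideoPerceiver, a new video multimodal large language model (VMLLM) that enhances fine-grained perception in video understanding. VideoPerceiver adopts a two-stage training framework. During supervised fine-tuning (SFT), "key-information-missing" videos are constructed by extracting event-action keywords from captions, identifying the corresponding key frames, and replacing them with adjacent frames. The original and modified video tokens are jointly encoded with text tokens, and intermediate visual representations are aligned with the keywords through an auxiliary contrastive loss to increase sensitivity to fine-grained motion cues. In reinforcement learning (RL), both video variants are fed to the model to generate descriptions, and a new relative reward ensures that responses from complete videos outperform those from degraded inputs, explicitly training the model to recover temporally precise action details. A dataset of 80,000 videos containing fine-grained actions and transient events is also curated. Experiments show that VideoPerceiver substantially outperforms state-of-the-art VMLLMs on fine-grained action understanding and rare event captioning benchmarks while maintaining strong performance on standard tasks. By prioritizing task-relevant visual features, the work redefines video-language model training for fine-grained perception.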
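
The two mechanisms named in the abstract, key-frame replacement and the relative reward, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `make_key_info_missing` and `relative_reward`, the nearest-neighbour replacement rule, and the margin-based reward form are all assumptions made for illustration; the abstract says only that key frames are replaced with adjacent frames and that responses from complete videos should outperform those from degraded inputs.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def make_key_info_missing(frames: Sequence[Frame], key_idx: Sequence[int]) -> List[Frame]:
    """Hypothetical helper: replace each key frame with the nearest non-key frame.

    Assumes every index in key_idx is a valid position in frames.
    """
    key = set(key_idx)
    non_key = [i for i in range(len(frames)) if i not in key]
    if not non_key:
        raise ValueError("at least one non-key frame is required")
    out = list(frames)
    for i in key:
        # Nearest non-key index; ties resolve to the earlier frame (an assumption).
        j = min(non_key, key=lambda n: (abs(n - i), n))
        out[i] = frames[j]
    return out


def relative_reward(score_full: float, score_degraded: float, margin: float = 0.0) -> float:
    """Hypothetical reward form: positive only when the full-video response
    scores higher than the degraded-video response by more than margin."""
    return max(0.0, score_full - score_degraded - margin)
```

Under these assumptions, a clip whose key frame sits at index 2 keeps its length, but that frame is overwritten by its neighbour at index 1. The reward is zero whenever the description generated from the degraded clip scores at least as well as the one generated from the full clip.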

Related papers