Fugu-MT paper translation (abstract): Incentivizing Temporal-Awareness in Egocentric Video Understanding Models

Paper overview: Incentivizing Temporal-Awareness in Egocentric Video Understanding Models

arxiv url: http://arxiv.org/abs/2603.27184v1
Date: Sat, 28 Mar 2026 08:02:59 GMT
Status: translation complete
Last updated in system: 2026-03-31 23:18:44.833759
Title: Incentivizing Temporal-Awareness in Egocentric Video Understanding Models
Authors: Zhiyang Xu, Tian Qin, Bowen Jin, Zhengfeng Lai, Meng Cao, Lifu Huang, Peng Zhang
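Abstract summary: Multimodal large language models (MLLMs) have recently shown strong performance in visual understanding, but they often lack temporal awareness. This deficiency stems in part from training objectives that fail to explicitly reward temporal reasoning and instead rely on frame-level spatial shortcuts. The paper proposes Temporal Global Policy Optimization (TGPO), a reinforcement learning with verifiable rewards (RLVR) algorithm, to strengthen temporal awareness in MLLMs.
Reference score (attention score computed independently by this site): 51.40541228498294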
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal large language models (MLLMs) have recently shown strong performance in visual understanding, yet they often lack temporal awareness, particularly in egocentric settings where reasoning depends on the correct ordering and evolution of events. This deficiency stems in part from training objectives that fail to explicitly reward temporal reasoning and instead rely on frame-level spatial shortcuts. To address this limitation, we propose Temporal Global Policy Optimization (TGPO), a reinforcement learning with verifiable rewards (RLVR) algorithm designed to incentivize temporal awareness in MLLMs. TGPO contrasts model outputs generated from temporally ordered versus shuffled video frames to derive calibrated, globally normalized reward signals that explicitly favor temporally coherent reasoning. Integrated with GRPO and GSPO, TGPO supports cold-start RL training and effectively suppresses spatial shortcut behaviors learned by existing MLLMs. Experiments across five egocentric video benchmarks demonstrate that TGPO consistently improves temporal grounding and causal coherence, outperforming prior RL-based video reasoning approaches. Our results suggest that TGPO offers a simple and scalable pathway toward temporally robust MLLMs for egocentric video understanding.
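The abstract says TGPO contrasts outputs from temporally ordered and shuffled frames and derives globally normalized rewards, but it gives no formula. The sketch below is one possible reading, not the authors' method. It assumes a 0/1 verifiable reward per rollout, uses the mean reward on shuffled frames as a per-prompt baseline, and normalizes over the whole batch rather than within each group. The names `shuffle_frames` and `temporal_advantages` are hypothetical.

```python
import random
from statistics import mean, pstdev


def shuffle_frames(frames, seed=0):
    """Return a temporally shuffled copy of the frame list (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return shuffled


def temporal_advantages(batch, eps=1e-6):
    """Compute advantages from ordered-vs-shuffled reward contrasts.

    batch: list of (ordered_rewards, shuffled_rewards) pairs, one per prompt.
    Each element is a list of verifiable rewards (assumed 0/1 here) for
    rollouts generated from ordered or shuffled frames respectively.
    """
    # Assumption: the raw signal is the ordered reward minus the mean reward
    # the model gets on the same prompt when the frames are shuffled.
    raw = []
    for ordered, shuffled in batch:
        baseline = mean(shuffled)
        raw.append([r - baseline for r in ordered])

    # Assumption: "globally normalized" means one mean and one standard
    # deviation over every rollout in the batch, not one per prompt group.
    flat = [x for group in raw for x in group]
    mu = mean(flat)
    sigma = pstdev(flat)
    return [[(x - mu) / (sigma + eps) for x in group] for group in raw]


# Prompt A is answered equally well from shuffled frames (a spatial shortcut);
# prompt B is answered well only when the frames are in order.
batch = [
    ([1, 1, 0, 1], [1, 1, 0, 1]),
    ([1, 1, 1, 0], [0, 0, 0, 1]),
]
advantages = temporal_advantages(batch)
```

Under these assumptions, correct rollouts on prompt A receive an advantage near zero, while correct rollouts on prompt B receive a positive one, so the update favors answers that depend on frame order.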
