Fugu-MT Paper Translation (Abstract): Reinforcing Consistency in Video MLLMs with Structured Rewards

Paper abstract: Reinforcing Consistency in Video MLLMs with Structured Rewards

arxiv url: http://arxiv.org/abs/2604.01460v1
Date: Wed, 01 Apr 2026 23:15:34 GMT
Status: Translation complete
Last updated in system: 2026-04-03 14:21:10.097607
Title: Reinforcing Consistency in Video MLLMs with Structured Rewards
Authors: Yihao Quan, Zeru Shi, Jinman Zhao, Ruixiang Tang
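Abstract summary: Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding, yet plausible-looking outputs often suffer from poor visual and temporal grounding. This work studies that failure mode through a compositional consistency audit that decomposes a caption into factual and temporal claims. The training objective integrates three complementary components: (1) an instance-aware scene-graph reward for factual objects, attributes, and relations; (2) a temporal reward for event ordering and repetition; and (3) a video-grounded VQA reward for hierarchical self-verification.
Reference score (attention score computed independently by the site): 14.560061824569333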
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, seemingly plausible outputs often suffer from poor visual and temporal grounding: a model may fabricate object existence, assign incorrect attributes, or collapse repeated events while still producing a globally reasonable caption or answer. We study this failure mode through a compositional consistency audit that decomposes a caption into supporting factual and temporal claims, investigating whether a correct high-level prediction is actually backed by valid lower-level evidence. Our top-down audit reveals that even correct root relational claims often lack reliable attribute and existence support. This indicates that standard sentence-level supervision is a weak proxy for faithful video understanding. Furthermore, when turning to reinforcement learning (RL) for better alignment, standard sentence-level rewards often prove too coarse to accurately localize specific grounding failures. To address this, we replace generic sentence-level rewards with a structured reward built from factual and temporal units. Our training objective integrates three complementary components: (1) an instance-aware scene-graph reward for factual objects, attributes, and relations; (2) a temporal reward for event ordering and repetition; and (3) a video-grounded VQA reward for hierarchical self-verification. Across temporal, general video understanding, and hallucination-oriented benchmarks, this objective yields consistent gains on open-source backbones. These results suggest that structured reward shaping is a practical route to more faithful video understanding.
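The three reward components named in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the reward formulas, so everything below is an assumption made for illustration, not the authors' method: the function names (`set_f1`, `temporal_reward`, `vqa_reward`, `structured_reward`), the use of set F1 over scene-graph triples, the longest-common-subsequence score for event order and repetition, the accuracy-style VQA score, and the equal default weights are all hypothetical.

```python
from typing import Hashable, Sequence, Set, Tuple


def set_f1(predicted: Set[Hashable], reference: Set[Hashable]) -> float:
    """F1 overlap between predicted and reference factual units.

    Assumption: factual units are (subject, relation, object) triples.
    """
    hits = len(predicted & reference)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two event sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def temporal_reward(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Score event ordering and repetition (hypothetical formulation).

    Dividing by the longer sequence penalizes both dropped repetitions
    and invented events.
    """
    if not predicted and not reference:
        return 1.0
    if not predicted or not reference:
        return 0.0
    return lcs_length(predicted, reference) / max(len(predicted), len(reference))


def vqa_reward(answers_correct: Sequence[bool]) -> float:
    """Fraction of verification questions answered correctly (hypothetical)."""
    if not answers_correct:
        return 0.0
    return sum(answers_correct) / len(answers_correct)


def structured_reward(
    scene_graph: float,
    temporal: float,
    vqa: float,
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """Weighted sum of the three components; the weights are an assumption."""
    w_sg, w_t, w_vqa = weights
    return w_sg * scene_graph + w_t * temporal + w_vqa * vqa


# Toy example: one wrong attribute and one collapsed repeated event.
pred_triples = {("man", "holds", "cup"), ("cup", "is", "red")}
ref_triples = {("man", "holds", "cup"), ("cup", "is", "blue")}
pred_events = ["pour", "stir"]
ref_events = ["pour", "stir", "pour"]

reward = structured_reward(
    set_f1(pred_triples, ref_triples),
    temporal_reward(pred_events, ref_events),
    vqa_reward([True, True, False, True]),
)
print(f"structured reward: {reward:.3f}")
```

In this toy case a sentence-level reward could rate the caption as reasonable overall, whereas the per-unit scores expose the wrong attribute and the collapsed repetition separately, which is the kind of localization the abstract argues sentence-level rewards are too coarse to provide.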

Related papers