Fugu-MT Paper Translation (Abstract): REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Paper abstract: REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

arxiv url: http://arxiv.org/abs/2511.13026v1
Date: Mon, 17 Nov 2025 06:25:12 GMT
Status: Translation complete
Last updated in system: 2025-11-18 14:36:24.719396
Title: REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
Authors: Jiaze Li, Hao Yin, Wenhui Tan, Jingyang Chen, Boshen Xu, Yuxun Qu, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Jian Luan
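Abstract summary: Long-form video understanding involves richer and more dynamic visual input. Purely text-based reflection mechanisms lack cross-modal interaction capabilities. We propose REVISOR, a novel framework for tool-augmented multimodal reflection.
Reference score (attention score computed independently by the site): 23.684146245231457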
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. There are two fundamental reasons for this: (1) long-form video understanding involves richer and more dynamic visual input, meaning that rethinking only the text information is insufficient and a further rethinking process specifically targeting visual information is needed; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we designed the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model's reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances the long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks: VideoMME, LongVideoBench, MLVU, and LVBench.

Related papers