Fugu-MT paper translation (abstract): Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

Paper overview: Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

arxiv url: http://arxiv.org/abs/2606.18441v1
Date: Tue, 16 Jun 2026 19:42:54 GMT
Status: Translation complete
System update date: 2026-06-18 17:16:50.875985
Title: Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs
Title (reference translation): Reasoning as Intersection: Consensus Frame Alignment for Visual Focus in Video MLLMs
Authors: Chengwen Liu, Zhe Huang, Jisheng Dang, Hong Peng, Qi Tian, Tat-Seng Chua
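Abstract summary: Reinforcement learning has improved the reasoning ability of large language models. Applying outcome-only rewards to video multimodal large language models provides limited guidance on which visual evidence should support the answer. This paper introduces Consensus Frame GRPO, a temporal-annotation-free process-level reward framework for evidence-aware video reasoning.
Reference score (independently computed attention score): 81.04673240949074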
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning has improved the reasoning ability of large language models, but applying outcome-only rewards to video multimodal large language models (Video-MLLMs) provides limited guidance on which visual evidence should support the answer. Inspired by multisensory integration, where consistent cues can enhance the salience and reliability of perceptual estimates, we introduce Consensus Frame GRPO (CF-GRPO), a temporal-annotation-free process-level reward framework for evidence-aware video reasoning. CF-GRPO constructs a consensus frame prior from intrinsic video cues, including temporal coverage, scene-transition cues, and query-conditioned visual relevance. It then computes a model-side frame-use score from visual and response representations and optimizes their agreement through the Consensus Frame Reward (CFR). With salience-aware sparse aggregation and distribution sharpening, CFR provides a high-contrast reward signal without requiring human temporal annotations. Experiments show that VideoCFR achieves competitive performance across complex video reasoning benchmarks and improves several metrics over representative Video-MLLM and RL baselines, while the consensus prior provides an interpretable view of the evidence frames emphasized during training. The implementation is available at https://github.com/1Pansy/VideoCFR.
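Abstract (reference translation): Reinforcement learning improves the reasoning ability of large language models, but applying outcome-only rewards to video multimodal large language models (Video-MLLMs) provides limited guidance on which visual evidence should support the answer. Consensus Frame GRPO (CF-GRPO) is a temporal-annotation-free process-level reward framework. CF-GRPO constructs a consensus frame prior from intrinsic video cues, including temporal coverage, scene-transition cues, and query-conditioned visual relevance. It then computes a model-side frame-use score from visual and response representations and optimizes their agreement through the Consensus Frame Reward (CFR). With salience-aware sparse aggregation and distribution sharpening, CFR provides a high-contrast reward signal without requiring human temporal annotations. Experiments show that VideoCFR achieves competitive performance across complex video reasoning benchmarks and improves several metrics over Video-MLLM and RL baselines, while the consensus prior provides an interpretable view of the evidence frames emphasized during training. The implementation is available at https://github.com/1Pansy/VideoCFR.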
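
The abstract describes the pipeline only at a high level, so the following is a minimal Python sketch of the general idea, not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names, combining the three cues by an elementwise product, top-k selection as the "sparse aggregation", power-based sharpening with a temperature, and cosine similarity as the agreement measure. The per-frame scores are made-up toy values.

```python
import math

def normalize(scores):
    """Scale non-negative scores to sum to 1 (uniform if all are zero)."""
    total = sum(scores)
    if total <= 0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]

def top_k_sparse(scores, k):
    """Assumed sparse aggregation: keep the k largest scores, zero the rest."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [s if i in keep else 0.0 for i, s in enumerate(scores)]

def sharpen(dist, temperature=0.5):
    """Assumed sharpening: raise to the power 1/temperature, then renormalize."""
    return normalize([p ** (1.0 / temperature) for p in dist])

def consensus_prior(coverage, transition, relevance, k=2, temperature=0.5):
    """Combine three per-frame cues into one prior (product is an assumption)."""
    combined = [c * t * r for c, t, r in zip(coverage, transition, relevance)]
    return sharpen(normalize(top_k_sparse(combined, k)), temperature)

def agreement_reward(prior, frame_use):
    """Cosine similarity between the prior and model-side frame-use scores."""
    dot = sum(p * q for p, q in zip(prior, frame_use))
    norm = math.sqrt(sum(p * p for p in prior)) * math.sqrt(sum(q * q for q in frame_use))
    return dot / norm if norm > 0 else 0.0

# Toy per-frame cue scores for a 4-frame clip (hypothetical values).
coverage = [0.25, 0.25, 0.25, 0.25]
transition = [0.1, 0.6, 0.2, 0.1]
relevance = [0.1, 0.5, 0.3, 0.1]
prior = consensus_prior(coverage, transition, relevance)

focused_use = [0.05, 0.8, 0.1, 0.05]   # model attends mostly to frame 1
diffuse_use = [0.25, 0.25, 0.25, 0.25]  # model spreads attention evenly

print(agreement_reward(prior, focused_use) > agreement_reward(prior, diffuse_use))
```

In this toy setup, a response whose frame-use scores concentrate on the frames favored by the prior receives a higher reward than one that spreads its attention evenly, which is the kind of contrast the abstract attributes to CFR. The paper and repository should be consulted for the actual definitions.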

Related papers