Fugu-MT Paper Translation (Abstract): MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints

Paper overview: MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints

arxiv url: http://arxiv.org/abs/2603.20194v1
Date: Fri, 20 Mar 2026 17:59:56 GMT
Status: Translation complete
Last updated in system: 2026-03-23 19:48:39.290024
Title: MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints
Title (reference translation): MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models Using Text and Visual Hints
Authors: Yu Qi, Xinyi Xu, Ziyu Guo, Siyuan Ma, Renrui Zhang, Xinyan Chen, Ruichuan An, Ruofan Xing, Jiayi Zhang, Haojie Huang, Pheng-Ann Heng, Jonathan Tremblay, Lawson L. S. Wong
Abstract summary: MME-CoF-Pro is a benchmark for evaluating reasoning coherence in video models. It contains 303 samples across 16 categories, ranging from visual logical to scientific reasoning.
Reference score (attention score computed by this site): 95.27042253462963
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video generative models show emerging reasoning behaviors. For reliable deployment, it is essential that generated events remain causally consistent across frames, a property we define as reasoning coherence. To address the lack of reasoning coherence evaluation in the literature, we propose MME-CoF-Pro, a comprehensive video reasoning benchmark to assess reasoning coherence in video models. Specifically, MME-CoF-Pro contains 303 samples across 16 categories, ranging from visual logical to scientific reasoning. It introduces Reasoning Score as an evaluation metric for assessing the necessary process-level intermediate reasoning steps, and includes three evaluation settings, (a) no hint, (b) text hint, and (c) visual hint, enabling a controlled investigation into the underlying mechanisms of reasoning hint guidance. Evaluation results on 7 open- and closed-source video models reveal insights including: (1) Video generative models exhibit weak reasoning coherence, decoupled from generation quality. (2) Text hints boost apparent correctness but often cause inconsistency and hallucinated reasoning. (3) Visual hints benefit structured perceptual tasks but struggle with fine-grained perception. Website: https://video-reasoning-coherence.github.io/
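Abstract (reference translation): Video generative models show emerging reasoning behaviors. For reliable deployment, it is essential that generated events remain causally consistent across frames. To address the lack of reasoning coherence evaluation, we propose MME-CoF-Pro, a comprehensive video reasoning benchmark for assessing reasoning coherence in video models. Specifically, MME-CoF-Pro contains 303 samples across 16 categories, ranging from visual logical to scientific reasoning. It introduces Reasoning Score as an evaluation metric for the necessary process-level intermediate reasoning steps and includes three evaluation settings, (a) no hint, (b) text hint, and (c) visual hint, enabling a controlled investigation of the mechanisms underlying reasoning hint guidance. (1) Video generative models exhibit weak reasoning coherence, decoupled from generation quality. (2) Text hints boost apparent correctness but often cause inconsistency and hallucinated reasoning. (3) Visual hints benefit structured perceptual tasks but struggle with fine-grained perception. Website: https://video-reasoning-coherence.github.io/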

Related papers

Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning [14.945921705882725]
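This work bridges the gap between perception and generation in MLLM video understanding and provides an interpretable, faithful reasoning paradigm for video QA applications. Inspired by hierarchical human visual cognition, we propose ClueNet.
Paper reference translation (metadata) (2026-03-16T09:15:12Z)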
VIPER: Process-aware Evaluation for Generative Video Reasoning [64.86465792516658]
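We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physical, and planning reasoning. Experiments show that state-of-the-art video models achieve only about 20% POC@1.0.
Paper reference translation (metadata) (2025-12-31T16:31:59Z)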
Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models [56.851611990473174]
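Reasoning over dynamic visual content remains a central challenge for large language models. We propose a reinforcement learning method that improves both temporal precision and reasoning consistency. The resulting model, Video R2, consistently improves TAC, VAS, and accuracy across multiple benchmarks.
Paper reference translation (metadata) (2025-11-28T18:59:58Z)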
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark [124.00111584020834]
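We conduct an empirical study of whether video models are ready to serve as zero-shot reasoners, focusing on the popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic.
Paper reference translation (metadata) (2025-10-30T17:59:55Z)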
BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception [67.89135437537179]
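We introduce BLINK-Twice, a vision-centric reasoning benchmark. Rather than relying on external knowledge, our tasks require models to reason from visual content alone. Compared with prior perception benchmarks, it goes beyond shallow perception and requires fine-grained observation and analytical reasoning.
Paper reference translation (metadata) (2025-10-10T13:14:13Z)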
CoRGI: Verified Chain-of-Thought Reasoning with Post-hoc Visual Grounding [1.6257248483123767]
CoRGI (Chain of Reasoning with Grounded Insights) is a framework that improves reasoning reliability through post-hoc verification of chain-of-thought outputs.
Paper reference translation (metadata) (2025-08-01T07:17:12Z)
REFINER: Reasoning Feedback on Intermediate Representations [47.36251998678097]
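We introduce REFINER, a framework for fine-tuning language models to generate intermediate reasoning. REFINER works by interacting with a critic model that provides automated feedback on the reasoning. Empirical evaluations show significant improvements over baseline LMs of comparable scale.
Paper reference translation (metadata) (2023-04-04T15:57:28Z)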

The related papers list is generated automatically from the titles and abstracts of papers on this site.
