Fugu-MT Paper Translation (Summary): VFaith: Do Large Multimodal Models Really Reason on Seen Images Rather than Previous Memories?

Paper summary: VFaith: Do Large Multimodal Models Really Reason on Seen Images Rather than Previous Memories?

arxiv url: http://arxiv.org/abs/2506.11571v1
Date: Fri, 13 Jun 2025 08:27:45 GMT
Status: Translation complete
Last updated in system: 2025-06-16 17:50:49.716155
Title: VFaith: Do Large Multimodal Models Really Reason on Seen Images Rather than Previous Memories?
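Title (reference translation): VFaith: Do large multimodal models really reason on the images they see rather than on previous memories?
Authors: Jiachen Yu, Yufei Zhan, Ziheng Wu, Yousong Zhu, Jinqiao Wang, Minghui Qiu
Abstract summary: This paper introduces VFaith-Bench, the first benchmark to evaluate the visual reasoning capabilities of MLLMs. VFaith-Bench contains 755 entries divided into five distinct subsets, along with an additional human-labeled perception task.
Reference score (independently computed attention score): 34.7828249918764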
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Extensive recent work has demonstrated that introducing long CoT can effectively enhance the capabilities of MLLMs to solve complex problems. However, the reasons for the effectiveness of such paradigms remain unclear. It is challenging to analyze quantitatively how much the model's extraction of specific visual cues and its subsequent so-called reasoning during inference contribute to the performance improvements. Therefore, evaluating the faithfulness of MLLMs' reasoning to visual information is crucial. To address this issue, we first present a cue-driven, automatic, and controllable editing pipeline built with the help of GPT-Image-1. It enables the automatic and precise editing of specific visual cues based on an instruction. Furthermore, we introduce VFaith-Bench, the first benchmark to evaluate MLLMs' visual reasoning capabilities and analyze the source of such capabilities with an emphasis on visual faithfulness. Using the designed pipeline, we constructed comparative question-answer pairs by altering the visual cues in images that are crucial for solving the original reasoning problem, thereby changing the question's answer. When similar questions are tested with images that differ in detail, the average accuracy reflects the model's visual reasoning ability, while the difference in accuracy before and after editing the test set images reveals the relationship between the model's reasoning ability and visual perception. We further designed specific metrics to expose this relationship. VFaith-Bench includes 755 entries divided into five distinct subsets, along with an additional human-labeled perception task. We conducted in-depth testing and analysis of existing mainstream flagship models and prominent open-source model series/reasoning models on VFaith-Bench, further investigating the underlying factors of their reasoning capabilities.
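Abstract (reference translation): Extensive recent studies have shown that introducing long CoT can effectively improve the ability of MLLMs to solve complex problems. However, the reasons why these paradigms are effective remain unclear. It is difficult to analyze quantitatively how much the model's extraction of specific visual cues, and the so-called reasoning in the inference process, contribute to the performance improvement. It is therefore important to evaluate the faithfulness of MLLMs' reasoning to visual information. To address this problem, we first present a cue-driven, automatic, and controllable editing pipeline built with the help of GPT-Image-1. It enables automatic and precise editing of specific visual cues based on an instruction. We also introduce VFaith-Bench, the first benchmark to evaluate the visual reasoning capabilities of MLLMs. Using the designed pipeline, we built comparative question-answer pairs by altering the visual cues in images that are essential for solving the original reasoning problem, which changes the answer to the question. When similar questions are tested on images that differ in detail, the average accuracy reflects the model's visual reasoning ability, and the difference in accuracy before and after editing the test set images reveals the relationship between the model's reasoning ability and its visual perception. We further designed specific metrics to expose this relationship. VFaith-Bench contains 755 entries divided into five distinct subsets, along with an additional human-labeled perception task. We conducted in-depth testing and analysis of existing mainstream flagship models and prominent open-source model series/reasoning models on VFaith-Bench, and examined the factors underlying their reasoning capabilities.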
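The abstract describes two quantities: the average accuracy over original and edited images, and the difference in accuracy before and after editing. The sketch below shows one way these could be computed from per-pair results. It is only an illustration: the record layout (`PairResult`), the function name `summarize`, and the key names are assumptions made here, and the paper's own "specific metrics" are not defined in the abstract, so they are not reproduced.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """Outcome for one comparative question-answer pair (hypothetical layout)."""
    correct_original: bool  # model answered correctly on the unedited image
    correct_edited: bool    # model answered correctly on the edited image


def summarize(results: list[PairResult]) -> dict[str, float]:
    """Return per-condition accuracy, their mean, and the drop after editing."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    acc_original = sum(r.correct_original for r in results) / n
    acc_edited = sum(r.correct_edited for r in results) / n
    return {
        "acc_original": acc_original,
        "acc_edited": acc_edited,
        # Mean over both conditions: a proxy for visual reasoning ability.
        "acc_average": (acc_original + acc_edited) / 2,
        # Accuracy lost once the decisive visual cue is altered.
        "acc_drop": acc_original - acc_edited,
    }


if __name__ == "__main__":
    # Made-up outcomes for four pairs, for illustration only.
    demo = [
        PairResult(True, True),
        PairResult(True, False),
        PairResult(True, False),
        PairResult(False, False),
    ]
    print(summarize(demo))
```

Under this reading, a large `acc_drop` would suggest that the model's answers on the original images rely on something other than the visual cue that was edited, such as memorized content.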

Related papers

Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling [0.0]
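Current approaches to visual question answering often struggle with the precision required to interpret scientific data. For the SciVQA 2025 shared task, we focus on answering visual and non-visual questions based on scientific figures from scholarly articles. This study supports the effectiveness of prompt optimization, chain-of-thought reasoning, and ensemble modeling in improving model performance on visual question answering.
Paper reference translation (metadata) (2025-07-08T17:05:42Z)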
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning [72.81576836419373]
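Chain-of-Thought (CoT) reasoning can be used to link visual cues across multiple images. The proposed method, which adapts rule-based reinforcement learning for vision-language models (VLMs), achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.
Paper reference translation (metadata) (2025-06-27T17:59:27Z)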
What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models [88.398085358514]
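DICE is a model designed to detect localized differences between the original and edited images. It is trained with a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. DICE effectively identifies coherent edits and evaluates images generated by different editing models, with a strong correlation to human judgment.
Paper reference translation (metadata) (2025-05-26T18:00:10Z)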
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
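We introduce ReasonMap, a benchmark for evaluating the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap contains high-resolution transit maps from 30 cities in 13 countries, with 1,008 question-answer pairs spanning two question types and three templates. A comprehensive evaluation of 15 MLLMs, including base and reasoning variants, reveals a counterintuitive pattern.
Paper reference translation (metadata) (2025-05-24T12:33:52Z)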
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing [84.16442052968615]
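RISEBench is the first benchmark for Reasoning-Informed ViSual Editing (RISE). RISEBench focuses on four key reasoning categories: temporal, causal, spatial, and logical reasoning. We conducted experiments evaluating nine prominent visual editing models, including both open-source and proprietary models.
Paper reference translation (metadata) (2025-04-03T17:59:56Z)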
Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
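This paper proposes Take-along Visual Conditioning (TVC), a strategy that shifts the image input to critical reasoning stages. TVC helps the model maintain attention to the visual components throughout reasoning. The proposed method achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
Paper reference translation (metadata) (2025-03-17T16:45:12Z)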
The Role of Visual Modality in Multimodal Mathematical Reasoning: Challenges and Insights [26.85150689408895]
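We show that existing multimodal mathematical models make only minimal use of visual information. This is attributed to the dominance of textual information and answer options, which inadvertently guide the model to the answer. When leading models are tested, their failure to detect subtle visual differences suggests limitations in current visual perception capabilities.
Paper reference translation (metadata) (2025-03-06T07:29:33Z)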
VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
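Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. This paper introduces VOILA, a benchmark for evaluating the perceptual understanding and abstract relational reasoning of MLLMs. We find that current MLLMs struggle to understand inter-image relationships and show limited capability in high-level relational reasoning.
Paper reference translation (metadata) (2025-02-25T23:36:19Z)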
Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data [35.229595049396245]
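We propose a novel visual rejection sampling framework to improve the cognition and explainability of LMMs. Our approach begins by synthesizing interpretable answers that include human-verifiable visual features. After each round of fine-tuning, a reward model-free filtering mechanism is applied to select the highest-quality answers.
Paper reference translation (metadata) (2025-02-19T19:05:45Z)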

The related papers list is generated automatically from the titles and abstracts of papers on this site.

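The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).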