VFaith: Do Large Multimodal Models Really Reason on Seen Images Rather than Previous Memories?
- URL: http://arxiv.org/abs/2506.11571v2
- Date: Fri, 18 Jul 2025 08:23:14 GMT
- Title: VFaith: Do Large Multimodal Models Really Reason on Seen Images Rather than Previous Memories?
- Authors: Jiachen Yu, Yufei Zhan, Ziheng Wu, Yousong Zhu, Jinqiao Wang, Minghui Qiu
- Abstract summary: We introduce VFaith-Bench, the first benchmark to evaluate MLLMs' visual reasoning capabilities. VFaith-Bench includes 755 entries divided into five distinct subsets, along with an additional human-labeled perception task.
- Score: 34.7828249918764
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent extensive works have demonstrated that introducing long CoT can effectively enhance MLLMs' ability to solve complex problems. However, the reasons for the effectiveness of this paradigm remain unclear. It is challenging to quantify how much the model's extraction of specific visual cues and its subsequent so-called reasoning during inference contribute to the performance improvements. Evaluating the faithfulness of MLLMs' reasoning to visual information is therefore crucial. To address this issue, we first present a cue-driven, automatic, and controllable editing pipeline built on GPT-Image-1, which enables automatic and precise editing of specific visual cues according to an instruction. We further introduce VFaith-Bench, the first benchmark to evaluate MLLMs' visual reasoning capabilities and analyze the source of those capabilities, with an emphasis on visual faithfulness. Using the designed pipeline, we construct comparative question-answer pairs by altering the visual cues in images that are crucial for solving the original reasoning problem, thereby changing the question's answer. By testing similar questions against images that differ in these details, the average accuracy reflects the model's visual reasoning ability, while the difference in accuracy before and after editing the test-set images reveals the relationship between the model's reasoning ability and its visual perception. We further design specific metrics to expose this relationship. VFaith-Bench includes 755 entries divided into five distinct subsets, along with an additional human-labeled perception task. We conduct in-depth testing and analysis of existing mainstream flagship models and prominent open-source model series/reasoning models on VFaith-Bench, further investigating the underlying factors of their reasoning capabilities.
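The abstract's core measurement is a before/after-edit comparison over paired questions. The sketch below is a hypothetical illustration of how such paired results might be aggregated into benchmark-style scores; the field names (`subset`, `correct_original`, `correct_edited`) and metric names (`avg_accuracy`, `accuracy_drop`, `pair_consistency`) are assumptions made for illustration, not the paper's official metric definitions.

```python
# Minimal sketch of the before/after-edit comparison described in the abstract.
# Field and metric names are assumptions, not the paper's official definitions.
from collections import defaultdict
from typing import Dict, List


def vfaith_metrics(results: List[dict]) -> Dict[str, Dict[str, float]]:
    """Each result is one paired entry: the same reasoning question answered on
    the original image and on its cue-edited counterpart (whose answer differs)."""
    per_subset = defaultdict(lambda: {"n": 0, "orig": 0, "edit": 0, "both": 0})
    for r in results:
        s = per_subset[r["subset"]]
        s["n"] += 1
        s["orig"] += int(r["correct_original"])
        s["edit"] += int(r["correct_edited"])
        s["both"] += int(r["correct_original"] and r["correct_edited"])

    report = {}
    for name, s in per_subset.items():
        acc_orig = s["orig"] / s["n"]  # accuracy on unedited images
        acc_edit = s["edit"] / s["n"]  # accuracy after the visual cues were edited
        report[name] = {
            "avg_accuracy": (acc_orig + acc_edit) / 2,  # proxy for visual reasoning ability
            "accuracy_drop": acc_orig - acc_edit,       # gap exposed by editing the cues
            "pair_consistency": s["both"] / s["n"],     # solved both versions of the pair
        }
    return report


# Example: a model that answers the original correctly but misses the edited
# version is likely relying on memorized priors rather than the seen image.
print(vfaith_metrics([
    {"subset": "counting", "correct_original": True, "correct_edited": False},
    {"subset": "counting", "correct_original": True, "correct_edited": True},
]))
```

Under these assumptions, a large `accuracy_drop` or low `pair_consistency` would indicate that a model's answers track memorized priors rather than the edited visual cues.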
Related papers
- Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling [0.0]
Current approaches to visual question answering often struggle with the precision required for scientific data interpretation. We present our approach to the SciVQA 2025 shared task, focusing on answering visual and non-visual questions grounded in scientific figures from scholarly articles. Our findings underscore the effectiveness of prompt optimization, chain-of-thought reasoning, and ensemble modeling in improving the model's ability in visual question answering.
arXiv Detail & Related papers (2025-07-08T17:05:42Z) - MiCo: Multi-image Contrast for Reinforcement Visual Reasoning [72.81576836419373]
Chain-of-Thought (CoT) reasoning can be used to link visual cues across multiple images. We adapt rule-based reinforcement learning for Vision-Language Models (VLMs). Our method achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.
arXiv Detail & Related papers (2025-06-27T17:59:27Z) - What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models [88.398085358514]
DICE is a model designed to detect localized differences between the original and the edited image. It is trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. We demonstrate that DICE effectively identifies coherent edits and evaluates images generated by different editing models with a strong correlation with human judgment.
arXiv Detail & Related papers (2025-05-26T18:00:10Z) - Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z) - Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing [84.16442052968615]
We introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning. We conduct experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models.
arXiv Detail & Related papers (2025-04-03T17:59:56Z) - Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages. TVC helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z) - The Role of Visual Modality in Multimodal Mathematical Reasoning: Challenges and Insights [26.85150689408895]
We show that existing multimodal mathematical models minimally leverage visual information. We attribute this to the dominance of textual information and answer options that inadvertently guide the model to correct answers. In testing leading models, their failure to detect subtle visual differences suggests limitations in current visual perception capabilities.
arXiv Detail & Related papers (2025-03-06T07:29:33Z) - VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z) - Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data [35.229595049396245]
We propose a novel visual rejection sampling framework to improve the cognition and explainability of LMMs. Our approach begins by synthesizing interpretable answers that include human-verifiable visual features. After each round of fine-tuning, we apply a reward model-free filtering mechanism to select the highest-quality interpretable answers.
arXiv Detail & Related papers (2025-02-19T19:05:45Z)