MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models
- URL: http://arxiv.org/abs/2601.03331v1
- Date: Tue, 06 Jan 2026 17:45:26 GMT
- Title: MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models
- Authors: Yang Shi, Yifeng Xie, Minzhe Guo, Liangsi Lu, Mingxuan Huang, Jingchao Wang, Zhihong Zhu, Boyan Xu, Zhiqi Huang
- Abstract summary: We present MMErroR, a benchmark of 2,013 samples, each embedding a single coherent reasoning error. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation. We evaluate 20 advanced Vision-Language Models; even the best model (Gemini-3.0-Pro) classifies the error correctly in only 66.47% of cases.
- Score: 29.830224745428566
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in Vision-Language Models (VLMs) have improved performance in multi-modal learning, raising the question of whether these models truly understand the content they process. Crucially, can VLMs detect when a reasoning process is wrong and identify its error type? To answer this, we present MMErroR, a multi-modal benchmark of 2,013 samples, each embedding a single coherent reasoning error. These samples span 24 subdomains across six top-level domains, ensuring broad coverage and taxonomic richness. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation that requires models to detect incorrect reasoning and classify the error type within both visual and linguistic contexts. We evaluate 20 advanced VLMs; even the best model (Gemini-3.0-Pro) classifies the error in only 66.47% of cases, underscoring the challenge of identifying erroneous reasoning. Furthermore, the ability to accurately identify errors offers valuable insights into the capabilities of multi-modal reasoning models. Project Page: https://mmerror-benchmark.github.io
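The abstract implies a straightforward scoring protocol: given an image and a reasoning trace that embeds a single error, the model must name the error type, and the headline metric is the fraction of correct classifications. A minimal sketch of that loop follows, assuming a hypothetical sample layout and a placeholder query_vlm call; none of this is taken from the paper's actual code or API.

```python
# Hypothetical sketch of a process-level, error-centric evaluation loop in
# the spirit of MMErroR. The Sample fields and the query_vlm helper are
# assumptions for illustration, not the benchmark's real data format or API.
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str   # the visual context
    reasoning: str    # reasoning trace with one embedded coherent error
    error_type: str   # ground-truth error-type label from the benchmark's taxonomy

def query_vlm(image_path: str, reasoning: str) -> str:
    """Placeholder: send the image and reasoning trace to a VLM and return
    its predicted error-type label (e.g., via some inference API)."""
    raise NotImplementedError

def error_classification_accuracy(samples: list[Sample]) -> float:
    """Fraction of samples whose error type the model names correctly;
    this is the kind of score behind the reported 66.47% for the best model."""
    correct = sum(
        query_vlm(s.image_path, s.reasoning) == s.error_type
        for s in samples
    )
    return correct / len(samples)
```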
Related papers
- Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification [30.86763472476859]
AuditDM is an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence.
Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
arXiv Detail & Related papers (2025-12-18T18:59:57Z) - FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges [85.24983823102262]
We propose a structured methodology for evaluating text-to-image (T2I) models and vision language models (VLMs).
We test whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts.
Our findings suggest that current metrics are insufficient to capture these nuanced errors.
arXiv Detail & Related papers (2025-12-01T19:46:03Z) - RIV: Recursive Introspection Mask Diffusion Vision Language Model [10.955541881166782]
Mask Diffusion-based Vision Language Models (MDVLMs) have achieved remarkable progress in multimodal understanding tasks.
These models are unable to correct errors in generated tokens, meaning they lack self-correction capability.
We propose the Recursive Introspection Mask Diffusion Vision Language Model (RIV), which equips the model with self-correction ability.
arXiv Detail & Related papers (2025-09-28T04:01:46Z) - Measuring Epistemic Humility in Multimodal Large Language Models [17.490955813494693]
We present HumbleBench, a new hallucination benchmark designed to evaluate MLLMs' ability to reject plausible but incorrect answers.
We leverage fine-grained scene graph annotations to extract ground-truth entities and relations, and prompt GPT-4-Turbo to generate multiple-choice questions.
HumbleBench fills a key gap in current evaluation suites, providing a more realistic measure of MLLM reliability in safety-critical settings.
arXiv Detail & Related papers (2025-09-11T17:54:00Z) - Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability [10.607081850023286]
We introduce the Input Scrutiny Ability Evaluation Framework (ISEval), which encompasses seven categories of flawed premises and three evaluation metrics.
Most models struggle to actively detect flawed textual premises without guidance.
These insights underscore the urgent need to enhance LMMs' proactive verification of input validity.
arXiv Detail & Related papers (2025-08-06T02:13:46Z) - MINERVA: Evaluating Complex Video Reasoning [72.12644008002566]
We provide a new video reasoning dataset called MINERVA for modern multimodal models.
Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions.
We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors.
arXiv Detail & Related papers (2025-05-01T17:41:49Z) - Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework [79.40678802098026]
Math Word Problems serve as a crucial benchmark for evaluating Large Language Models' reasoning abilities.
Current error classification methods rely on static and predefined categories.
We propose Error-Aware Prompting (EAP) that incorporates common error patterns as explicit guidance.
arXiv Detail & Related papers (2025-01-26T16:17:57Z) - ProcessBench: Identifying Process Errors in Mathematical Reasoning [62.80402845414901]
We introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning.
ProcessBench consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems.
We conduct extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models.
arXiv Detail & Related papers (2024-12-09T15:11:40Z) - ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection.
ErrorRadar evaluates two sub-tasks: error step identification and error categorization.
It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions.
Results indicate that significant challenges remain, as even the best-performing model, GPT-4o, is still around 10% behind human evaluation.
arXiv Detail & Related papers (2024-10-06T14:59:09Z) - The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z) - Scalable Performance Analysis for Vision-Language Models [26.45624201546282]
Joint vision-language models have shown great performance over a diverse set of tasks.
Our paper introduces a more scalable solution that relies on already annotated benchmarks.
We confirm previous findings that CLIP behaves like a bag of words model and performs better with nouns and verbs.
arXiv Detail & Related papers (2023-05-30T06:40:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.