Related papers: Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks

URL: http://arxiv.org/abs/2512.21329v1
Date: Wed, 24 Dec 2025 18:58:04 GMT
Title: Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks
Authors: Xinhe Wang, Jin Huang, Xingjian Zhang, Tianhao Wang, Jiaqi W. Ma
Abstract summary: We introduce a two-stage experimental pipeline that explicitly separates perception and reasoning. We show that the perception capability is the dominant factor underlying the observed performance gap. Our findings underscore the need for evaluation protocols that disentangle perception from reasoning.
Score: 10.06554565520216
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reasoning benchmarks such as the Abstraction and Reasoning Corpus (ARC) and ARC-AGI are widely used to assess progress in artificial intelligence and are often interpreted as probes of core, so-called "fluid" reasoning abilities. Despite their apparent simplicity for humans, these tasks remain challenging for frontier vision-language models (VLMs), a gap commonly attributed to deficiencies in machine reasoning. We challenge this interpretation and hypothesize that the gap arises primarily from limitations in visual perception rather than from shortcomings in inductive reasoning. To verify this hypothesis, we introduce a two-stage experimental pipeline that explicitly separates perception and reasoning. In the perception stage, each image is independently converted into a natural-language description, while in the reasoning stage a model induces and applies rules using these descriptions. This design prevents leakage of cross-image inductive signals and isolates reasoning from perception bottlenecks. Across three ARC-style datasets, Mini-ARC, ACRE, and Bongard-LOGO, we show that the perception capability is the dominant factor underlying the observed performance gap by comparing the two-stage pipeline against standard end-to-end one-stage evaluation. Manual inspection of reasoning traces in the VLM outputs further reveals that approximately 80 percent of model failures stem from perception errors. Together, these results demonstrate that ARC-style benchmarks conflate perceptual and reasoning challenges and that observed performance gaps may overstate deficiencies in machine reasoning. Our findings underscore the need for evaluation protocols that disentangle perception from reasoning when assessing progress in machine intelligence.
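The two-stage design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `describe_grid`, `induce_and_apply`, `RULES`, and `two_stage_predict` are hypothetical, and simple deterministic functions stand in for the VLM calls that would handle perception and reasoning in a real evaluation. The sketch only shows the structural point: every image is described in its own call, and the reasoning step sees text alone.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical stand-in for an image: a small binary grid.
Grid = Tuple[Tuple[int, ...], ...]


def describe_grid(grid: Grid) -> str:
    """Toy perception stage: turn ONE grid into text, e.g. '10/00'.

    In a real setup this would be a single VLM call per image.
    """
    return "/".join("".join(str(cell) for cell in row) for row in grid)


# Toy hypothesis space of rules that operate on descriptions only (assumed).
RULES: Dict[str, Callable[[str], str]] = {
    "identity": lambda d: d,
    "flip_vertical": lambda d: "/".join(reversed(d.split("/"))),
    "invert": lambda d: d.translate(str.maketrans("01", "10")),
}


def induce_and_apply(demos: Sequence[Tuple[str, str]], query: str) -> str:
    """Toy reasoning stage: pick the first rule consistent with all
    demonstration descriptions, then apply it to the query description."""
    for rule in RULES.values():
        if all(rule(x) == y for x, y in demos):
            return rule(query)
    return ""  # no rule in the toy hypothesis space fits


def two_stage_predict(
    describe: Callable[[Grid], str],
    reason: Callable[[Sequence[Tuple[str, str]], str], str],
    demos: Sequence[Tuple[Grid, Grid]],
    query: Grid,
) -> str:
    # Perception: each image is described independently, so no description
    # can carry inductive signal from another image.
    demo_texts = [(describe(x), describe(y)) for x, y in demos]
    query_text = describe(query)
    # Reasoning: operates on the descriptions only, never on the images.
    return reason(demo_texts, query_text)


demos = [(((1, 0), (0, 0)), ((0, 0), (1, 0)))]  # rows swapped top to bottom
query = ((1, 1), (0, 1))
prediction = two_stage_predict(describe_grid, induce_and_apply, demos, query)
```

A one-stage baseline would instead pass all images to a single model call. Comparing accuracy between the two settings is what allows perception errors to be separated from reasoning errors.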

Related papers

Reason-IAD: Knowledge-Guided Dynamic Latent Reasoning for Explainable Industrial Anomaly Detection [85.29900916231655]
Reason-IAD is a knowledge-guided dynamic latent reasoning framework for explainable industrial anomaly detection. Experiments demonstrate that Reason-IAD consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2026-02-10T14:54:17Z)
UReason: Benchmarking the Reasoning Paradox in Unified Multimodal Models [44.0727449598399]
We present UReason, a diagnostic benchmark for reasoning-driven image generation. We observe a consistent Reasoning Paradox: Reasoning traces generally improve performance over direct generation. Our analysis suggests that the bottleneck lies in contextual interference rather than insufficient reasoning capacity.
arXiv Detail & Related papers (2026-02-09T07:17:57Z)
Adversarial Yet Cooperative: Multi-Perspective Reasoning in Retrieved-Augmented Language Models [72.4149653187766]
We propose a Reasoner-Verifier framework named Adversarial Reasoning RAG (ARR). The Reasoner and Verifier engage in reasoning on retrieved evidence and critiquing each other's logic while being guided by process-aware advantage. Experiments on multiple benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2026-01-08T06:57:03Z)
MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models [49.32415342913976]
We introduce MM-CoT, a diagnostic benchmark designed to probe the visual grounding and logical coherence of CoT reasoning in multimodal models. We evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, revealing a sharp discrepancy between generative fluency and true reasoning fidelity.
arXiv Detail & Related papers (2025-12-09T04:13:31Z)
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation [79.17352367219736]
ROVER tests the use of one modality to guide, verify, or refine outputs in the other. ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning.
arXiv Detail & Related papers (2025-11-03T02:27:46Z)
Unifying Deductive and Abductive Reasoning in Knowledge Graphs with Masked Diffusion Model [64.31242163019242]
Deductive and abductive reasoning are critical paradigms for analyzing knowledge graphs. We propose a unified framework for Deductive and Abductive Reasoning in Knowledge graphs, called DARK. We show that DARK achieves state-of-the-art performance on both deductive and abductive reasoning tasks.
arXiv Detail & Related papers (2025-10-13T14:34:57Z)
Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning [96.01617809845396]
Ground-R1 is a reinforcement learning framework that enables grounded visual reasoning without requiring explicit evidence or rationale annotations. Ground-R1 achieves superior performance and exhibits emergent cognitive behaviors such as uncertainty awareness, spatial perception, and iterative refinement.
arXiv Detail & Related papers (2025-05-26T17:51:47Z)
Detection and Mitigation of Hallucination in Large Reasoning Models: A Mechanistic Perspective [11.013059864022667]
Reasoning Hallucinations are logically coherent but factually incorrect reasoning traces. These errors are embedded within structured reasoning, making them more difficult to detect and potentially more harmful. We propose the Reasoning Score, which quantifies the depth of reasoning by measuring the divergence between logits. We also introduce GRPO-R, an enhanced reinforcement learning algorithm that incorporates step-level deep reasoning rewards via potential-based shaping.
arXiv Detail & Related papers (2025-05-19T09:16:40Z)
Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models [36.119299938503936]
Large vision-language models (LVLMs) have shown promising performance on a variety of vision-language tasks. They remain susceptible to hallucinations, generating outputs misaligned with visual content or instructions. We propose reflective instruction tuning, which integrates rationale learning into visual instruction tuning.
arXiv Detail & Related papers (2024-07-16T06:32:45Z)
Visual Abductive Reasoning [85.17040703205608]
Abductive reasoning seeks the likeliest possible explanation for partial observations. We propose a new task and dataset, Visual Abductive Reasoning (VAR), for examining the abductive reasoning ability of machine intelligence in everyday visual situations.
arXiv Detail & Related papers (2022-03-26T10:17:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
