UReason: Benchmarking the Reasoning Paradox in Unified Multimodal Models
- URL: http://arxiv.org/abs/2602.08336v1
- Date: Mon, 09 Feb 2026 07:17:57 GMT
- Title: UReason: Benchmarking the Reasoning Paradox in Unified Multimodal Models
- Authors: Cheng Yang, Chufan Shi, Bo Shui, Yaokang Wu, Muzi Tao, Huijuan Wang, Ivan Yee Lee, Yong Liu, Xuezhe Ma, Taylor Berg-Kirkpatrick,
- Abstract summary: We present UReason, a diagnostic benchmark for reasoning-driven image generation. We observe a consistent Reasoning Paradox: reasoning traces generally improve performance over direct generation. Our analysis suggests that the bottleneck lies in contextual interference rather than insufficient reasoning capacity.
- Score: 44.0727449598399
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To elicit capabilities for addressing complex and implicit visual requirements, recent unified multimodal models increasingly adopt chain-of-thought reasoning to guide image generation. However, the actual effect of reasoning on visual synthesis remains unclear. We present UReason, a diagnostic benchmark for reasoning-driven image generation that evaluates whether reasoning can be faithfully executed in pixels. UReason contains 2,000 instances across five task families: Code, Arithmetic, Spatial, Attribute, and Text reasoning. To isolate the role of reasoning traces, we introduce an evaluation framework comparing direct generation, reasoning-guided generation, and de-contextualized generation which conditions only on the refined prompt. Across eight open-source unified models, we observe a consistent Reasoning Paradox: Reasoning traces generally improve performance over direct generation, yet retaining intermediate thoughts as conditioning context often hinders visual synthesis, and conditioning only on the refined prompt yields substantial gains. Our analysis suggests that the bottleneck lies in contextual interference rather than insufficient reasoning capacity. UReason provides a principled testbed for studying reasoning in unified models and motivates future methods that effectively integrate reasoning for visual generation while mitigating interference.
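The abstract's evaluation framework compares three conditioning modes for the same prompt. The sketch below illustrates that comparison under assumed interfaces: the `model.reason` / `model.generate` methods, the `scorer` callable, and all names are hypothetical stand-ins, not the authors' actual API.

```python
# Hypothetical sketch of the three conditioning modes UReason compares.
# The model interface (reason/generate) is an assumption for illustration.

def reasoning_modes(model, prompt):
    """Return the conditioning text used under each evaluation mode."""
    trace, refined = model.reason(prompt)  # chain-of-thought trace + refined prompt
    return {
        # 1. Direct generation: condition on the raw prompt only.
        "direct": prompt,
        # 2. Reasoning-guided: keep intermediate thoughts in the context.
        "reasoning_guided": f"{prompt}\n{trace}\n{refined}",
        # 3. De-contextualized: condition only on the refined prompt.
        "decontextualized": refined,
    }

def evaluate(model, prompt, scorer):
    """Generate one image per mode and score each against the original prompt."""
    return {
        mode: scorer(model.generate(cond), prompt)
        for mode, cond in reasoning_modes(model, prompt).items()
    }
```

Under this framing, the paper's Reasoning Paradox corresponds to the `decontextualized` mode outscoring `reasoning_guided`, even though both benefit from the same reasoning step.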
Related papers
- ReasonAny: Incorporating Reasoning Capability to Any Model via Simple and Effective Model Merging [46.06799235021118]
We propose a novel merging framework that resolves the reasoning-domain performance collapse through Contrastive Gradient Identification. Experiments across safety, biomedicine, and finance domains show that ReasonAny effectively synthesizes "Reasoning + X" capabilities.
arXiv Detail & Related papers (2026-01-09T06:19:00Z) - Adversarial Yet Cooperative: Multi-Perspective Reasoning in Retrieved-Augmented Language Models [72.4149653187766]
We propose a Reasoner-Verifier framework named Adversarial Reasoning RAG (ARR). The Reasoner and Verifier engage in reasoning on retrieved evidence and critique each other's logic while being guided by process-aware advantage. Experiments on multiple benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2026-01-08T06:57:03Z) - Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol, which structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z) - ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation [79.17352367219736]
ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning, testing the use of one modality to guide, verify, or refine outputs in the other.
arXiv Detail & Related papers (2025-11-03T02:27:46Z) - Unifying Deductive and Abductive Reasoning in Knowledge Graphs with Masked Diffusion Model [64.31242163019242]
Deductive and abductive reasoning are critical paradigms for analyzing knowledge graphs. We propose a unified framework for Deductive and Abductive Reasoning in Knowledge graphs, called DARK. We show that DARK achieves state-of-the-art performance on both deductive and abductive reasoning tasks.
arXiv Detail & Related papers (2025-10-13T14:34:57Z) - A Survey on Latent Reasoning [100.54120559169735]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities. Chain-of-thought (CoT) reasoning that verbalizes intermediate steps limits the model's expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model's continuous hidden state.
arXiv Detail & Related papers (2025-07-08T17:29:07Z) - More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models [43.465268635499754]
Test-time compute has empowered large language models to generate extended reasoning chains. As generations become longer, models tend to drift away from image-grounded content and rely more heavily on language priors.
arXiv Detail & Related papers (2025-05-23T05:08:40Z)